[ https://issues.apache.org/jira/browse/HADOOP-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629080#action_12629080 ]
Amar Kamat commented on HADOOP-3245:
------------------------------------

One comment on the patch.

_Approach:_ The way history renaming is done in this patch is as follows:
- Given the job-id, job-name and the user-name, try to find a file in the history folder that matches the pattern jt-hostname_[0-9]*_jobid_jobname_username.
- If any file matches the pattern, say file _f_, use _f.recover_ as the new history file. If _f.recover_ is the file being recovered, rename _f.recover_ to _f_ and use _f.recover_ as the new history file.
- On successful recovery, delete _f_.
- On job completion, rename _f.recover_ to _f_.
- If the jobtracker restarts in between, use the older file as the file for recovery.

_Problem:_ With trunk, only one DFS access is made while starting the history logging for a job. With this patch there will be four DFS accesses:
- Check if the job has a history _file_ _[false for new jobs]_
- Check if _file_ exists _[false for new jobs]_
- Check if _file.recover_ exists _[false for new jobs]_
- Open _file_ for logging

I think it makes more sense to create a new history file upon every restart. Before starting the recovery process, delete all the history files related to the job except the oldest one. Note that the history filename has a timestamp in it, so detecting the oldest file is easy.

_Example:_ Say the job started with timestamp t1. The job history filename would be _hostname_t1_jobid_jobname_username_. Upon restart, delete all the files related to the job except the oldest one. The new filename would be _hostname_t2_jobid_jobname_username_, and _hostname_t1_jobid_jobname_username_ would be used as the source for recovery. If the jobtracker dies while recovering, there will be two history files for the job; on the next restart, delete _hostname_t2_jobid_jobname_username_ and again use _hostname_t1_jobid_jobname_username_ for recovery. If the recovery is successful, delete _hostname_t1_jobid_jobname_username_ just to make sure that the latest history file will be used upon the next restart. There is no renaming and no temp file involved in this approach. Note that at any given time there will be at most two history files per job. A rough sketch of this file-selection logic is given after the issue details below.

> Provide ability to persist running jobs (extend HADOOP-1876)
> ------------------------------------------------------------
>
>                 Key: HADOOP-3245
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3245
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Devaraj Das
>            Assignee: Amar Kamat
>         Attachments: HADOOP-3245-v2.5.patch, HADOOP-3245-v2.6.5.patch,
> HADOOP-3245-v2.6.9.patch, HADOOP-3245-v4.1.patch, HADOOP-3245-v5.13.patch,
> HADOOP-3245-v5.14.patch, HADOOP-3245-v5.26.patch,
> HADOOP-3245-v5.30-nolog.patch, HADOOP-3245-v5.31.3-nolog.patch,
> HADOOP-3245-v5.33.1.patch, HADOOP-3245-v5.35.3-no-log.patch
>
> This could probably extend the work done in HADOOP-1876. This feature can be applied for things like jobs being able to survive jobtracker restarts.
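For illustration only, here is a minimal Java sketch of the timestamp-based selection described above: pick the oldest history file as the recovery source, delete everything newer, and open a brand-new file for the current run. The class name (HistoryFileSelection), hostname, job-id, job-name, user-name and timestamps below are made-up placeholders, not part of the patch; the sketch only assumes filenames of the form hostname_timestamp_jobid_jobname_username.

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not part of the patch: decide which history file to
// recover from, which files to delete, and what the new file should be called,
// assuming filenames of the form hostname_timestamp_jobid_jobname_username.
public class HistoryFileSelection {

  // The timestamp is the second underscore-separated field of the filename.
  static long timestampOf(String fileName) {
    return Long.parseLong(fileName.split("_")[1]);
  }

  public static void main(String[] args) {
    String hostname = "jt-host";                        // made-up jobtracker hostname
    String suffix = "job_200809080001_0001_sort_amar";  // made-up jobid_jobname_username
    long now = System.currentTimeMillis();              // timestamp (t2, t3, ...) for this run

    // History files left over from previous runs (at most two per the scheme above).
    List<String> existing = new ArrayList<String>();
    existing.add(hostname + "_1220000500000_" + suffix); // newer, partially written
    existing.add(hostname + "_1220000000000_" + suffix); // oldest, i.e. t1

    // Sort by timestamp: the oldest file is the source for recovery,
    // everything newer is deleted before recovery starts.
    Collections.sort(existing, new Comparator<String>() {
      public int compare(String a, String b) {
        long diff = timestampOf(a) - timestampOf(b);
        return diff < 0 ? -1 : (diff > 0 ? 1 : 0);
      }
    });
    String recoverFrom = existing.get(0);
    List<String> toDelete = existing.subList(1, existing.size());

    // A brand-new file is opened for this run: no rename, no temp file.
    String newHistoryFile = hostname + "_" + now + "_" + suffix;

    System.out.println("recover from : " + recoverFrom);
    System.out.println("delete       : " + toDelete);
    System.out.println("log to       : " + newHistoryFile);
    // On successful recovery the jobtracker would also delete recoverFrom,
    // leaving newHistoryFile as the only history file for the job.
  }
}
{code}

The sketch only shows the filename arithmetic; in the jobtracker the listing, deletes and file creation would presumably go through the Hadoop FileSystem API (listStatus/delete/create) against the history folder.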