[ https://issues.apache.org/jira/browse/HADOOP-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508160 ]

Alejandro Abdelnur commented on HADOOP-1121:
--------------------------------------------

Reviving this issue (hoping it will make it for 0.14).

I'd like to hear opinions on my last comments.

On #3 (OutputFormat, existing dirs and deleting), I had to do some homework on 
my side.

Building on Doug's idea in his first comment, the temporary output directory 
approach would work. Something like:

* OutputFormat changes:
  * Change the getRecordWriter() contract - it should write into a non-existing 
temp directory and keep track of it.
  * Add a method done() - it should move/rename the temp directory to the 
output directory.

* JobTracker changes:
  * On startup it should clean up the temp directory area.
  * On job completion it should call done() on the OutputFormat instance.

If this seems alright I can create a separate issue for this.
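
To make it concrete, here is a rough sketch of the bookkeeping (an illustration 
only; TempDirOutputHelper and the _temporary_<jobid> naming are made up for the 
example, this is not a patch against the real OutputFormat interface):

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative helper; names and layout are assumptions, not the real API.
    public class TempDirOutputHelper {

      private final FileSystem fs;
      private final Path outputDir;
      private final Path tempDir;

      public TempDirOutputHelper(FileSystem fs, Path outputDir, String jobId)
          throws IOException {
        this.fs = fs;
        this.outputDir = outputDir;
        // Non-existing temp dir, tracked so done() knows what to promote.
        this.tempDir = new Path(outputDir.getParent(), "_temporary_" + jobId);
        if (fs.exists(tempDir)) {
          throw new IOException("Temp output dir already exists: " + tempDir);
        }
        fs.mkdirs(tempDir);
      }

      /** Directory getRecordWriter() would write into while the job runs. */
      public Path getTempDir() {
        return tempDir;
      }

      /** Proposed done(): move/rename the temp dir to the real output dir. */
      public void done() throws IOException {
        if (!fs.rename(tempDir, outputDir)) {
          throw new IOException("Could not rename " + tempDir + " to " + outputDir);
        }
      }
    }

The nice property is that output only shows up under the final path after 
done() succeeds; a crash mid-job leaves only the temp directory behind, which 
the JobTracker startup cleanup can sweep.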


> Recovering running/scheduled jobs after JobTracker failure
> ----------------------------------------------------------
>
>                 Key: HADOOP-1121
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1121
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: patch1121.txt
>
>
> Currently all running/scheduled jobs are kept in memory in the JobTracker. If 
> the JobTracker goes down, all the running/scheduled jobs have to be 
> resubmitted.
> Proposal:
> (1) On job submission the JobTracker would save the job configuration 
> (job.xml) in a jobs DFS directory using the jobID as name.
> (2) On job completion (success, failure, kill) it would delete the job 
> configuration from the jobs DFS directory.
> (3) On JobTracker failure the jobs DFS directory will contain the 
> configuration of all jobs running/scheduled at failure time.
> (4) On startup the JobTracker would check the jobs DFS directory for job 
> config files. If there are none, it means no failure happened on the last 
> stop and there is nothing to be done. If there are job config files in the 
> jobs DFS directory, continue with the following recovery steps:
> (A) Rename all job config files to $JOB_CONFIG_FILE.recover.
> (B) For each $JOB_CONFIG_FILE.recover: delete the output directory if it 
> exists, schedule the job using the original job ID, and delete the 
> $JOB_CONFIG_FILE.recover (a new $JOB_CONFIG_FILE will be there per 
> scheduling, per step #1).
> (C) When (B) is completed, start accepting new job submissions.
> Other details:
> A configuration flag would enable/disable the above behavior; if switched off 
> (the default behavior) nothing of the above happens.
> A startup flag could switch off job recovery for systems that have recovery 
> set to ON.
> Changes to the job ID generation should be put in place to avoid job ID 
> collisions with job IDs from previous failed runs; for example, appending a 
> JT startup timestamp to the job IDs would do.
> Further improvements on top of this one:
> This mechanism would allow having a standby JobTracker node that is started 
> on main JobTracker failure. Making things a little more comprehensive, the 
> backup JobTrackers could be running in warm mode, and heartbeats and ping 
> calls among them would activate a warm standby JobTracker as the new main 
> JobTracker. Together with an enhancement in the JobClient (keeping a list of 
> backup JobTracker URLs), this would enable client fallback to backup 
> JobTrackers.
> State about partially run jobs could be kept (tasks 
> completed/in-progress/pending). This would make it possible to recover jobs 
> halfway through instead of restarting them. 
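
For illustration only, a minimal sketch of the startup recovery scan described 
in steps (4) and (A)-(C) above; the jobs directory layout, the *.recover suffix 
handling and the JobResubmitter hook are assumptions for the example, not taken 
from the attached patch:

    import java.io.IOException;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative sketch; names and the resubmission hook are assumptions.
    public class JobRecoveryScan {

      private static final String RECOVER_SUFFIX = ".recover";

      /**
       * Startup scan of the jobs DFS directory: rename every leftover job
       * config to *.recover, resubmit it under its original ID, then drop the
       * .recover copy (step #1 writes a fresh config for the rescheduled job).
       */
      public static void recover(FileSystem fs, Path jobsDir,
                                 JobResubmitter resubmitter) throws IOException {
        if (!fs.exists(jobsDir)) {
          return; // nothing to recover, the last shutdown was clean
        }
        // (A) rename all job config files to $JOB_CONFIG_FILE.recover
        for (FileStatus status : fs.listStatus(jobsDir)) {
          Path config = status.getPath();
          if (!config.getName().endsWith(RECOVER_SUFFIX)) {
            fs.rename(config, new Path(jobsDir, config.getName() + RECOVER_SUFFIX));
          }
        }
        // (B) resubmit each job, then delete its .recover file
        for (FileStatus status : fs.listStatus(jobsDir)) {
          Path recoverConfig = status.getPath();
          if (recoverConfig.getName().endsWith(RECOVER_SUFFIX)) {
            resubmitter.resubmit(recoverConfig); // deletes/recreates the output dir
            fs.delete(recoverConfig, false);
          }
        }
        // (C) the caller starts accepting new job submissions after this returns
      }

      /** Hypothetical hook into the JobTracker's scheduling code. */
      public interface JobResubmitter {
        void resubmit(Path jobConfig) throws IOException;
      }
    }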

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
