Simplify Job Recovery
---------------------

                 Key: MAPREDUCE-873
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-873
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: jobtracker
    Affects Versions: 0.20.1
            Reporter: Devaraj Das
            Assignee: Sharad Agarwal
             Fix For: 0.21.0


On a couple of occasions we have seen the JobTracker fail to handle job 
recovery well, leading to cluster downtime after a restart. The current design 
for job recovery is complex and prone to corner cases that are not handled 
correctly. In retrospect, the transaction-log-based approach proposed in 
HADOOP-3245 (http://tinyurl.com/luh9hb) would have been a better and simpler 
model. However, that is a big project, and for the medium term simply 
resubmitting jobs after a restart seems a good tradeoff. That is, after a 
restart the JobTracker will resubmit all jobs that were running in its 
previous incarnation. They will all start from the beginning (the downside is 
that completed tasks will re-execute). In the long term, the transaction log 
model or some variant of it should be pursued.
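
To make the proposal concrete, here is a minimal sketch of resubmit-on-restart. 
It is not JobTracker code: ResubmitSketch, RunningJobStore, Tracker and Job are 
hypothetical names invented for illustration, and the in-memory set stands in 
for whatever persistent record of running jobs a real implementation would 
keep. The only behaviour it models is the one described above: jobs that were 
running are resubmitted, and their task progress is discarded.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: none of these types exist in Hadoop.
final class ResubmitSketch {

    /** Hypothetical job record: an id plus a count of completed tasks. */
    static final class Job {
        final String id;
        int completedTasks;
        Job(String id) { this.id = id; }
    }

    /** Hypothetical stand-in for a persistent list of running job ids. */
    static final class RunningJobStore {
        private final Set<String> ids = new LinkedHashSet<>();
        void add(String id) { ids.add(id); }
        void remove(String id) { ids.remove(id); }
        List<String> snapshot() { return new ArrayList<>(ids); }
    }

    /** Hypothetical tracker; its in-memory state is lost on restart. */
    static final class Tracker {
        final Map<String, Job> jobs = new LinkedHashMap<>();
        private final RunningJobStore store;

        Tracker(RunningJobStore store) { this.store = store; }

        void submit(String id) {
            jobs.put(id, new Job(id));   // fresh job, no completed tasks
            store.add(id);               // remember it across restarts
        }

        void taskCompleted(String id) { jobs.get(id).completedTasks++; }

        void jobFinished(String id) {
            jobs.remove(id);
            store.remove(id);            // finished jobs are not resubmitted
        }

        /** After a restart: resubmit every job that was still running. */
        void recover() {
            for (String id : store.snapshot()) {
                submit(id);              // starts over; earlier progress is gone
            }
        }
    }

    public static void main(String[] args) {
        RunningJobStore store = new RunningJobStore();
        Tracker first = new Tracker(store);
        first.submit("job_1");
        first.submit("job_2");
        first.taskCompleted("job_1");
        first.jobFinished("job_2");

        Tracker restarted = new Tracker(store);  // simulates a restart
        restarted.recover();
        System.out.println(restarted.jobs.keySet());
    }
}
```

The sketch keeps no per-task state at all, which is where the simplicity comes 
from and also why completed tasks run again after a restart.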

Thoughts/comments welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
