Simplify Job Recovery
---------------------
Key: MAPREDUCE-873
URL: https://issues.apache.org/jira/browse/MAPREDUCE-873
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: jobtracker
Affects Versions: 0.20.1
Reporter: Devaraj Das
Assignee: Sharad Agarwal
Fix For: 0.21.0
On a couple of occasions we have seen the JobTracker fail to handle job
recovery well, leading to cluster downtime after a restart. The current
design for handling job recovery is complex and prone to corner cases that are
not handled well enough. In retrospect, the transaction-log-based approach
proposed on HADOOP-3245 (http://tinyurl.com/luh9hb) would have been a
better/simpler model. However, that is a big project, and for the medium term,
just handling job resubmission after a restart seems like a good tradeoff. That
is, after a restart the JobTracker will resubmit all jobs that were running in
its past life. They will all start from the beginning (the downside is that
completed tasks will re-execute). In the long term, the transaction log model
or some variant of it should be pursued.
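
To make the resubmission idea concrete, here is a rough sketch in plain Java.
It is not JobTracker code: JobStore, Tracker, recover() and the string job
confs are all hypothetical names made up for illustration, and the "durable"
store is just an in-memory map standing in for whatever survives a restart.
The point is only that recovery reduces to "read the list of unfinished jobs
and submit each one again", with no per-task state carried over.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class JobResubmissionSketch {

    /** Hypothetical stand-in for storage that outlives a tracker restart. */
    static class JobStore {
        private final Map<String, String> unfinished = new LinkedHashMap<>();

        void recordSubmitted(String jobId, String jobConf) {
            unfinished.put(jobId, jobConf);
        }

        void recordFinished(String jobId) {
            unfinished.remove(jobId);
        }

        /** Returns a copy so callers can resubmit while iterating. */
        Map<String, String> unfinishedJobs() {
            return new LinkedHashMap<>(unfinished);
        }
    }

    /** Hypothetical tracker; its task progress lives only in memory. */
    static class Tracker {
        private final JobStore store;
        private final Map<String, Integer> completedTasks = new LinkedHashMap<>();

        Tracker(JobStore store) {
            this.store = store;
        }

        void submit(String jobId, String jobConf) {
            store.recordSubmitted(jobId, jobConf);
            completedTasks.put(jobId, 0); // every submission starts from scratch
        }

        void taskCompleted(String jobId) {
            completedTasks.merge(jobId, 1, Integer::sum);
        }

        void jobFinished(String jobId) {
            store.recordFinished(jobId);
            completedTasks.remove(jobId);
        }

        /** Returns -1 for a job this tracker instance does not know about. */
        int completedTasks(String jobId) {
            return completedTasks.getOrDefault(jobId, -1);
        }

        /** After a restart: resubmit every job that had not finished. */
        List<String> recover() {
            List<String> resubmitted = new ArrayList<>();
            for (Map.Entry<String, String> job : store.unfinishedJobs().entrySet()) {
                submit(job.getKey(), job.getValue());
                resubmitted.add(job.getKey());
            }
            return resubmitted;
        }
    }

    public static void main(String[] args) {
        JobStore store = new JobStore();

        Tracker beforeRestart = new Tracker(store);
        beforeRestart.submit("job_1", "conf-1");
        beforeRestart.submit("job_2", "conf-2");
        beforeRestart.taskCompleted("job_1"); // this progress is lost on restart
        beforeRestart.jobFinished("job_2");   // finished jobs are not resubmitted

        // Simulate a restart: a new tracker instance over the same store.
        Tracker afterRestart = new Tracker(store);
        System.out.println("Resubmitted: " + afterRestart.recover());
        System.out.println("Completed tasks for job_1: "
                + afterRestart.completedTasks("job_1"));
    }
}
```

In this sketch job_1 comes back with zero completed tasks, which is exactly
the downside noted above: simplicity is bought by re-executing work that had
already finished.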
Thoughts/comments welcome.