[
https://issues.apache.org/jira/browse/MAPREDUCE-873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sharad Agarwal updated MAPREDUCE-873:
-------------------------------------
Status: Patch Available (was: Open)
> Simplify Job Recovery
> ---------------------
>
> Key: MAPREDUCE-873
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-873
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: jobtracker
> Affects Versions: 0.20.1
> Reporter: Devaraj Das
> Assignee: Sharad Agarwal
> Fix For: 0.21.0
>
> Attachments: 873_v1.patch, 873_v2.patch, 873_v3.patch
>
>
> On a couple of occasions we have seen the JobTracker fail to handle job
> recovery well, leading to cluster downtime after a restart. The current
> design for job recovery is complex, and corner cases are not handled well
> enough. In retrospect, the transaction-log-based approach proposed in
> HADOOP-3245 (http://tinyurl.com/luh9hb) would have been a better, simpler
> model. However, that is a big project, and for the medium term, simply
> resubmitting jobs after a restart seems a good tradeoff. That is, after a
> restart the JobTracker will resubmit all jobs that were running in its
> past life. They will all start from the beginning (the downside is that
> completed tasks will re-execute). In the long term, the transaction log
> model or some variant of it should be pursued.
> Thoughts/comments welcome.
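
To make the proposed behaviour concrete, here is a minimal toy sketch of resubmit-on-restart. It is not taken from the attached patches and does not use any Hadoop API: `ToyJobTracker`, `Job`, and `restart` are hypothetical names, and the sketch assumes that only the IDs of running jobs survive a restart.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: str
    completed_tasks: int = 0  # progress made so far in this tracker's lifetime


@dataclass
class ToyJobTracker:
    """Hypothetical stand-in for the JobTracker; not a Hadoop class."""
    running: dict = field(default_factory=dict)

    def submit(self, job_id: str) -> Job:
        # A submitted job always starts with no completed tasks.
        job = Job(job_id)
        self.running[job_id] = job
        return job

    def complete(self, job_id: str) -> None:
        # Finished jobs are no longer candidates for resubmission.
        self.running.pop(job_id)

    def running_job_ids(self) -> list:
        return sorted(self.running)


def restart(old_tracker: ToyJobTracker) -> ToyJobTracker:
    """Simulate a restart: resubmit every job that was running, from scratch."""
    survivors = old_tracker.running_job_ids()  # assumed to be persisted somewhere
    new_tracker = ToyJobTracker()
    for job_id in survivors:
        new_tracker.submit(job_id)  # task progress is deliberately discarded
    return new_tracker
```

The sketch trades repeated work for simplicity: no per-task state has to be reconstructed after a restart, at the cost of re-executing tasks that had already completed.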