[ https://issues.apache.org/jira/browse/MAPREDUCE-5079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607646#comment-13607646 ]
Jason Lowe commented on MAPREDUCE-5079:
---------------------------------------
Thanks for taking a look at the patch, Sidd.
bq. Handling FAILED / KILLED tasks from previous runs (Some jobs allow a
percentage of tasks to fail)
Agreed. In the short term, to limit additional risk, the patch only tries to
recover the same set of tasks as before. I'd prefer to handle this in a
separate JIRA, but it should be easy to do it here as well. In addition to
FAILED/KILLED tasks, we could also recover information for tasks that were
RUNNING. Their in-flight attempts would be marked as KILLED, but we'd at least
have their start times, the nodes they ran on, and pointers to their logs.
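As a rough illustration of the RUNNING case, here is a minimal sketch. It is
not code from the patch: AttemptRecord, AttemptState, and recoverAttempts are
hypothetical stand-ins for whatever the AM would read from the job history
file. The only point is that an in-flight attempt changes state to KILLED
while its start time, node, and log pointer carry over untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class InFlightRecoverySketch {
    // Hypothetical attempt states; the real AM has more of them.
    enum AttemptState { RUNNING, SUCCEEDED, FAILED, KILLED }

    // Hypothetical stand-in for an attempt record read from job history.
    static final class AttemptRecord {
        final String attemptId;
        final AttemptState state;
        final long startTime;
        final String node;
        final String logLocation;

        AttemptRecord(String attemptId, AttemptState state, long startTime,
                      String node, String logLocation) {
            this.attemptId = attemptId;
            this.state = state;
            this.startTime = startTime;
            this.node = node;
            this.logLocation = logLocation;
        }
    }

    // Attempts that were RUNNING when the AM died are recovered as KILLED;
    // everything else about them, and all other attempts, is kept as recorded.
    static List<AttemptRecord> recoverAttempts(List<AttemptRecord> fromHistory) {
        List<AttemptRecord> recovered = new ArrayList<>();
        for (AttemptRecord a : fromHistory) {
            if (a.state == AttemptState.RUNNING) {
                recovered.add(new AttemptRecord(a.attemptId, AttemptState.KILLED,
                        a.startTime, a.node, a.logLocation));
            } else {
                recovered.add(a);
            }
        }
        return recovered;
    }

    public static void main(String[] args) {
        // Made-up sample data, purely for demonstration.
        List<AttemptRecord> history = Arrays.asList(
                new AttemptRecord("attempt_a", AttemptState.RUNNING, 1000L, "node1", "logs/attempt_a"),
                new AttemptRecord("attempt_b", AttemptState.SUCCEEDED, 2000L, "node2", "logs/attempt_b"));
        for (AttemptRecord a : recoverAttempts(history)) {
            System.out.println(a.attemptId + " " + a.state + " " + a.startTime
                    + " " + a.node + " " + a.logLocation);
        }
    }
}
```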
bq. For Successful tasks, if recovery fails for the successful attempt
(committer failure, etc) - should this be considered as a failure, and count
towards the max-attempt limit ?
I debated this a bit when I wrote it and thought I'd rather give the task the
benefit of the doubt and let it try again rather than fail it. Recovery isn't
a "normal" part of the task flow, and I thought it would be better to give the
task another attempt rather than use up one of the failed attempts if recovery
encounters an error. I don't have strong feelings on it, though. If the
consensus is that it should count as an attempt failure, then it is a
straightforward change to mark it as such.
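To make the two options concrete, here is a small sketch of the accounting
difference. It is an assumption-laden illustration, not the patch itself:
TaskBookkeeping, handleRecoveryError, and the countAsFailure flag are all
hypothetical names, and the real max-attempt check in the AM is more involved.

```java
final class RecoveryFailurePolicySketch {
    // Hypothetical per-task bookkeeping; not the real MR-AM task implementation.
    static final class TaskBookkeeping {
        final int maxAttempts;
        int failedAttempts = 0;

        TaskBookkeeping(int maxAttempts) {
            this.maxAttempts = maxAttempts;
        }

        // The task may launch another attempt while it is under the failure limit.
        boolean canLaunchAnotherAttempt() {
            return failedAttempts < maxAttempts;
        }
    }

    // Called when recovering a previously successful attempt hits an error
    // (for example, a committer failure). With countAsFailure == false the task
    // gets the benefit of the doubt and its failure count is left alone; with
    // countAsFailure == true the error consumes one of the allowed failures.
    // Returns whether the task can still launch a new attempt.
    static boolean handleRecoveryError(TaskBookkeeping task, boolean countAsFailure) {
        if (countAsFailure) {
            task.failedAttempts++;
        }
        return task.canLaunchAnotherAttempt();
    }

    public static void main(String[] args) {
        TaskBookkeeping lenient = new TaskBookkeeping(1);
        TaskBookkeeping strict = new TaskBookkeeping(1);
        System.out.println("lenient can retry: " + handleRecoveryError(lenient, false));
        System.out.println("strict can retry: " + handleRecoveryError(strict, true));
    }
}
```

Under this sketch, switching between the two behaviors is a one-flag change.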
bq. Speculator info from the previous run could be recovered as well.
Yes, we should be able to reconstruct many, if not all, of the speculator
events as well. I'd prefer to defer that to a separate JIRA.
> Recovery should restore task state from job history info directly
> -----------------------------------------------------------------
>
> Key: MAPREDUCE-5079
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5079
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: mr-am
> Affects Versions: 0.23.7
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Attachments: MAPREDUCE-5079.patch
>
>
> We've encountered a lot of hanging issues during MR-AM recovery because the
> state machines don't always end up in the same states after recovery. This
> is especially true when speculative execution is enabled. It should be
> straightforward to restore task and task attempt states directly from the
> TaskInfo and TaskAttemptInfo records in the job history file to avoid relying
> on the task state machines ending up in the proper states with the proper
> number of attempts.
> This should be a more robust solution that would also give us the option of
> recovering start time and log locations for tasks that were in-progress when
> the AM crashed.