[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607646#comment-13607646
 ] 

Jason Lowe commented on MAPREDUCE-5079:
---------------------------------------

Thanks for taking a look at the patch, Sidd.

bq. Handling FAILED / KILLED tasks from previous runs (Some jobs allow a 
percentage of tasks to fail)
Agreed.  In the short term, to limit additional risk, the patch only tries to 
recover the same set of tasks as before.  I'd prefer to handle this in a 
separate JIRA, but it should be easy to do here as well.  In addition to 
FAILED/KILLED tasks, we could also recover information for tasks that were 
RUNNING.  Their in-flight attempts would be marked as KILLED, but we'd at 
least have their start times, the nodes they ran on, and pointers to their logs.
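
As a rough illustration of the in-flight case, here is a minimal sketch. It is written in Python purely as language-neutral pseudocode and is not the code in the patch. `AttemptRecord`, `recover_attempts`, and all field names are hypothetical and do not correspond to the real TaskInfo/TaskAttemptInfo classes.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class AttemptRecord:
    """Hypothetical stand-in for one task attempt read from job history."""
    attempt_id: str
    state: str          # "SUCCEEDED", "FAILED", "KILLED", or "RUNNING"
    start_time: int     # milliseconds since the epoch
    node: str           # host the attempt ran on
    log_location: str   # pointer to the attempt's logs


def recover_attempts(history: List[AttemptRecord]) -> List[AttemptRecord]:
    """Rebuild attempt records for the new AM from the history records.

    Attempts that were still RUNNING when the AM died cannot be resumed,
    so they are recorded as KILLED. Start time, node, and log location
    are carried over unchanged.
    """
    recovered = []
    for rec in history:
        if rec.state == "RUNNING":
            # Copy the record, changing only the state.
            rec = replace(rec, state="KILLED")
        recovered.append(rec)
    return recovered
```

The point of the sketch is only that the state is rewritten while the diagnostic fields survive, so the UI and history for the recovered job can still show where and when the killed attempt ran.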

bq. For Successful tasks, if recovery fails for the successful attempt 
(committer failure, etc) - should this be considered as a failure, and count 
towards the max-attempt limit ?
I debated this a bit when I wrote it and thought I'd rather give the task the 
benefit of the doubt and let it try again rather than fail it.  Recovery isn't 
a "normal" part of the task flow, and I thought it would be better to give the 
task another attempt rather than use up one of the failed attempts if recovery 
encounters an error.  I don't have strong feelings on it though.  If the 
consensus is that it should count as an attempt failure, then it is a 
straightforward change to mark it as such.
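
To make the two options concrete, here is a small sketch of the policy choice. Again this is Python used as pseudocode, not the patch itself. `TaskCounters`, `on_recovery_error`, the `count_as_failure` switch, and the default limit of 4 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskCounters:
    """Hypothetical per-task bookkeeping for failed attempts."""
    failed_attempts: int = 0
    max_attempts: int = 4   # illustrative limit, not read from any config

    def can_retry(self) -> bool:
        return self.failed_attempts < self.max_attempts


def on_recovery_error(counters: TaskCounters,
                      count_as_failure: bool = False) -> bool:
    """Handle an error while recovering a previously successful attempt.

    With count_as_failure=False (the behavior described above), the task
    simply gets another attempt and the failure count is untouched.
    With count_as_failure=True, the error consumes one failed attempt.
    Returns True if the task may launch a new attempt.
    """
    if count_as_failure:
        counters.failed_attempts += 1
    return counters.can_retry()
```

For a task that has already used three of four allowed failures, the first policy still allows a rerun after a recovery error, while the second would fail the task.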

bq. Speculator info from the previous run could be recovered as well.
Yes, we should be able to reconstruct many, if not all, of the speculator 
events as well.  I'd prefer to defer that to a separate JIRA.
                
> Recovery should restore task state from job history info directly
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-5079
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5079
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am
>    Affects Versions: 0.23.7
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: MAPREDUCE-5079.patch
>
>
> We've encountered a lot of hanging issues during MR-AM recovery because the 
> state machines don't always end up in the same states after recovery.  This 
> is especially true when speculative execution is enabled.  It should be 
> straightforward to restore task and task attempt states directly from the 
> TaskInfo and TaskAttemptInfo records in the job history file to avoid relying 
> on the task state machines ending up in the proper states with the proper 
> number of attempts.
> This should be a more robust solution that would also give us the option of 
> recovering start time and log locations for tasks that were in-progress when 
> the AM crashed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
