[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607299#comment-13607299
 ] 

Siddharth Seth commented on MAPREDUCE-5079:
-------------------------------------------

I like the approach as well. Having to explicitly change recovery to intercept 
new events whenever Job/Task/TaskAttempt (J/T/TA) transitions start generating 
them can get complicated fast.
Considering this is a reasonable change, the patch is simpler than expected, 
which is a bonus for the approach.
I haven't taken a comprehensive look at the patch yet, but here are some 
comments (some are likely separate JIRAs):
- Handling FAILED / KILLED tasks from previous runs (some jobs allow a 
percentage of tasks to fail).
- For successful tasks, if recovery fails for the successful attempt (committer 
failure, etc.), should this be considered a failure and count towards the 
max-attempt limit?
- Speculator info from the previous run could be recovered as well.
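
To make the first two points concrete, here is a minimal sketch of the 
decision involved. It is not taken from the patch: every type and name below 
(AttemptRecord, RestoredTask, restore, countRecoveryFailure) is hypothetical 
and stands in for whatever the AM would read from the job history file. It 
also assumes KILLED attempts do not count towards the limit, and it leaves the 
open question above as a boolean flag rather than answering it.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only: none of these types are Hadoop classes.
public class RecoverySketch {

    enum AttemptStatus { SUCCEEDED, FAILED, KILLED }

    enum TaskState { SUCCEEDED, FAILED, SCHEDULED }

    // Stand-in for one task attempt record read from the job history file.
    static final class AttemptRecord {
        final AttemptStatus status;
        // Assumed outcome of asking the committer to recover this attempt's output.
        final boolean outputRecoverable;

        AttemptRecord(AttemptStatus status, boolean outputRecoverable) {
            this.status = status;
            this.outputRecoverable = outputRecoverable;
        }
    }

    // State a task would be restored to, plus the failures carried over.
    static final class RestoredTask {
        final TaskState state;
        final int failedAttempts;

        RestoredTask(TaskState state, int failedAttempts) {
            this.state = state;
            this.failedAttempts = failedAttempts;
        }
    }

    static RestoredTask restore(List<AttemptRecord> history,
                                int maxAttempts,
                                boolean countRecoveryFailure) {
        int failed = 0;
        boolean succeeded = false;
        for (AttemptRecord attempt : history) {
            switch (attempt.status) {
                case FAILED:
                    failed++; // failures from the previous run are carried over
                    break;
                case KILLED:
                    break; // assumption: killed attempts do not count
                case SUCCEEDED:
                    if (attempt.outputRecoverable) {
                        succeeded = true;
                    } else if (countRecoveryFailure) {
                        failed++; // the open question: treat as a failed attempt
                    }
                    break;
            }
        }
        TaskState state;
        if (succeeded) {
            state = TaskState.SUCCEEDED;
        } else if (failed >= maxAttempts) {
            state = TaskState.FAILED;
        } else {
            state = TaskState.SCHEDULED; // the task needs to run again
        }
        return new RestoredTask(state, failed);
    }

    public static void main(String[] args) {
        List<AttemptRecord> history = Arrays.asList(
                new AttemptRecord(AttemptStatus.FAILED, false),
                new AttemptRecord(AttemptStatus.SUCCEEDED, false));
        RestoredTask strict = restore(history, 2, true);
        RestoredTask lenient = restore(history, 2, false);
        System.out.println(strict.state + " / " + lenient.state);
    }
}
```

With a limit of two attempts, the same history either fails the task outright 
or reschedules it depending on that flag, which is why the answer matters for 
jobs that tolerate only a percentage of failed tasks.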

                
> Recovery should restore task state from job history info directly
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-5079
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5079
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am
>    Affects Versions: 0.23.7
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: MAPREDUCE-5079.patch
>
>
> We've encountered a lot of hanging issues during MR-AM recovery because the 
> state machines don't always end up in the same states after recovery.  This 
> is especially true when speculative execution is enabled.  It should be 
> straightforward to restore task and task attempt states directly from the 
> TaskInfo and TaskAttemptInfo records in the job history file to avoid relying 
> on the task state machines ending up in the proper states with the proper 
> number of attempts.
> This should be a more robust solution that would also give us the option of 
> recovering start time and log locations for tasks that were in-progress when 
> the AM crashed.

