[
https://issues.apache.org/jira/browse/MAPREDUCE-5079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627311#comment-13627311
]
Siddharth Seth commented on MAPREDUCE-5079:
-------------------------------------------
Overall, looks pretty good. Comments and mostly minor nits
- TaskAttemptImpl.needToClean - actual cleanup seems to be buried inside the
"KILLED" code block. Does this need to be moved out ?
- JobImpl.startTansition - there's an unnecessary null check on 'jse'
- processRecovery can be changed to check the appAttemptId up front, and change
the logged message. The message saying "Will not try to recover" on attempt 1
can be a little confusing.
- JobHistoryUtils is likely a better place for getPreviousJobHistoryFileStream
- TaskImpl.sendTaskSucceededEvent - has a log message about killing remaining
containers. This isn't needed here.
- Conversion from attaemptState to TaskAttemptCompletionEventStatus - this is a
direct string conversion based on the fact that only a select set of values are
used (FAILED/KILLED/SUCCEEDED). Would prefer this conversion being explicit,
with an error on an incorrect state. This won't happen now, but may simplify
things for future changes.
- MRJobConfig.MR_AM_JOB_RECOVERY_ENABLE - can we add a default value in
MRJobConfig instead of leaving it in the MRAppMaster.
For the unit tests,
- some formatting fixes are required. > 80 column width, spaces etc.
- testRecoveryTaskSuccessAllAttemptsFail could also verify that a new attempt
is created
- testRecoveryTaskSuccessAllAttemptsSucceed could verify the correct event
being sent to the job as a CompletionEvent
- ApplicationId, JobId construction via BuilderUtils, MRBuilderUtils instead of
directly
- when(mockTAinfo.getState()).thenReturn("RUNNING"); //?? - RUNNING can be
replaced with a random string to be less confusing. This is a user process
reported string afaik.
> Recovery should restore task state from job history info directly
> -----------------------------------------------------------------
>
> Key: MAPREDUCE-5079
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5079
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: mr-am
> Affects Versions: 0.23.7
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Attachments: MAPREDUCE-5079-branch-0.23.patch, MAPREDUCE-5079.patch,
> MAPREDUCE-5079.patch, MAPREDUCE-5079.patch
>
>
> We've encountered a lot of hanging issues during MR-AM recovery because the
> state machines don't always end up in the same states after recovery. This
> is especially true when speculative execution is enabled. It should be
> straightforward to restore task and task attempt states directly from the
> TaskInfo and TaskAttemptInfo records in the job history file to avoid relying
> on the task state machines ending up in the proper states with the proper
> number of attempts.
> This should be a more robust solution that would also give us the option of
> recovering start time and log locations for tasks that were in-progress when
> the AM crashed.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira