[
https://issues.apache.org/jira/browse/MAPREDUCE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869750#comment-13869750
]
Karthik Kambatla commented on MAPREDUCE-5718:
---------------------------------------------
The failure comes from the following snippet. If the previous AM started the
commit but recorded neither success nor failure, we assume it is an error state.
{code}
if (commitSuccess) {
  shutDownMessage = "We crashed after successfully committing. Recovering.";
  forcedState = JobStateInternal.SUCCEEDED;
} else if (commitFailure) {
  shutDownMessage = "We crashed after a commit failure.";
  forcedState = JobStateInternal.FAILED;
} else {
  //The commit is still pending, commit error
  shutDownMessage = "We crashed durring a commit";
  forcedState = JobStateInternal.ERROR;
}
{code}
To fix this, we can do either of the following:
# Treat the lack of a success/failure file as an artifact of the previous commit
failing due to the RM restart, and re-attempt the commit. The only downside
seems to be the case where the commit itself is buggy: we'll end up retrying the
commit up to the number of attempts allowed.
# Make sure the AM deletes the commit file before failing. Given that the RM/NM
kill the containers, making sure we delete the commit file before dying can be
a little more involved.
[~revans2] - do you think it is reasonable to go with the first option?
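For discussion, here is a minimal sketch of how the first option could change the recovery decision. It is not the actual MRAppMaster code: the class {{CommitRecoverySketch}}, the enum {{RecoveryAction}}, and the {{decide}} method are hypothetical names, and the three booleans stand in for the presence of the commit-started, commit-success, and commit-failure marker files. The number of retries would still be bounded by the allowed AM attempts, which this sketch does not model.

```java
// Hypothetical sketch of option 1; none of these names exist in Hadoop.
class CommitRecoverySketch {

    // Possible outcomes when a new AM attempt inspects the previous attempt's markers.
    enum RecoveryAction { NORMAL_START, FORCE_SUCCEEDED, FORCE_FAILED, RETRY_COMMIT }

    // Decide what the new AM attempt should do, given which marker files exist.
    static RecoveryAction decide(boolean commitStarted,
                                 boolean commitSuccess,
                                 boolean commitFailure) {
        if (!commitStarted) {
            // The previous attempt never began committing; nothing to recover.
            return RecoveryAction.NORMAL_START;
        }
        if (commitSuccess) {
            // Same as today: the job is forced to SUCCEEDED.
            return RecoveryAction.FORCE_SUCCEEDED;
        }
        if (commitFailure) {
            // Same as today: the job is forced to FAILED.
            return RecoveryAction.FORCE_FAILED;
        }
        // Proposed change: a pending commit is assumed to have been interrupted
        // (for example by an RM failover) and is retried instead of forcing ERROR.
        return RecoveryAction.RETRY_COMMIT;
    }

    public static void main(String[] args) {
        // Commit started, but neither success nor failure was recorded.
        System.out.println(decide(true, false, false));
    }
}
```

The only branch that differs from the snippet above is the last one, where the pending commit leads to a retry rather than an ERROR state.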
> MR AM should tolerate RM failover during commit
> -----------------------------------------------
>
> Key: MAPREDUCE-5718
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5718
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mr-am
> Affects Versions: 2.4.0
> Reporter: Karthik Kambatla
> Assignee: Karthik Kambatla
> Priority: Blocker
> Labels: ha
>
> While testing RM HA, we ran into this issue where if the RM fails over while
> an MR AM is in the middle of a commit, the subsequent AM gets spawned but
> dies with a diagnostic message - "We crashed durring a commit".