[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872189#comment-13872189
 ] 

Jason Lowe commented on MAPREDUCE-5718:
---------------------------------------

This is closely related to MAPREDUCE-5485.  The problem here is that the output 
committer is user-pluggable code, and we can't assume what it does or if it can 
be safely restarted after crashing mid-way through the commit.  This is one of 
the reasons job commits are not retried by the AM, and by extension we can't 
assume it's safe to retry in another AM attempt.  That's why the AM goes out of 
its way to indicate via a file that it's starting to do the job commit and 
avoids repeating it on an AM restart if that file is still present.  Whether 
the retry happens because the AM crashed or because it was restarted due to an 
RM restart, 
the end effect is the same -- it's not safe to retry a job commit in the 
general case.
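
As a rough illustration of the marker-file guard described above, here is a 
minimal sketch using the local filesystem.  The class, the method names and 
the marker file names are assumptions made for this example, not the actual 
MR AM code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CommitMarkerSketch {
    enum Decision { RUN_COMMIT, ALREADY_COMMITTED, UNSAFE_TO_RETRY }

    // Hypothetical marker names, chosen for illustration only.
    static final String STARTED = "COMMIT_STARTED";
    static final String SUCCESS = "COMMIT_SUCCESS";

    // Called by a (re)started AM attempt before it considers committing.
    static Decision decideOnStartup(Path stagingDir) {
        if (Files.exists(stagingDir.resolve(SUCCESS))) {
            return Decision.ALREADY_COMMITTED;
        }
        if (Files.exists(stagingDir.resolve(STARTED))) {
            // A previous attempt began the commit and never finished it.
            // The committer is user code, so its state is unknown.
            return Decision.UNSAFE_TO_RETRY;
        }
        return Decision.RUN_COMMIT;
    }

    // Writes the "started" marker, runs the user commit, then marks success.
    static void commitJob(Path stagingDir, Runnable userCommit) throws IOException {
        Files.createFile(stagingDir.resolve(STARTED));
        userCommit.run(); // may crash mid-way, leaving only the STARTED marker
        Files.createFile(stagingDir.resolve(SUCCESS));
    }
}
```

If the user commit throws or the process dies between the two markers, the 
next attempt sees only the "started" marker and refuses to commit again.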

If we had an API by which the output committer could tell the AM whether it's 
safe to retry a job commit, that would help.
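
One possible shape for such an API, purely as a sketch: the interface, the 
method name isCommitRepeatable and the helper mayRunCommit are all 
hypothetical names invented for this example, and the default is conservative 
so that existing committers keep today's behavior.

```java
public class RepeatableCommitSketch {
    // Hypothetical committer contract; not the real OutputCommitter class.
    interface Committer {
        void commitJob();

        // Committers opt in explicitly; the default assumes a retry is unsafe.
        default boolean isCommitRepeatable() {
            return false;
        }
    }

    // Decision a restarted AM attempt could make before committing.
    static boolean mayRunCommit(Committer committer, boolean commitStartedMarkerPresent) {
        if (!commitStartedMarkerPresent) {
            return true; // no earlier attempt began a commit
        }
        // An earlier attempt began a commit: retry only if the committer allows it.
        return committer.isCommitRepeatable();
    }
}
```

A committer whose commit is idempotent would override isCommitRepeatable to 
return true; everything else would still fail the job as it does now.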

> MR AM should tolerate RM restart/failover during commit
> -------------------------------------------------------
>
>                 Key: MAPREDUCE-5718
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5718
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>              Labels: ha
>         Attachments: mr-5718-0.patch
>
>
> While testing RM HA, we ran into this issue where if the RM fails over while 
> an MR AM is in the middle of a commit, the subsequent AM gets spawned but 
> dies with a diagnostic message - "We crashed durring a commit". 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
