[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872366#comment-13872366
 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-5718:
----------------------------------------------------

bq. If we had an API by which the output committer could tell the AM if it's 
safe to retry a job commit that would help.
Jason, we already have the recover API in OutputCommitter. That should be 
enough, right?
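
To make the question concrete, here is a minimal sketch of the decision a 
restarted AM attempt faces. It does not use any Hadoop classes: CommitState, 
Action, and decide are hypothetical names, and the safeToRetryCommit flag 
stands in for whatever answer the output committer would give. It is an 
illustration under those assumptions, not the AM's actual logic.

```java
public class CommitRetrySketch {
    /** Hypothetical: what the previous AM attempt left behind about the job commit. */
    public enum CommitState { NOT_STARTED, STARTED_NOT_FINISHED, SUCCEEDED, FAILED }

    /** Hypothetical: what the new AM attempt does next. */
    public enum Action { RUN_AND_COMMIT, RETRY_COMMIT, REPORT_SUCCESS, REPORT_FAILURE }

    /**
     * Decide the next step for a new AM attempt.
     * safeToRetryCommit is an assumed flag supplied by the committer; it only
     * matters when the previous attempt died in the middle of the commit.
     */
    public static Action decide(CommitState previous, boolean safeToRetryCommit) {
        switch (previous) {
            case NOT_STARTED:
                return Action.RUN_AND_COMMIT;   // commit never began, nothing to undo
            case SUCCEEDED:
                return Action.REPORT_SUCCESS;   // output is already committed
            case FAILED:
                return Action.REPORT_FAILURE;   // commit failed cleanly last time
            case STARTED_NOT_FINISHED:
                // The ambiguous case: output may be partially committed.
                return safeToRetryCommit ? Action.RETRY_COMMIT : Action.REPORT_FAILURE;
            default:
                throw new IllegalStateException("Unknown state: " + previous);
        }
    }

    public static void main(String[] args) {
        // Without a retry-safety signal, a mid-commit crash fails the job.
        System.out.println(decide(CommitState.STARTED_NOT_FINISHED, false)); // prints REPORT_FAILURE
        // With a committer that says a repeat commit is safe, the AM can retry.
        System.out.println(decide(CommitState.STARTED_NOT_FINISHED, true));  // prints RETRY_COMMIT
    }
}
```

Only the STARTED_NOT_FINISHED branch depends on the committer, which is why 
the discussion centers on whether the existing recovery hooks already answer 
that one question.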

bq. Please correct me if I am wrong. Not being able to tolerate node failures 
(slaves/master) seems like a major regression from MR1 which tolerates slave 
failures.
This isn't a regression; it is more an artifact of the new architecture. MR1 
doesn't tolerate master failures either. An AM failure is akin to a master 
failure, except that the master is distributed now.

> MR AM should tolerate RM restart/failover during commit
> -------------------------------------------------------
>
>                 Key: MAPREDUCE-5718
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5718
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>              Labels: ha
>         Attachments: mr-5718-0.patch
>
>
> While testing RM HA, we ran into this issue where if the RM fails over while 
> an MR AM is in the middle of a commit, the subsequent AM gets spawned but 
> dies with a diagnostic message - "We crashed durring a commit". 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
