[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872340#comment-13872340
 ] 

Karthik Kambatla commented on MAPREDUCE-5718:
---------------------------------------------

Thanks for chiming in, Jason. 

Please correct me if I am wrong. Not being able to tolerate node failures 
(slaves/master) seems like a major regression from MR1, which tolerates slave 
failures. I am wondering if there is a way to solve the crashed-commit issue 
for all jobs. For MR, what do you think of committing to an intermediate 
location and then renaming it to the output location? If the output location 
is missing, the commit can be retried.
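
To make the commit-then-rename idea concrete, here is a rough sketch on a 
local filesystem. It is only an illustration, not the MR AM or 
OutputCommitter code: commit_output, staging_dir, output_dir and write_output 
are hypothetical names, and it assumes the rename is atomic on the target 
filesystem.

```python
import os
import shutil


def commit_output(staging_dir, output_dir, write_output):
    """Write output into staging_dir, then rename it to output_dir.

    Returns True if this call performed the commit, and False if
    output_dir already exists (an earlier attempt finished the rename).
    All names here are hypothetical and used for illustration only.
    """
    if os.path.exists(output_dir):
        # A previous attempt completed the rename; nothing to redo.
        return False

    # Output location is missing, so the commit is retried from scratch.
    # Discard anything a crashed attempt left in the staging location.
    shutil.rmtree(staging_dir, ignore_errors=True)
    os.makedirs(staging_dir)
    write_output(staging_dir)

    # The rename is the commit point: output_dir becomes visible only
    # after every file has been written. Atomicity of the rename is an
    # assumption about the underlying filesystem.
    os.rename(staging_dir, output_dir)
    return True
```

A restarted attempt would call commit_output again with the same two paths: 
if the output location exists it does nothing, otherwise it redoes the commit.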

> MR AM should tolerate RM restart/failover during commit
> -------------------------------------------------------
>
>                 Key: MAPREDUCE-5718
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5718
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>              Labels: ha
>         Attachments: mr-5718-0.patch
>
>
> While testing RM HA, we ran into this issue where if the RM fails over while 
> an MR AM is in the middle of a commit, the subsequent AM gets spawned but 
> dies with a diagnostic message - "We crashed durring a commit". 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
