[
https://issues.apache.org/jira/browse/MAPREDUCE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872340#comment-13872340
]
Karthik Kambatla commented on MAPREDUCE-5718:
---------------------------------------------
Thanks for chiming in, Jason.
Please correct me if I am wrong. Not being able to tolerate node failures
(slaves/master) seems like a major regression from MR1, which tolerates slave
failures. I am wondering if there is a way to solve the crashed-commit issue
for all jobs. For MR, what do you think of committing to an
intermediate location and then renaming it to the output location? If the output
location is missing, the commit can be retried.
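
To make the rename idea concrete, here is a minimal sketch. It is not the
attached patch and not how the MR AM currently commits. It assumes a local
filesystem through java.nio.file instead of HDFS, that the staging and output
directories sit on the same filesystem so the rename is atomic, and the names
RenameCommitSketch, OutputWriter, "_staging" and "output" are all made up for
illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RenameCommitSketch {

    /** Hypothetical stand-in for the job-specific step that writes the final output. */
    public interface OutputWriter {
        void writeTo(Path dir) throws IOException;
    }

    /**
     * Commits by writing into a staging directory and renaming it to the output
     * location. Safe to call again after a crash: if the output location already
     * exists, the earlier commit finished; if it is missing, the commit is redone.
     *
     * @return true if this call performed the commit, false if it was already done
     */
    public static boolean commit(Path staging, Path output, OutputWriter writer)
            throws IOException {
        if (Files.exists(output)) {
            return false; // the rename happened before the crash, nothing to redo
        }
        // Output is missing: discard whatever a crashed attempt left behind.
        deleteRecursively(staging);
        Files.createDirectories(staging);
        writer.writeTo(staging);
        // The rename is the commit point. ATOMIC_MOVE fails instead of copying
        // if the two paths are not on the same filesystem.
        Files.move(staging, output, StandardCopyOption.ATOMIC_MOVE);
        return true;
    }

    /** Deletes a directory tree, children before parents; no-op if it is absent. */
    public static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("commit-sketch");
        Path staging = base.resolve("_staging");
        Path output = base.resolve("output");

        // Simulate an attempt that crashed mid-commit: partial staging, no output.
        Files.createDirectories(staging);
        Files.write(staging.resolve("partial"), new byte[] {0});

        OutputWriter writer = dir ->
                Files.write(dir.resolve("part-00000"), "data\n".getBytes(StandardCharsets.UTF_8));

        System.out.println("retry committed: " + commit(staging, output, writer));
        System.out.println("second call committed: " + commit(staging, output, writer));
    }
}
```

The sketch only covers the case where the whole commit can be redone from
scratch. A committer whose work cannot be repeated would need something more.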
> MR AM should tolerate RM restart/failover during commit
> -------------------------------------------------------
>
> Key: MAPREDUCE-5718
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5718
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mr-am
> Affects Versions: 2.4.0
> Reporter: Karthik Kambatla
> Assignee: Karthik Kambatla
> Labels: ha
> Attachments: mr-5718-0.patch
>
>
> While testing RM HA, we ran into this issue where if the RM fails over while
> an MR AM is in the middle of a commit, the subsequent AM gets spawned but
> dies with a diagnostic message - "We crashed durring a commit".