[ https://issues.apache.org/jira/browse/MAPREDUCE-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543727#comment-13543727 ]
Siddharth Seth commented on MAPREDUCE-4832:
-------------------------------------------
bq. AM is running on a node whose NM suddenly declares itself UNHEALTHY via
health-check script
Right, there are multiple ways in which an AM may time out, and this specific
case can lead to multiple AMs running at once, so a fix is required.
I'm +1 for the updated patch.
> MR AM can get in a split brain situation
> ----------------------------------------
>
> Key: MAPREDUCE-4832
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4832
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster
> Affects Versions: 2.0.2-alpha, 0.23.5
> Reporter: Robert Joseph Evans
> Assignee: Jason Lowe
> Priority: Critical
> Attachments: MAPREDUCE-4832.patch, MAPREDUCE-4832.patch
>
>
> A network problem can make the RM think an AM has gone down and launch a
> replacement while the previous AM is still up and running. If the previous AM
> does not need any more resources from the RM, it could still try to commit
> tasks or jobs. If the second AM then finishes and tries to commit as well,
> the two commits can conflict, which could result in data corruption.
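
One way to narrow the window described above is to let an AM commit only while
it has recently heard from the RM. The following is a minimal sketch of that
idea under stated assumptions, not the attached patch: the class CommitGuard,
its method names, and the injected clock are all hypothetical and exist only
for illustration.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical guard (not taken from the patch): permits a commit only if the
 * last successful RM heartbeat falls within a configurable window.
 */
final class CommitGuard {
    private final long commitWindowMs;   // maximum allowed heartbeat age
    private final LongSupplier clock;    // injected so the guard can be tested
    private volatile long lastHeartbeatMs;

    CommitGuard(long commitWindowMs, LongSupplier clock) {
        if (commitWindowMs <= 0) {
            throw new IllegalArgumentException("commitWindowMs must be positive");
        }
        this.commitWindowMs = commitWindowMs;
        this.clock = clock;
        // Assumption: the AM has just registered with the RM when the guard is built.
        this.lastHeartbeatMs = clock.getAsLong();
    }

    /** Call after every heartbeat that the RM answered successfully. */
    void onSuccessfulHeartbeat() {
        lastHeartbeatMs = clock.getAsLong();
    }

    /** True if the RM was heard from recently enough to start a commit. */
    boolean mayCommit() {
        return clock.getAsLong() - lastHeartbeatMs <= commitWindowMs;
    }
}
```

An AM using such a guard would check mayCommit() immediately before a task or
job commit and hold off when it returns false. The check only helps if the
window is shorter than the time the RM waits before declaring the AM dead and
launching a replacement.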