[
https://issues.apache.org/jira/browse/MAPREDUCE-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543334#comment-13543334
]
Siddharth Seth commented on MAPREDUCE-4832:
-------------------------------------------
Was talking to Hitesh offline about this patch. Is this needed at the moment?
Seems like it's possible to avoid multiple AMs by tuning the
AM_LIVENESS_INTERVAL (10 minutes by default) and MR_AM_TO_RM_WAIT_INTERVAL_MS
(6 minutes by default). A new AM should only be started after the existing AM
is done.
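
To make the timing argument concrete, here is a rough sketch of the relationship between the two intervals. It assumes the old AM stops itself once it has been unable to reach the RM for MR_AM_TO_RM_WAIT_INTERVAL_MS, and that the RM only launches a replacement after AM_LIVENESS_INTERVAL has passed without a heartbeat. The class and method names are made up for illustration and are not part of the patch or of Hadoop.

```java
// Illustrative only: AmTimingCheck and its methods are hypothetical names.
final class AmTimingCheck {
    // Defaults as quoted in this comment: 10 minutes and 6 minutes.
    static final long DEFAULT_AM_LIVENESS_INTERVAL_MS = 10L * 60 * 1000;
    static final long DEFAULT_AM_TO_RM_WAIT_INTERVAL_MS = 6L * 60 * 1000;

    // True when the old AM would give up strictly before the RM expires it,
    // under the assumptions stated above.
    static boolean oldAmGivesUpFirst(long amToRmWaitMs, long amLivenessMs) {
        return amToRmWaitMs > 0 && amToRmWaitMs < amLivenessMs;
    }

    // Time left between the old AM giving up and the RM expiring it.
    // A negative value means the two AMs could overlap.
    static long safetyMarginMs(long amToRmWaitMs, long amLivenessMs) {
        return amLivenessMs - amToRmWaitMs;
    }

    public static void main(String[] args) {
        long wait = DEFAULT_AM_TO_RM_WAIT_INTERVAL_MS;
        long liveness = DEFAULT_AM_LIVENESS_INTERVAL_MS;
        System.out.println("old AM gives up first: " + oldAmGivesUpFirst(wait, liveness));
        System.out.println("margin (ms): " + safetyMarginMs(wait, liveness));
    }
}
```

With the defaults this leaves a margin of 4 minutes, which does not account for clock differences or a slow shutdown on the old AM.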
That said, this is definitely an interesting approach to fix the problem.
- Could add a check to ensure the window interval is greater than the AM-RM
heartbeat.
- Does getClock() need to be part of the RMHeartbeatHandler? Looks like the
AppContext can provide this. I think a couple of places use the AppContext,
while others use the RMHeartbeatHandler.
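
For the first point, the check could look roughly like the sketch below. This is not taken from the patch: the class name, method name and parameters are hypothetical, and the real values would come from whatever configuration keys the patch and the AM-RM heartbeat actually use.

```java
// Illustrative only: WindowIntervalCheck is a hypothetical name.
final class WindowIntervalCheck {
    // Rejects a window that is not strictly greater than the AM-RM heartbeat
    // interval, since such a window could expire between two heartbeats.
    static void validate(long windowIntervalMs, long heartbeatIntervalMs) {
        if (heartbeatIntervalMs <= 0) {
            throw new IllegalArgumentException(
                "heartbeat interval must be positive: " + heartbeatIntervalMs);
        }
        if (windowIntervalMs <= heartbeatIntervalMs) {
            throw new IllegalArgumentException(
                "window interval (" + windowIntervalMs
                    + " ms) must be greater than the AM-RM heartbeat interval ("
                    + heartbeatIntervalMs + " ms)");
        }
    }
}
```

Failing at startup seems preferable to silently clamping the value, but either would work.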
Recovery and restart are still WIP. I believe the MR_AM_TO_RM_WAIT_INTERVAL_MS
will need to be looked at again in the context of recovery. This patch, or a
sync via HDFS, seems more useful at that point?
> MR AM can get in a split brain situation
> ----------------------------------------
>
> Key: MAPREDUCE-4832
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4832
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster
> Affects Versions: 2.0.2-alpha, 0.23.5
> Reporter: Robert Joseph Evans
> Assignee: Jason Lowe
> Priority: Critical
> Attachments: MAPREDUCE-4832.patch
>
>
> It is possible for a networking issue to happen where the RM thinks an AM has
> gone down and launches a replacement, but the previous AM is still up and
> running. If the previous AM does not need any more resources from the RM it
> could try to commit either tasks or jobs. This could cause lots of problems
> where the second AM finishes and tries to commit too. This could result in
> data corruption.