[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543334#comment-13543334
 ] 

Siddharth Seth commented on MAPREDUCE-4832:
-------------------------------------------

Was talking to Hitesh offline about this patch. Is this needed at the moment?
It seems possible to avoid multiple AMs by tuning the 
AM_LIVENESS_INTERVAL (10 minutes by default) and MR_AM_TO_RM_WAIT_INTERVAL_MS 
(6 minutes by default). With those set appropriately, a new AM should only be 
started after the existing AM is done.
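
To make the relationship between the two intervals concrete, here is a minimal sketch. The class and method names are hypothetical and not Hadoop APIs; only the two default values come from the comment above. It assumes both timers effectively start at the last successful AM-RM heartbeat and ignores clock skew and AM shutdown time.

```java
import java.util.concurrent.TimeUnit;

/** Sketch only: class and method names are hypothetical, not Hadoop APIs. */
final class AmIntervalCheck {
    // Defaults as quoted in the comment: 10 minutes and 6 minutes.
    static final long DEFAULT_AM_LIVENESS_INTERVAL_MS = TimeUnit.MINUTES.toMillis(10);
    static final long DEFAULT_AM_TO_RM_WAIT_INTERVAL_MS = TimeUnit.MINUTES.toMillis(6);

    /**
     * Returns true if an AM that has lost contact with the RM would give up
     * (after amToRmWaitMs) before the RM expires it (after amLivenessMs) and
     * launches a replacement.
     */
    static boolean oldAmStopsBeforeReplacement(long amLivenessMs, long amToRmWaitMs) {
        return amToRmWaitMs > 0 && amToRmWaitMs < amLivenessMs;
    }

    public static void main(String[] args) {
        // Check the default pair of values quoted above.
        System.out.println(oldAmStopsBeforeReplacement(
                DEFAULT_AM_LIVENESS_INTERVAL_MS, DEFAULT_AM_TO_RM_WAIT_INTERVAL_MS));
    }
}
```

If the wait interval were raised above the liveness interval, the check would fail, which is the configuration in which two AMs could overlap.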
 
That said, this is definitely an interesting approach to fix the problem.
- Could add a check to ensure the window interval is greater than the AM-RM 
heartbeat interval.
- Does getClock() need to be part of the RMHeartbeatHandler? Looks like the 
AppContext can provide this. I think a couple of places use the AppContext, 
while others use the RMHeartbeatHandler.
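
The check suggested in the first bullet could look roughly like the sketch below. All names here are hypothetical and are not taken from the patch. It assumes the window is measured from the last successful heartbeat, in which case a window no longer than the heartbeat interval could lapse between two normal heartbeats.

```java
/** Sketch only: names are hypothetical and do not come from the patch. */
final class CommitWindowCheck {

    /** Rejects a window that is not strictly greater than the heartbeat interval. */
    static void validate(long commitWindowMs, long heartbeatIntervalMs) {
        if (heartbeatIntervalMs <= 0) {
            throw new IllegalArgumentException(
                    "heartbeat interval must be positive: " + heartbeatIntervalMs);
        }
        if (commitWindowMs <= heartbeatIntervalMs) {
            throw new IllegalArgumentException(
                    "window (" + commitWindowMs + " ms) must be greater than the "
                    + "heartbeat interval (" + heartbeatIntervalMs + " ms)");
        }
    }

    /** True if the last heartbeat is recent enough to fall inside the window. */
    static boolean heartbeatIsRecent(long lastHeartbeatMs, long nowMs, long commitWindowMs) {
        return nowMs - lastHeartbeatMs <= commitWindowMs;
    }

    public static void main(String[] args) {
        validate(10_000L, 1_000L); // accepted: window is larger than the heartbeat
        System.out.println(heartbeatIsRecent(5_000L, 6_000L, 10_000L));
    }
}
```

The timestamps are passed in rather than read inside the method, so the same code works whichever component ends up supplying the clock.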

Recovery and restart are still WIP. I believe MR_AM_TO_RM_WAIT_INTERVAL_MS 
will need to be looked at again in the context of recovery. This patch, or a 
sync via HDFS, seems more useful at that point?
                
> MR AM can get in a split brain situation
> ----------------------------------------
>
>                 Key: MAPREDUCE-4832
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4832
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Robert Joseph Evans
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-4832.patch
>
>
> It is possible for a networking issue to happen where the RM thinks an AM has 
> gone down and launches a replacement, but the previous AM is still up and 
> running.  If the previous AM does not need any more resources from the RM it 
> could try to commit either tasks or jobs.  This could cause lots of problems 
> where the second AM finishes and tries to commit too.  This could result in 
> data corruption.  

