[ https://issues.apache.org/jira/browse/MAPREDUCE-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137214#comment-13137214 ]
Vinod Kumar Vavilapalli commented on MAPREDUCE-3274:
----------------------------------------------------

That is one monster of a race!

I think the problem is this: today we treat the REMOTE_LAUNCH and REMOTE_CLEANUP events for the same container as distinct, unrelated events in ContainerLauncherImpl. We need to handle them as related, and take action depending on whether the container has been launched or not. The code fixes in MAPREDUCE-3240 for the NodeManager immediately come to mind; they are similar, but deal with cleaning up container processes on the NM.

bq. It might be good if the code that informs the Container what to do could know about killed attempts and if for some reason they ask for something to do they are told to die.

The infrastructure for doing this is already there. It is supposed to work, if not for bugs :) See TaskAttemptListenerImpl (+411), which dishes out tasks; it is supposed to ask them to die if it doesn't know them.

A few things we can do for this:
 - TaskAttempt should register with TaskAttemptListener even *before* the container is launched. Today the registration happens only after the container launches.
 - It should register with TaskAttemptListener.taskHeartBeatHandler only *after* the container is launched, so that the heartBeatHandler doesn't start counting down before the container is even running.
 - And of course, fix the obvious bug: send a DIE to the task if it is not registered with TaskAttemptListener.

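To make the two ideas above concrete, here is a minimal sketch in plain Java. It is not the actual ContainerLauncherImpl or TaskAttemptListenerImpl code: the class names ContainerTracker and AttemptRegistry, their enums, and their methods are all hypothetical, and the real event plumbing is left out. The first class treats launch and cleanup for one container as related by keeping a per-container state; the second answers DIE to any attempt that is not registered.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: correlates launch and cleanup requests per container.
final class ContainerTracker {
    enum State { PREPARED, RUNNING, KILLED_BEFORE_LAUNCH, DONE }
    enum Action { START_CONTAINER, SKIP_LAUNCH, STOP_CONTAINER, NOTHING }

    private final Map<String, State> states = new HashMap<>();

    // Launch request: start the container only if no cleanup got there first.
    synchronized Action onLaunch(String containerId) {
        State s = states.getOrDefault(containerId, State.PREPARED);
        if (s == State.PREPARED) {
            states.put(containerId, State.RUNNING);
            return Action.START_CONTAINER;
        }
        // Either the cleanup won the race or the container was already handled.
        return Action.SKIP_LAUNCH;
    }

    // Cleanup request: stop a running container, or remember the kill so that
    // a launch arriving later is skipped instead of starting the task.
    synchronized Action onCleanup(String containerId) {
        State s = states.getOrDefault(containerId, State.PREPARED);
        switch (s) {
            case RUNNING:
                states.put(containerId, State.DONE);
                return Action.STOP_CONTAINER;
            case PREPARED:
                states.put(containerId, State.KILLED_BEFORE_LAUNCH);
                return Action.NOTHING;
            default:
                return Action.NOTHING;
        }
    }
}

// Hypothetical sketch: attempts register before launch; unknown attempts are told to die.
final class AttemptRegistry {
    enum Reply { RUN_TASK, DIE }

    private final Set<String> registered = ConcurrentHashMap.newKeySet();

    void register(String attemptId) { registered.add(attemptId); }

    void unregister(String attemptId) { registered.remove(attemptId); }

    // Called when a launched JVM asks for work.
    Reply onGetTask(String attemptId) {
        return registered.contains(attemptId) ? Reply.RUN_TASK : Reply.DIE;
    }
}
```

In this sketch, a kill that arrives before the launch leaves the container in KILLED_BEFORE_LAUNCH, so the late launch is skipped. If a container does start anyway, unregistering the killed attempt means its first request for a task gets DIE.
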
> Race condition in MR App Master Preemtion can cause a dead lock
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-3274
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3274
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>             Fix For: 0.23.0, 0.24.0
>
>
> There appears to be a race condition in the MR App Master in relation to
> preempting reducers to let a mapper run. In the particular case that I have
> been debugging a reducer was selected for preemption that did not have a
> container assigned to it yet. When the container became available that reduce
> started running and the previous TA_KILL event appears to have been ignored.