[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137214#comment-13137214
 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-3274:
----------------------------------------------------

That is one monster of a race!

I think the problem is this: today we treat the REMOTE_LAUNCH and REMOTE_CLEANUP 
events for the same container as distinct, unrelated events in 
ContainerLauncherImpl. We need to handle them as related and act depending on 
whether the container has been launched or not. The code fixes in MAPREDUCE-3240 
for the NodeManager immediately come to mind; they are similar, but for 
cleaning up container processes on the NM.
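
To make the idea concrete, here is a minimal sketch of tracking launch and 
cleanup per container. It is not the actual ContainerLauncherImpl code: the 
class, the State enum, and the method names are all hypothetical, and container 
ids are plain strings for brevity.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch only; none of these names exist in Hadoop.
class ContainerEventTracker {
    enum State { PREP, RUNNING, DONE }

    // One record per container so launch and cleanup see each other.
    static final class Container {
        private State state = State.PREP;

        // Returns true only if the caller should really start the container.
        synchronized boolean launch() {
            if (state == State.PREP) {
                state = State.RUNNING;
                return true;
            }
            return false; // cleanup already arrived (or already launched): skip
        }

        // Returns true only if there is a launched container to stop.
        synchronized boolean cleanup() {
            State previous = state;
            state = State.DONE;
            return previous == State.RUNNING;
        }
    }

    private final ConcurrentMap<String, Container> containers =
        new ConcurrentHashMap<>();

    private Container get(String containerId) {
        return containers.computeIfAbsent(containerId, k -> new Container());
    }

    // Handler for a REMOTE_LAUNCH-style event.
    boolean onRemoteLaunch(String containerId) {
        return get(containerId).launch();
    }

    // Handler for a REMOTE_CLEANUP-style event.
    boolean onRemoteCleanup(String containerId) {
        return get(containerId).cleanup();
    }
}
```

With this shape, a cleanup that wins the race marks the container DONE, so the 
late launch becomes a no-op instead of starting a container nobody will stop.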

bq. It might be good if the code that informs the Container what to do could 
know about killed attempts and if for some reason they ask for something to do 
they are told to die.
The infrastructure for doing this is already there. It is supposed to work, if 
not for bugs :) See TaskAttemptListenerImpl (+411), which dishes out tasks; it 
is supposed to ask them to die if it doesn't know them. Two things we can do for 
this:
 - TaskAttempt should register with TaskAttemptListener even *before* the 
container is launched. Today the registration happens only after the container 
launches.
 - It should register with TaskAttemptListener.taskHeartBeatHandler *after* the 
container is launched so that heartBeatHandler doesn't start counting down even 
before the container is launched.
 - And of course, fix the obvious bug: send a DIE to the task if it is not 
registered with TaskAttemptListener.
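
A rough sketch of that ordering, under stated assumptions: the class below is 
hypothetical and does not mirror the real TaskAttemptListenerImpl API, attempt 
ids are plain strings, and the heartbeat handler is reduced to a timestamp map.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch only; names do not match the real listener classes.
class AttemptRegistrySketch {
    enum Reply { RUN_TASK, DIE }

    private final Set<String> registered = new HashSet<>();
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    // Step 1: register the attempt *before* the container launch is requested.
    synchronized void registerPendingAttempt(String attemptId) {
        registered.add(attemptId);
    }

    // Step 2: start heartbeat tracking only *after* the container has launched,
    // so the timeout does not count down while the launch is still pending.
    synchronized void registerLaunchedAttempt(String attemptId, long nowMillis) {
        if (registered.contains(attemptId)) {
            lastHeartbeat.put(attemptId, nowMillis);
        }
    }

    // Called when the attempt is killed, for example on preemption.
    synchronized void unregister(String attemptId) {
        registered.remove(attemptId);
        lastHeartbeat.remove(attemptId);
    }

    // Step 3: a container asking for work on behalf of an unknown attempt
    // is told to die.
    synchronized Reply getTask(String attemptId) {
        return registered.contains(attemptId) ? Reply.RUN_TASK : Reply.DIE;
    }

    synchronized boolean isHeartbeatMonitored(String attemptId) {
        return lastHeartbeat.containsKey(attemptId);
    }
}
```

In the reported scenario, the kill would unregister the attempt before its 
container exists, so when the container later comes up and asks for a task, it 
gets DIE rather than silently starting to run.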
                
> Race condition in MR App Master Preemption can cause a deadlock
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-3274
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3274
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>             Fix For: 0.23.0, 0.24.0
>
>
> There appears to be a race condition in the MR App Master in relation to 
> preempting reducers to let a mapper run.  In the particular case that I have 
> been debugging, a reducer that did not yet have a container assigned to it was 
> selected for preemption. When the container became available, that reducer 
> started running, and the previous TA_KILL event appears to have been ignored.


        
