[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13138438#comment-13138438
 ] 

Robert Joseph Evans commented on MAPREDUCE-3274:
------------------------------------------------

I have an initial patch that asks containers to die if the AM does not expect 
them to be alive.  I am going to do some more extensive testing on it to be 
sure it fixes the deadlock before I declare victory and submit it.
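
For what it is worth, the shape of the AM-side check is roughly the sketch 
below.  The class and method names are made up for illustration only and are 
not from the actual patch: every time a running container heartbeats in, the 
AM looks the attempt up in the set of attempts it still expects to be alive, 
and if it is not there the AM tells the container to exit.

{code}
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch only: the names here are illustrative, not the
// actual MAPREDUCE-3274 patch.
public class AttemptLivenessChecker {

  enum HeartbeatResponse { CONTINUE, SHOULD_DIE }

  // Attempt IDs the AM currently expects to be alive.
  private final Set<String> expectedAttempts =
      Collections.synchronizedSet(new HashSet<String>());

  public void register(String attemptId)   { expectedAttempts.add(attemptId); }
  public void unregister(String attemptId) { expectedAttempts.remove(attemptId); }

  // Called when a running container heartbeats in to the AM.  If the AM no
  // longer expects this attempt to be alive, ask the container to exit.
  public HeartbeatResponse onHeartbeat(String attemptId) {
    return expectedAttempts.contains(attemptId)
        ? HeartbeatResponse.CONTINUE
        : HeartbeatResponse.SHOULD_DIE;
  }
}
{code}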

I am also trying to understand how, or even whether, to tie 
CONTAINER_REMOTE_LAUNCH and CONTAINER_REMOTE_CLEANUP together in 
ContainerLauncherImpl.  I am not even sure that it would make a difference.  
In the AM logs I can see that CONTAINER_REMOTE_LAUNCH for the errant attempt 
returns successfully before CONTAINER_REMOTE_CLEANUP is sent.  The race 
appears to be not so much in the AM as in the NM.  The NM will respond that 
the container has been launched, but it has not actually launched it yet.  
Launching appears to be something that can take quite a while to finish.  If 
the CLEANUP arrives while the launch is still in progress, the NM's internal 
state does not reflect that, and so the stop fails.  I think tying the two 
together in the AM would also require tying them even more tightly together 
in the NM.  MAPREDUCE-3240 marks the process as launched by having the 
process itself write out a PID file, but that does not guarantee that, once 
NM.startContainer has returned, NM.stopContainer will actually stop that 
container.  It would require the NM to mark a container as being launched 
before startContainer completes, and it would require stopContainer to mark 
the container as needing to be stopped if the launch is still in progress.
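
To make that concrete, the NM would need something along the lines of the 
sketch below.  This is not real NM code, just an outline of the state 
tracking I mean: startContainer records the launch before doing the slow 
part, stopContainer on an in-flight launch flags it to be torn down instead 
of failing, and the launch thread checks that flag when it finishes.

{code}
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only -- not the real NM code, just the shape of the
// startContainer/stopContainer atomicity fix described above.
public class ContainerLaunchTracker {

  enum State { LAUNCHING, RUNNING, STOP_REQUESTED, STOPPED }

  private final Map<String, State> states = new HashMap<String, State>();

  // startContainer path: record the launch before doing the (slow) work of
  // actually starting the process.
  public synchronized void launchStarted(String containerId) {
    states.put(containerId, State.LAUNCHING);
  }

  // Called by the launch thread once the process is up.  If a stop request
  // arrived while we were launching, report that so the caller tears the
  // container back down instead of leaving it running.
  public synchronized boolean launchFinished(String containerId) {
    if (states.get(containerId) == State.STOP_REQUESTED) {
      states.put(containerId, State.STOPPED);
      return false;   // caller must kill the process it just started
    }
    states.put(containerId, State.RUNNING);
    return true;
  }

  // stopContainer path: if the launch is still in flight, remember that a
  // stop was requested rather than failing because nothing is running yet.
  public synchronized boolean stopRequested(String containerId) {
    if (states.get(containerId) == State.LAUNCHING) {
      states.put(containerId, State.STOP_REQUESTED);
      return false;   // launch thread will handle the stop when it finishes
    }
    states.put(containerId, State.STOPPED);
    return true;      // safe to stop the running container now
  }
}
{code}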

All of this sounds like more work than can get done today.  So I will try to 
test my patch without those changes.  If everything seems to work I will file 
a separate JIRA to address the atomicity issues with 
startContainer/stopContainer in the NM.
                
> Race condition in MR App Master Preemption can cause a deadlock
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-3274
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3274
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>             Fix For: 0.23.0, 0.24.0
>
>
> There appears to be a race condition in the MR App Master in relation to 
> preempting reducers to let a mapper run.  In the particular case that I have 
> been debugging, a reducer was selected for preemption that did not yet have 
> a container assigned to it.  When the container became available, that 
> reducer started running, and the previous TA_KILL event appears to have been 
> ignored.
