[ https://issues.apache.org/jira/browse/MAPREDUCE-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13138438#comment-13138438 ]
Robert Joseph Evans commented on MAPREDUCE-3274:
------------------------------------------------

I have an initial patch that asks containers to die if the AM does not expect them to be alive. I am going to do some more extensive testing on it to be sure it fixes the deadlock before I declare victory and submit it.

I am also trying to understand how, or even whether, to handle CONTAINER_REMOTE_LAUNCH and CONTAINER_REMOTE_CLEANUP as related events in ContainerLauncherImpl. I am not sure it would make a difference: in the AM logs I can see that CONTAINER_REMOTE_LAUNCH for the errant attempt returned successfully before CONTAINER_REMOTE_CLEANUP was sent. The race appears to be not so much in the AM as in the NM. The NM responds that the container has been launched before it has actually launched it, and launching can take quite a while to finish. If the CLEANUP arrives while the launch is still in progress, the NM's internal state does not reflect that a launch is underway, and so the cleanup fails.

Tying the two events together in the AM would also require tying them even closer together in the NM. MAPREDUCE-3240 marks the process as launched by having the process itself write out a PID file, but that does not guarantee that once NM.startContainer has returned, NM.stopContainer will stop that container. It would require the NM to mark a container as launched before startContainer completes, and it would require stopContainer to mark a container as needing to be stopped if its launch is still in progress. All of this sounds like more work than can get done today, so I will try to test my patch without those changes. If everything seems to work, I will file a separate JIRA to address the atomicity issues with startContainer/stopContainer in the NM.
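To make the proposed NM-side fix concrete, here is a minimal, hypothetical sketch (these are not Hadoop's actual NM classes; the class, enum, and method names are all assumptions made for illustration) of a per-container state machine in which a stop request that arrives mid-launch marks the container as needing to be stopped, instead of failing the way the current stopContainer path does:

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical per-container state machine. A stop arriving while the
// (slow) launch is still in progress is recorded as STOP_REQUESTED and
// honored when the launch finishes, rather than failing outright.
class ContainerState {
    enum State { NEW, LAUNCHING, RUNNING, STOP_REQUESTED, STOPPED }

    private final AtomicReference<State> state = new AtomicReference<>(State.NEW);

    // Mark the container as launching BEFORE the slow launch work begins,
    // so a concurrent stop request can see that a launch is underway.
    void beginLaunch() {
        state.compareAndSet(State.NEW, State.LAUNCHING);
    }

    // Called when the launch work completes. Returns true if the container
    // is now running; returns false (and stops it) if a stop request
    // arrived while the launch was in progress.
    boolean finishLaunch() {
        if (state.compareAndSet(State.LAUNCHING, State.RUNNING)) {
            return true;              // launch won the race; container runs
        }
        state.set(State.STOPPED);     // stop arrived mid-launch; honor it
        return false;
    }

    // The stopContainer analogue: never fails just because the launch has
    // not finished yet.
    void requestStop() {
        if (state.compareAndSet(State.RUNNING, State.STOPPED)) return;
        if (state.compareAndSet(State.LAUNCHING, State.STOP_REQUESTED)) return;
        state.compareAndSet(State.NEW, State.STOPPED);
    }

    State get() {
        return state.get();
    }
}
```

The key design point, as described above, is that startContainer must publish the LAUNCHING state before returning, so stopContainer always has something consistent to act on no matter when the CLEANUP event lands.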
> Race condition in MR App Master Preemption can cause a deadlock
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-3274
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3274
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>             Fix For: 0.23.0, 0.24.0
>
>
> There appears to be a race condition in the MR App Master in relation to preempting reducers to let a mapper run. In the particular case I have been debugging, a reducer was selected for preemption that did not have a container assigned to it yet. When the container became available, that reduce started running and the previous TA_KILL event appears to have been ignored.