[ https://issues.apache.org/jira/browse/MAPREDUCE-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137182#comment-13137182 ]
Robert Joseph Evans commented on MAPREDUCE-3274:
------------------------------------------------

Yes, the JVM thing was a red herring. The issue is that on the AM the events were processed in the following order:

    CONTAINER_REMOTE_LAUNCH
    TA_KILL

But on the NM they were processed in reverse:

    Stop Container Request (Error)
    Start Container Request (Success)

The Stop Request was processed 4 ms before the Start Request was.

I need to read through the code some more to understand how to handle this. My gut feeling is that we need a way to handle an error in a Stop Container Request. We may need an event sent back indicating that the TA_KILL failed. Alternatively, instead of sending an event back, we could retry the stop a few times before giving up.

Also, the container launched and started talking to the App Master, requesting something to do. The App Master always responded that it had nothing for the container to do. It might be good if the code that tells the container what to do knew about killed attempts, so that if a killed attempt asks for something to do, it is told to die. This seems like a good way to prevent this type of error in the future.

> Race condition in MR App Master Preemtion can cause a dead lock
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-3274
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3274
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, scheduler
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>             Fix For: 0.23.0, 0.24.0
>
>
> There appears to be a race condition in the MR App Master in relation to
> preempting reducers to let a mapper run. In the particular case that I have
> been debugging a reducer was selected for preemption that did not have a
> container assigned to it yet. When the container became available that reduce
> started running and the previous TA_KILL event appears to have been ignored.
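
The last suggestion in the comment above (tell a killed attempt to die when it asks for work) could look roughly like the following. This is only a minimal sketch to illustrate the idea, not the actual App Master code: the class KilledAttemptTracker, the Reply enum, and all method names are hypothetical, and attempt IDs are plain strings for simplicity.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; none of these names come from the Hadoop code base.
class KilledAttemptTracker {

    /** What the App Master answers when a container asks for work. */
    enum Reply { RUN_TASK, NOTHING_TO_DO, DIE }

    private final Set<String> killed = new HashSet<>();
    private final Set<String> runnable = new HashSet<>();

    /** Record a kill, even if no container has been launched for the attempt yet. */
    synchronized void markKilled(String attemptId) {
        killed.add(attemptId);
        runnable.remove(attemptId);
    }

    /** Register an attempt as ready to run, unless it was already killed. */
    synchronized void markRunnable(String attemptId) {
        if (!killed.contains(attemptId)) {
            runnable.add(attemptId);
        }
    }

    /** Called when a launched container asks the App Master what to do. */
    synchronized Reply onContainerAsksForWork(String attemptId) {
        if (killed.contains(attemptId)) {
            // The stop request lost the race with the start request:
            // tell the container to exit instead of leaving it idle forever.
            return Reply.DIE;
        }
        return runnable.contains(attemptId) ? Reply.RUN_TASK : Reply.NOTHING_TO_DO;
    }
}
```

Because the kill is remembered independently of whether the container exists yet, the order in which the NM processes the stop and start requests no longer matters for this path: a late-starting container for a killed attempt gets DIE on its first request.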