[ https://issues.apache.org/jira/browse/MAPREDUCE-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13136445#comment-13136445 ]
Robert Joseph Evans commented on MAPREDUCE-3274:
------------------------------------------------

OK, so it is a race condition.

{noformat}
attempt_1319242394842_0065_m_000000_0 is Launched (STATE RUNNING)
Many other reducers are launched, filling up the queue's capacity
attempt_1319242394842_0065_r_000008_0 is in the UNASSIGNED state waiting to be scheduled
attempt_1319242394842_0065_m_000000_0 is killed for going over its memory limit
attempt_1319242394842_0065_m_000000_0 is cleaned up and a replacement attempt_1319242394842_0065_m_000000_1 is added to be scheduled
attempt_1319242394842_0065_r_000008_0 gets a container and goes to the ASSIGNED state.
Preemption is triggered.
attempt_1319242394842_0065_r_000008_0 is selected and is sent a TA_KILL event (the History Log ignores the event because it has not written out a START event for attempt_1319242394842_0065_r_000008_0 yet)
attempt_1319242394842_0065_r_000008_0 transitions to KILLED, going through several other states
attempt_1319242394842_0065_r_000008_1 is created to replace attempt_1319242394842_0065_r_000008_0 and moves to UNASSIGNED state
Processing attempt_1319242394842_0065_r_000008_0 of type TA_CONTAINER_LAUNCHED (The container for the killed task is now launched)
JVM with ID : jvm_1319242394842_0065_r_000008 asked for a task
JVM with ID: jvm_1319242394842_0065_r_000008 given task: attempt_1319242394842_0065_r_000004_0
{noformat}

So even though attempt_1319242394842_0065_r_000008_0 was killed, its container, when it finally showed up, was given to a different reduce attempt, so the kill did not end up freeing any resources at all.
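The ordering in the log above can be replayed with a small toy model. This is only a sketch, not Hadoop code: the class `ToyAppMaster`, its methods, the `release_if_owner_killed` flag, and the shortened attempt ids are all hypothetical names made up for illustration. The flag models one possible guard (hand the container back when the attempt it was allocated for is already KILLED); it is an assumption, not necessarily the change made for this issue.

```python
from enum import Enum, auto


class State(Enum):
    UNASSIGNED = auto()
    ASSIGNED = auto()
    RUNNING = auto()
    KILLED = auto()


class ToyAppMaster:
    """Hypothetical toy model of the event ordering; not Hadoop code."""

    def __init__(self, release_if_owner_killed):
        # Assumed guard: return the container when its attempt is already dead.
        self.release_if_owner_killed = release_if_owner_killed
        self.state = {}     # attempt id -> State
        self.waiting = []   # attempts a JVM asking for work may be handed
        self.released = []  # containers given back to the scheduler

    def add_attempt(self, attempt):
        self.state[attempt] = State.UNASSIGNED
        self.waiting.append(attempt)

    def assign_container(self, attempt):
        self.state[attempt] = State.ASSIGNED

    def kill(self, attempt):
        # TA_KILL: the attempt becomes KILLED, but the container that was
        # assigned to it has not been launched yet.
        self.state[attempt] = State.KILLED
        if attempt in self.waiting:
            self.waiting.remove(attempt)

    def container_launched(self, owner):
        # TA_CONTAINER_LAUNCHED for `owner`, after which the JVM asks for a task.
        if self.state[owner] is State.KILLED:
            if self.release_if_owner_killed:
                self.released.append(owner)
                return None
            # Racy path: the JVM is handed whichever attempt is waiting,
            # so the preemption frees nothing.
            task = self.waiting.pop(0) if self.waiting else None
        else:
            self.waiting.remove(owner)
            task = owner
        if task is not None:
            self.state[task] = State.RUNNING
        return task


def replay(release_if_owner_killed):
    """Replay the reducer part of the log with shortened attempt ids."""
    am = ToyAppMaster(release_if_owner_killed)
    am.add_attempt("r_000004_0")
    am.add_attempt("r_000008_0")
    am.assign_container("r_000008_0")  # goes to ASSIGNED
    am.kill("r_000008_0")              # preemption selects it
    am.add_attempt("r_000008_1")       # replacement, UNASSIGNED
    task = am.container_launched("r_000008_0")
    return task, am.released
```

With the guard off, `replay(False)` hands the late container to `r_000004_0` and releases nothing, which matches the last two log lines. With the guard on, `replay(True)` runs no task and the container is returned, so the pending map attempt would have room to be scheduled. The model ignores how `r_000004_0` would normally obtain its own container; it only illustrates the ordering problem.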
> Race condition in MR App Master Preemtion can cause a dead lock
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-3274
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3274
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, scheduler
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>             Fix For: 0.23.0, 0.24.0
>
>
> There appears to be a race condition in the MR App Master in relation to
> preempting reducers to let a mapper run. In the particular case that I have
> been debugging a reducer was selected for preemption that did not have a
> container assigned to it yet. When the container became available that reduce
> started running and the previous TA_KILL event appears to have been ignored.