[ https://issues.apache.org/jira/browse/MAPREDUCE-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137197#comment-13137197 ]
Robert Joseph Evans commented on MAPREDUCE-3274:
------------------------------------------------

Of course. From the AM logs:

{noformat}
m_0 NEW -> SCHEDULED
r_8 NEW -> SCHEDULED
m_0_0 NEW -> UNASSIGNED
r_8_0 NEW -> UNASSIGNED
cont_1_2 = m_0_0
m_0_0 UNASSIGNED -> ASSIGNED
CONTAINER_REMOTE_LAUNCH for m_0_0
TA_CONTAINER_LAUNCHED for m_0_0
m_0_0 ASSIGNED -> RUNNING
m_0 SCHEDULED -> RUNNING
jvm_m_2 = m_0_0
m_0_0 RUNNING -> FAIL_CONTAINER_CLEANUP
m_0_0 FAILED_CONTAINER_CLEANUP -> FAILED_TASK_CLEANUP
m_0_0 FAILED_TASK_CLEANUP -> FAILED
m_0_1 NEW -> UNASSIGNED
cont_1_11 = r_8_0
r_8_0 UNASSIGNED -> ASSIGNED
CONTAINER_REMOTE_LAUNCH for r_8_0
preempting r_8_0
TA_KILL for r_8_0
r_8_0 ASSIGNED -> KILL_CONTAINER_CLEANUP
CONTAINER_REMOTE_CLEANUP for r_8_0
TA_CONTAINER_CLEANED for r_8_0
r_8_0 KILL_CONTAINER_CLEANUP -> KILL_TASK_CLEANUP
r_8_0 KILL_TASK_CLEANUP -> KILLED
r_8_1 NEW -> UNASSIGNED
****** r_8_0 TA_CONTAINER_LAUNCHED ******
{noformat}

Inside the task logs for cont_1_11, the container constantly calls getTask and has null returned to it.

NM logs for cont_1_11 (scrubbed a bit):

{noformat}
2011-10-22 09:38:06,137 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Trying to stop unknown container container_1319242394842_0065_01_000011
2011-10-22 09:38:06,138 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=container_1319242394842_0065_01_000008 OPERATION=Stop Container Request TARGET=ContainerManagerImpl RESULT=FAILURE DESCRIPTION=Trying to stop unknown container! APPID=application_1319242394842_0065 CONTAINERID=container_1319242394842_0065_01_000011
2011-10-22 09:38:06,142 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hadoop OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1319242394842_0065 CONTAINERID=container_1319242394842_0065_01_000011
2011-10-22 09:38:06,142 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Adding container_1319242394842_0065_01_000011 to application application_1319242394842_0065
2011-10-22 09:38:06,142 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1319242394842_0065_01_000011 of type INIT_CONTAINER
2011-10-22 09:38:06,143 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1319242394842_0065_01_000011 transitioned from NEW to LOCALIZING
2011-10-22 09:38:06,143 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1319242394842_0065_01_000011 of type RESOURCE_LOCALIZED
2011-10-22 09:38:06,143 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1319242394842_0065_01_000011 of type RESOURCE_LOCALIZED
2011-10-22 09:38:06,143 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1319242394842_0065_01_000011 transitioned from LOCALIZING to LOCALIZED
2011-10-22 09:38:06,273 INFO org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: launchContainer: [container-executor, hadoop, 1, application_1319242394842_0065, container_1319242394842_0065_01_000011, application_1319242394842_0065/container_1319242394842_0065_01_000011, application_1319242394842_0065/container_1319242394842_0065_01_000011/task.sh, container_1319242394842_0065_01_000011/container_1319242394842_0065_01_000011.tokens]
2011-10-22 09:38:06,305 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1319242394842_0065_01_000011 of type CONTAINER_LAUNCHED
2011-10-22 09:38:06,305 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1319242394842_0065_01_000011 transitioned from LOCALIZED to RUNNING
2011-10-22 09:38:07,613 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1319242394842_0065_01_000011
2011-10-22 09:38:07,658 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 5856 for container-id container_1319242394842_0065_01_000011 : Virtual 1246031872 bytes, limit : 2147483648 bytes; Physical 50659328 bytes, limit -1 bytes
{noformat}

The last line just repeats until the job is killed.

> Race condition in MR App Master preemption can cause a deadlock
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-3274
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3274
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>             Fix For: 0.23.0, 0.24.0
>
> There appears to be a race condition in the MR App Master in relation to preempting reducers to let a mapper run. In the particular case that I have been debugging, a reducer was selected for preemption that did not have a container assigned to it yet. When the container became available, that reduce started running and the previous TA_KILL event appears to have been ignored.
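To make the event ordering easier to follow, here is a minimal, hypothetical sketch of the race (this is NOT Hadoop code; the state and event names mirror the AM log above, the cleanup states are collapsed into a single KILLED state, and the transition table is an assumption for illustration). The point it shows: once TA_KILL lands while the attempt is still ASSIGNED, the attempt reaches KILLED, and a late TA_CONTAINER_LAUNCHED has no registered transition, so it is dropped even though the container really did start on the NM.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical, simplified model of a task-attempt state machine.
// Names mirror the AM log above; this is illustrative only.
public class PreemptionRaceSketch {

    enum State { UNASSIGNED, ASSIGNED, RUNNING, KILLED }
    enum Event { ASSIGN, TA_KILL, TA_CONTAINER_LAUNCHED }

    static final Map<State, Map<Event, State>> TRANSITIONS = new EnumMap<>(State.class);
    static {
        TRANSITIONS.put(State.UNASSIGNED, Map.of(Event.ASSIGN, State.ASSIGNED));
        // From ASSIGNED the attempt can either launch or be killed (preemption).
        TRANSITIONS.put(State.ASSIGNED, Map.of(
                Event.TA_CONTAINER_LAUNCHED, State.RUNNING,
                Event.TA_KILL, State.KILLED)); // cleanup states collapsed to KILLED
        // KILLED is terminal: no transition is registered for
        // TA_CONTAINER_LAUNCHED, so a late launch notification is dropped.
        TRANSITIONS.put(State.KILLED, Map.of());
        TRANSITIONS.put(State.RUNNING, Map.of());
    }

    static State step(State s, Event e) {
        State next = TRANSITIONS.getOrDefault(s, Map.of()).get(e);
        if (next == null) {
            System.out.println("Ignoring " + e + " in state " + s); // the symptom
            return s; // event dropped; AM-side state unchanged
        }
        return next;
    }

    public static void main(String[] args) {
        // Event order seen for r_8_0: the kill arrives while the attempt is
        // still ASSIGNED, then the container-launch notification arrives late.
        State s = State.UNASSIGNED;
        s = step(s, Event.ASSIGN);                // cont_1_11 = r_8_0
        s = step(s, Event.TA_KILL);               // preempting r_8_0
        s = step(s, Event.TA_CONTAINER_LAUNCHED); // ****** late launch ******
        // The AM now believes the attempt is KILLED, but on the NM the
        // container really did start (the stop request even arrived before
        // the start request and failed with "unknown container"). The running
        // container polls getTask(), gets null forever, and its slot is never
        // freed: the deadlock described in this issue.
        System.out.println("final AM state: " + s);
    }
}
```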