Robert Joseph Evans created MAPREDUCE-4833:
----------------------------------------------

             Summary: Task can get stuck in FAIL_CONTAINER_CLEANUP
                 Key: MAPREDUCE-4833
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4833
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: applicationmaster, mrv2
    Affects Versions: 0.23.5
            Reporter: Robert Joseph Evans
            Priority: Critical


If an NM goes down and the AM still tries to launch a container on it, the 
ContainerLauncherImpl can get stuck waiting on an RPC timeout.  At the same 
time, the RM may notice that the NM has gone away and inform the AM, which 
triggers a TA_FAILMSG.  If the TA_FAILMSG arrives at the TaskAttemptImpl before 
the TA_CONTAINER_LAUNCH_FAILED message, the task attempt will try to kill the 
container, but the ContainerLauncherImpl will not send back a 
TA_CONTAINER_CLEANED event, leaving the attempt stuck in FAIL_CONTAINER_CLEANUP.
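
The ordering problem described above can be sketched with a toy state machine.
This is not the real TaskAttemptImpl code. The class name StuckAttemptSketch,
the ASSIGNED and FAILED states, and the reduced transition table are
assumptions made for illustration. Only FAIL_CONTAINER_CLEANUP and the three
event names come from this report. The sketch also assumes that
TA_CONTAINER_LAUNCH_FAILED is ignored once the attempt is already waiting for
cleanup.

```java
public class StuckAttemptSketch {
    // ASSIGNED and FAILED are hypothetical placeholder states for this sketch.
    public enum State { ASSIGNED, FAIL_CONTAINER_CLEANUP, FAILED }

    // Event names as used in the report.
    public enum Event { TA_FAILMSG, TA_CONTAINER_LAUNCH_FAILED, TA_CONTAINER_CLEANED }

    /** Simplified transition table; the real one has many more states and events. */
    public static State transition(State state, Event event) {
        switch (state) {
            case ASSIGNED:
                if (event == Event.TA_FAILMSG) {
                    // The attempt asks the launcher to kill the container and waits.
                    return State.FAIL_CONTAINER_CLEANUP;
                }
                if (event == Event.TA_CONTAINER_LAUNCH_FAILED) {
                    return State.FAILED;
                }
                return state;
            case FAIL_CONTAINER_CLEANUP:
                // Assumption: only TA_CONTAINER_CLEANED moves the attempt on.
                if (event == Event.TA_CONTAINER_CLEANED) {
                    return State.FAILED;
                }
                return state;
            default:
                return state;
        }
    }

    /** Applies the events in order and returns the final state. */
    public static State run(State start, Event... events) {
        State current = start;
        for (Event event : events) {
            current = transition(current, event);
        }
        return current;
    }

    public static void main(String[] args) {
        // Problem order: TA_FAILMSG first, launch failure later, no cleaned event.
        System.out.println(run(State.ASSIGNED,
                Event.TA_FAILMSG, Event.TA_CONTAINER_LAUNCH_FAILED));
        // Benign order: the launch failure is seen first.
        System.out.println(run(State.ASSIGNED,
                Event.TA_CONTAINER_LAUNCH_FAILED, Event.TA_FAILMSG));
    }
}
```

In this sketch the first ordering ends in FAIL_CONTAINER_CLEANUP and never
leaves it, while the second ends in FAILED. A fix would need to guarantee that
some event still moves the attempt out of FAIL_CONTAINER_CLEANUP in the first
ordering.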
