Robert Joseph Evans created MAPREDUCE-4833:
----------------------------------------------
Summary: Task can get stuck in FAIL_CONTAINER_CLEANUP
Key: MAPREDUCE-4833
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4833
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: applicationmaster, mrv2
Affects Versions: 0.23.5
Reporter: Robert Joseph Evans
Priority: Critical
If an NM goes down and the AM still tries to launch a container on it, the
ContainerLauncherImpl can get stuck waiting on an RPC timeout. Meanwhile, the RM
may notice that the NM has gone away and inform the AM, which triggers a
TA_FAILMSG. If the TA_FAILMSG arrives at the TaskAttemptImpl before the
TA_CONTAINER_LAUNCH_FAILED message, the task attempt will try to kill the
container, but the ContainerLauncherImpl will not send back a
TA_CONTAINER_CLEANED event, leaving the attempt stuck in FAIL_CONTAINER_CLEANUP.
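
The ordering problem can be illustrated with a toy state machine. This is only
a sketch, not the actual TaskAttemptImpl code: the class name
StuckAttemptSketch, the reduced set of three states, and the transition table
are all assumptions made for illustration. In particular, it assumes that
FAIL_CONTAINER_CLEANUP ignores every event except TA_CONTAINER_CLEANED, and it
models the missing reply from the ContainerLauncherImpl simply by never
delivering that event.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of the race; not the real Hadoop classes.
public class StuckAttemptSketch {
    enum State { ASSIGNED, FAIL_CONTAINER_CLEANUP, FAILED }
    enum Event { TA_FAILMSG, TA_CONTAINER_LAUNCH_FAILED, TA_CONTAINER_CLEANED }

    // Assumed transition table: events not listed for a state are ignored.
    static State next(State s, Event e) {
        switch (s) {
            case ASSIGNED:
                if (e == Event.TA_FAILMSG) return State.FAIL_CONTAINER_CLEANUP;
                if (e == Event.TA_CONTAINER_LAUNCH_FAILED) return State.FAILED;
                return s;
            case FAIL_CONTAINER_CLEANUP:
                // Only the cleanup acknowledgement lets the attempt move on.
                if (e == Event.TA_CONTAINER_CLEANED) return State.FAILED;
                return s;
            default:
                return s; // FAILED is terminal in this sketch
        }
    }

    // Feed the events to a fresh attempt in order and return the final state.
    static State run(List<Event> events) {
        State s = State.ASSIGNED;
        for (Event e : events) {
            s = next(s, e);
        }
        return s;
    }

    public static void main(String[] args) {
        // Launch failure arrives first: the attempt fails cleanly.
        System.out.println(run(Arrays.asList(
            Event.TA_CONTAINER_LAUNCH_FAILED, Event.TA_FAILMSG)));
        // TA_FAILMSG arrives first and no TA_CONTAINER_CLEANED ever follows:
        // the attempt never leaves FAIL_CONTAINER_CLEANUP.
        System.out.println(run(Arrays.asList(
            Event.TA_FAILMSG, Event.TA_CONTAINER_LAUNCH_FAILED)));
    }
}
```

In this model the first ordering ends in FAILED and the second ends in
FAIL_CONTAINER_CLEANUP, which matches the stuck attempt described above.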