MR AM hangs when one node goes bad
----------------------------------
Key: MAPREDUCE-3228
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3228
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: applicationmaster, mrv2
Affects Versions: 0.23.0
Reporter: Vinod Kumar Vavilapalli
Priority: Blocker
Fix For: 0.23.0
Found this on one of the gridmix runs, again. One of the nodes went bad while
the job had three containers running on it. Eventually, the AM marked the
tasks as timed out and initiated cleanup of the failed containers via
{{stopContainer()}}. The latter got stuck at the faulty node, so the tasks
remain in the FAIL_CONTAINER_CLEANUP state and the job waits there forever.
Thanks to [~Karams] for helping with this.
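
To illustrate the failure mode described above: a blocking cleanup call made to
an unresponsive node never returns unless the caller bounds the wait. The
following is a minimal sketch of one way to bound such a call, not the fix
applied for this issue. The class name BoundedStop, the method runWithTimeout,
and the simulated hungStop/healthyStop calls are all hypothetical and stand in
for the real {{stopContainer()}} invocation.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Hadoop: bounds the time spent in a blocking call.
public class BoundedStop {

    /**
     * Runs blockingCall on a daemon worker thread and waits at most timeoutMs.
     * Returns true if the call finished normally in time, false otherwise.
     */
    public static boolean runWithTimeout(Runnable blockingCall, long timeoutMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-stop-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<?> future = executor.submit(blockingCall);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck call and give up
            return false;
        } catch (ExecutionException e) {
            return false; // the call itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Simulated stop call against a healthy node: returns immediately.
        Runnable healthyStop = () -> { };
        // Simulated stop call against a faulty node: blocks far longer than the timeout.
        Runnable hungStop = () -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        System.out.println(runWithTimeout(healthyStop, 500)); // prints "true"
        System.out.println(runWithTimeout(hungStop, 200));    // prints "false"
    }
}
```

With a bound like this, a caller could treat a false return as a failed cleanup
and move the task attempt forward instead of waiting indefinitely. Whether the
AM should retry, or simply give up on the container, is a design decision this
sketch does not address.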