NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
----------------------------------------------------------------------------
Key: MAPREDUCE-3738
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3738
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mrv2, nodemanager
Affects Versions: 0.23.1, 0.24.0
Reporter: Jason Lowe
If an AppLogAggregator thread dies unexpectedly (for example, from an uncaught
exception such as the OutOfMemoryError I saw), the nodemanager hangs during
shutdown. The NM calls AppLogAggregatorImpl.join() during shutdown to make sure
log aggregation has completed, and that method waits for an atomic boolean that
the log aggregation thread sets to indicate it has finished. Because the thread
was already killed by the uncaught exception, the boolean is never set, and the
NM hangs during shutdown, logging something like this every second:
2012-01-25 22:20:56,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Waiting for aggregation to complete for application_1326848182580_2806
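
To make the failure mode concrete, here is a minimal standalone sketch of the
pattern described above. It is not the actual AppLogAggregatorImpl code and not
a proposed patch: the class AggregatorJoinSketch, its fields, and the method
names are all hypothetical, and an IllegalStateException stands in for the
OutOfMemoryError. It shows two possible defenses, under those assumptions:
setting the completion flag in a finally block, and having the waiting side
also check whether the worker thread is still alive.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration only; names do not come from the Hadoop source.
public class AggregatorJoinSketch {

    static class Aggregator implements Runnable {
        private final AtomicBoolean finished = new AtomicBoolean(false);
        private final boolean failEarly;       // simulate an uncaught exception
        private final boolean signalInFinally; // set the flag even on failure

        Aggregator(boolean failEarly, boolean signalInFinally) {
            this.failEarly = failEarly;
            this.signalInFinally = signalInFinally;
        }

        @Override
        public void run() {
            try {
                if (failEarly) {
                    // Stand-in for the OutOfMemoryError seen in the report.
                    throw new IllegalStateException("simulated aggregation failure");
                }
                // Flag set only on the normal path: the pattern that can hang.
                if (!signalInFinally) {
                    finished.set(true);
                }
            } finally {
                // Defense 1: always signal completion, even when run() throws.
                if (signalInFinally) {
                    finished.set(true);
                }
            }
        }

        // Returns true if the flag was set, false if the worker died without
        // setting it or the timeout elapsed. The worker must already be started.
        boolean waitForCompletion(Thread worker, long maxWaitMs)
                throws InterruptedException {
            long deadline = System.currentTimeMillis() + maxWaitMs;
            while (!finished.get()) {
                // Defense 2: a dead thread will never set the flag, so stop
                // waiting. Re-check the flag to avoid a race with a thread
                // that set it just before exiting.
                if (!worker.isAlive() && !finished.get()) {
                    return false;
                }
                if (System.currentTimeMillis() >= deadline) {
                    return false;
                }
                Thread.sleep(10);
            }
            return true;
        }
    }

    static boolean runAndJoin(boolean failEarly, boolean signalInFinally) {
        Aggregator agg = new Aggregator(failEarly, signalInFinally);
        Thread worker = new Thread(agg, "sketch-aggregator");
        // Swallow the simulated failure so it does not print a stack trace.
        worker.setUncaughtExceptionHandler((t, e) -> { });
        worker.start();
        try {
            return agg.waitForCompletion(worker, 5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("fail, flag in finally: " + runAndJoin(true, true));
        System.out.println("fail, flag on normal path only: " + runAndJoin(true, false));
        System.out.println("no failure: " + runAndJoin(false, false));
    }
}
```

In this sketch, a waiter without the isAlive() check or the deadline would loop
forever in the second case, which matches the hang described above. Whether
either defense suits the real nodemanager shutdown path is a separate question.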