[ https://issues.apache.org/jira/browse/TEZ-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702420#comment-14702420 ]
Jeff Zhang commented on TEZ-2687: --------------------------------- Upload patch to fix it. [~hitesh] [~bikassaha] Please help review. * Release the held containers before the DAGAppMaster#stopServices is called * Release the container at once if new container is allocated when it is shutting down. Simulate the ATS hang behavior by adding Thread.sleep in HistoryEventHandler#serviceStop Here's the log without this patch. Container will only be released when it is expired. {noformat} 2015-08-19 12:05:01,645 INFO [IPC Server handler 0 on 49920] app.DAGAppMaster: DAGAppMasterShutdownHandler invoked 2015-08-19 12:05:01,645 INFO [IPC Server handler 0 on 49920] app.DAGAppMaster: Handling DAGAppMaster shutdown 2015-08-19 12:05:01,645 INFO [AMShutdownThread] app.DAGAppMaster: Sleeping for 5 seconds before shutting down 2015-08-19 12:05:06,646 INFO [AMShutdownThread] app.DAGAppMaster: Calling stop for all the services 2015-08-19 12:05:06,647 INFO [AMShutdownThread] history.HistoryEventHandler: Stopping HistoryEventHandler 2015-08-19 12:05:23,083 INFO [DelayedContainerManager] rm.YarnTaskSchedulerService: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_1439946425329_0022_01_000003, containerExpiryTime=1439957123078, idleTimeout=20000, taskRequestsCount=0, heldContainers=4, delayedContainers=3, isNew=false 2015-08-19 12:05:23,083 INFO [DelayedContainerManager] rm.YarnTaskSchedulerService: Releasing unused container: container_1439946425329_0022_01_000003 2015-08-19 12:05:23,083 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1439946425329_0022_1][Event:CONTAINER_STOPPED]: containerId=container_1439946425329_0022_01_000003, stoppedTime=1439957123083, exitStatus=0 2015-08-19 12:05:23,083 INFO [Dispatcher thread: Central] container.AMContainerImpl: AMContainer container_1439946425329_0022_01_000003 transitioned from IDLE to STOP_REQUESTED via event C_STOP_REQUEST 2015-08-19 12:05:23,083 INFO [ContainerLauncher #6] launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_STOP_REQUEST 2015-08-19 12:05:23,084 INFO [ContainerLauncher #6] launcher.ContainerLauncherImpl: Sending a stop request to the NM for ContainerId: container_1439946425329_0022_01_000003 2015-08-19 12:05:23,084 INFO [ContainerLauncher #6] impl.ContainerManagementProtocolProxy: Opening proxy : 192.168.3.3:50421 2015-08-19 12:05:23,090 INFO [Dispatcher thread: Central] container.AMContainerImpl: AMContainer container_1439946425329_0022_01_000003 transitioned from STOP_REQUESTED to STOPPING via event C_NM_STOP_SENT 2015-08-19 12:05:23,222 INFO [IPC Server handler 1 on 49919] app.TaskAttemptListenerImpTezDag: Container with id: container_1439946425329_0022_01_000003 is valid, but no longer registered, and will be killed 2015-08-19 12:05:23,373 INFO [AMRM Callback Handler Thread] rm.YarnTaskSchedulerService: Released container completed:container_1439946425329_0022_01_000003 last allocated to task: attempt_1439946425329_0022_1_02_000003_0 2015-08-19 12:05:23,373 INFO [Dispatcher thread: Central] container.AMContainerImpl: Container container_1439946425329_0022_01_000003 exited with diagnostics set to Container failed, exitCode=-100. Container released by application 2015-08-19 12:05:23,373 INFO [Dispatcher thread: Central] container.AMContainerImpl: AMContainer container_1439946425329_0022_01_000003 transitioned from STOPPING to COMPLETED via event C_COMPLETED {noformat} Here's the log with this patch, containers will be released explicitly when shutdown is invoked. {noformat} 2015-08-19 12:07:37,137 INFO [IPC Server handler 0 on 50138] app.DAGAppMaster: DAGAppMasterShutdownHandler invoked 2015-08-19 12:07:37,137 INFO [IPC Server handler 0 on 50138] app.DAGAppMaster: Handling DAGAppMaster shutdown 2015-08-19 12:07:37,138 INFO [AMShutdownThread] app.DAGAppMaster: Sleeping for 5 seconds before shutting down 2015-08-19 12:07:42,139 INFO [AMShutdownThread] app.DAGAppMaster: Calling stop for all the services 2015-08-19 12:07:42,139 INFO [AMShutdownThread] rm.YarnTaskSchedulerService: Realease held containers 2015-08-19 12:07:42,139 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1439946425329_0023_1][Event:CONTAINER_STOPPED]: containerId=container_1439946425329_0023_01_000005, stoppedTime=1439957262139, exitStatus=0 2015-08-19 12:07:42,139 INFO [Dispatcher thread: Central] container.AMContainerImpl: AMContainer container_1439946425329_0023_01_000005 transitioned from IDLE to STOP_REQUESTED via event C_STOP_REQUEST 2015-08-19 12:07:42,139 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1439946425329_0023_1][Event:CONTAINER_STOPPED]: containerId=container_1439946425329_0023_01_000006, stoppedTime=1439957262139, exitStatus=0 2015-08-19 12:07:42,139 INFO [Dispatcher thread: Central] container.AMContainerImpl: AMContainer container_1439946425329_0023_01_000006 transitioned from IDLE to STOP_REQUESTED via event C_STOP_REQUEST 2015-08-19 12:07:42,140 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1439946425329_0023_1][Event:CONTAINER_STOPPED]: containerId=container_1439946425329_0023_01_000003, stoppedTime=1439957262139, exitStatus=0 2015-08-19 12:07:42,140 INFO [Dispatcher thread: Central] container.AMContainerImpl: AMContainer container_1439946425329_0023_01_000003 transitioned from IDLE to STOP_REQUESTED via event C_STOP_REQUEST 2015-08-19 12:07:42,140 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1439946425329_0023_1][Event:CONTAINER_STOPPED]: containerId=container_1439946425329_0023_01_000004, stoppedTime=1439957262140, exitStatus=0 2015-08-19 12:07:42,140 INFO [AMShutdownThread] history.HistoryEventHandler: Stopping HistoryEventHandler 2015-08-19 12:07:42,140 INFO [Dispatcher thread: Central] container.AMContainerImpl: AMContainer container_1439946425329_0023_01_000004 transitioned from IDLE to STOP_REQUESTED via event C_STOP_REQUEST 2015-08-19 12:07:42,140 INFO [ContainerLauncher #6] launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_STOP_REQUEST 2015-08-19 12:07:42,140 INFO [ContainerLauncher #7] launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_STOP_REQUEST 2015-08-19 12:07:42,141 INFO [ContainerLauncher #6] launcher.ContainerLauncherImpl: Sending a stop request to the NM for ContainerId: container_1439946425329_0023_01_000005 2015-08-19 12:07:42,141 INFO [ContainerLauncher #7] launcher.ContainerLauncherImpl: Sending a stop request to the NM for ContainerId: container_1439946425329_0023_01_000006 {noformat} > ATS History shutdown happens before the min-held containers are released > ------------------------------------------------------------------------ > > Key: TEZ-2687 > URL: https://issues.apache.org/jira/browse/TEZ-2687 > Project: Apache Tez > Issue Type: Bug > Affects Versions: 0.6.2, 0.8.0, 0.7.1 > Reporter: Gopal V > Assignee: Jeff Zhang > Attachments: TEZ-2687-1.patch > > > When ATS goes into a GC pause under heavy loads and while it recovers, each > Tez AM holds onto a few containers even though it is shutting down and will > never accept any more DAGs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)