[ 
https://issues.apache.org/jira/browse/TEZ-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702420#comment-14702420
 ] 

Jeff Zhang commented on TEZ-2687:
---------------------------------

Upload patch to fix it.  [~hitesh] [~bikassaha] Please help review. 
* Release the held containers before the DAGAppMaster#stopServices is called
* Release the container at once if new container is allocated when it is 
shutting down.

Simulate the ATS hang behavior by adding Thread.sleep in 
HistoryEventHandler#serviceStop
Here's the log without this patch. Container will only be released when it is 
expired. 
{noformat}
2015-08-19 12:05:01,645 INFO [IPC Server handler 0 on 49920] app.DAGAppMaster: 
DAGAppMasterShutdownHandler invoked
2015-08-19 12:05:01,645 INFO [IPC Server handler 0 on 49920] app.DAGAppMaster: 
Handling DAGAppMaster shutdown
2015-08-19 12:05:01,645 INFO [AMShutdownThread] app.DAGAppMaster: Sleeping for 
5 seconds before shutting down
2015-08-19 12:05:06,646 INFO [AMShutdownThread] app.DAGAppMaster: Calling stop 
for all the services
2015-08-19 12:05:06,647 INFO [AMShutdownThread] history.HistoryEventHandler: 
Stopping HistoryEventHandler
2015-08-19 12:05:23,083 INFO [DelayedContainerManager] 
rm.YarnTaskSchedulerService: No taskRequests. Container's idle timeout delay 
expired or is new. Releasing container, 
containerId=container_1439946425329_0022_01_000003, 
containerExpiryTime=1439957123078, idleTimeout=20000, taskRequestsCount=0, 
heldContainers=4, delayedContainers=3, isNew=false
2015-08-19 12:05:23,083 INFO [DelayedContainerManager] 
rm.YarnTaskSchedulerService: Releasing unused container: 
container_1439946425329_0022_01_000003
2015-08-19 12:05:23,083 INFO [Dispatcher thread: Central] 
history.HistoryEventHandler: 
[HISTORY][DAG:dag_1439946425329_0022_1][Event:CONTAINER_STOPPED]: 
containerId=container_1439946425329_0022_01_000003, stoppedTime=1439957123083, 
exitStatus=0
2015-08-19 12:05:23,083 INFO [Dispatcher thread: Central] 
container.AMContainerImpl: AMContainer container_1439946425329_0022_01_000003 
transitioned from IDLE to STOP_REQUESTED via event C_STOP_REQUEST
2015-08-19 12:05:23,083 INFO [ContainerLauncher #6] 
launcher.ContainerLauncherImpl: Processing the event EventType: 
CONTAINER_STOP_REQUEST
2015-08-19 12:05:23,084 INFO [ContainerLauncher #6] 
launcher.ContainerLauncherImpl: Sending a stop request to the NM for 
ContainerId: container_1439946425329_0022_01_000003
2015-08-19 12:05:23,084 INFO [ContainerLauncher #6] 
impl.ContainerManagementProtocolProxy: Opening proxy : 192.168.3.3:50421
2015-08-19 12:05:23,090 INFO [Dispatcher thread: Central] 
container.AMContainerImpl: AMContainer container_1439946425329_0022_01_000003 
transitioned from STOP_REQUESTED to STOPPING via event C_NM_STOP_SENT
2015-08-19 12:05:23,222 INFO [IPC Server handler 1 on 49919] 
app.TaskAttemptListenerImpTezDag: Container with id: 
container_1439946425329_0022_01_000003 is valid, but no longer registered, and 
will be killed
2015-08-19 12:05:23,373 INFO [AMRM Callback Handler Thread] 
rm.YarnTaskSchedulerService: Released container 
completed:container_1439946425329_0022_01_000003 last allocated to task: 
attempt_1439946425329_0022_1_02_000003_0
2015-08-19 12:05:23,373 INFO [Dispatcher thread: Central] 
container.AMContainerImpl: Container container_1439946425329_0022_01_000003 
exited with diagnostics set to Container failed, exitCode=-100. Container 
released by application
2015-08-19 12:05:23,373 INFO [Dispatcher thread: Central] 
container.AMContainerImpl: AMContainer container_1439946425329_0022_01_000003 
transitioned from STOPPING to COMPLETED via event C_COMPLETED
{noformat}

Here's the log with this patch, containers will be released explicitly when 
shutdown is invoked. 
{noformat}
2015-08-19 12:07:37,137 INFO [IPC Server handler 0 on 50138] app.DAGAppMaster: 
DAGAppMasterShutdownHandler invoked
2015-08-19 12:07:37,137 INFO [IPC Server handler 0 on 50138] app.DAGAppMaster: 
Handling DAGAppMaster shutdown
2015-08-19 12:07:37,138 INFO [AMShutdownThread] app.DAGAppMaster: Sleeping for 
5 seconds before shutting down
2015-08-19 12:07:42,139 INFO [AMShutdownThread] app.DAGAppMaster: Calling stop 
for all the services
2015-08-19 12:07:42,139 INFO [AMShutdownThread] rm.YarnTaskSchedulerService: 
Realease held containers
2015-08-19 12:07:42,139 INFO [Dispatcher thread: Central] 
history.HistoryEventHandler: 
[HISTORY][DAG:dag_1439946425329_0023_1][Event:CONTAINER_STOPPED]: 
containerId=container_1439946425329_0023_01_000005, stoppedTime=1439957262139, 
exitStatus=0
2015-08-19 12:07:42,139 INFO [Dispatcher thread: Central] 
container.AMContainerImpl: AMContainer container_1439946425329_0023_01_000005 
transitioned from IDLE to STOP_REQUESTED via event C_STOP_REQUEST
2015-08-19 12:07:42,139 INFO [Dispatcher thread: Central] 
history.HistoryEventHandler: 
[HISTORY][DAG:dag_1439946425329_0023_1][Event:CONTAINER_STOPPED]: 
containerId=container_1439946425329_0023_01_000006, stoppedTime=1439957262139, 
exitStatus=0
2015-08-19 12:07:42,139 INFO [Dispatcher thread: Central] 
container.AMContainerImpl: AMContainer container_1439946425329_0023_01_000006 
transitioned from IDLE to STOP_REQUESTED via event C_STOP_REQUEST
2015-08-19 12:07:42,140 INFO [Dispatcher thread: Central] 
history.HistoryEventHandler: 
[HISTORY][DAG:dag_1439946425329_0023_1][Event:CONTAINER_STOPPED]: 
containerId=container_1439946425329_0023_01_000003, stoppedTime=1439957262139, 
exitStatus=0
2015-08-19 12:07:42,140 INFO [Dispatcher thread: Central] 
container.AMContainerImpl: AMContainer container_1439946425329_0023_01_000003 
transitioned from IDLE to STOP_REQUESTED via event C_STOP_REQUEST
2015-08-19 12:07:42,140 INFO [Dispatcher thread: Central] 
history.HistoryEventHandler: 
[HISTORY][DAG:dag_1439946425329_0023_1][Event:CONTAINER_STOPPED]: 
containerId=container_1439946425329_0023_01_000004, stoppedTime=1439957262140, 
exitStatus=0
2015-08-19 12:07:42,140 INFO [AMShutdownThread] history.HistoryEventHandler: 
Stopping HistoryEventHandler
2015-08-19 12:07:42,140 INFO [Dispatcher thread: Central] 
container.AMContainerImpl: AMContainer container_1439946425329_0023_01_000004 
transitioned from IDLE to STOP_REQUESTED via event C_STOP_REQUEST
2015-08-19 12:07:42,140 INFO [ContainerLauncher #6] 
launcher.ContainerLauncherImpl: Processing the event EventType: 
CONTAINER_STOP_REQUEST
2015-08-19 12:07:42,140 INFO [ContainerLauncher #7] 
launcher.ContainerLauncherImpl: Processing the event EventType: 
CONTAINER_STOP_REQUEST
2015-08-19 12:07:42,141 INFO [ContainerLauncher #6] 
launcher.ContainerLauncherImpl: Sending a stop request to the NM for 
ContainerId: container_1439946425329_0023_01_000005
2015-08-19 12:07:42,141 INFO [ContainerLauncher #7] 
launcher.ContainerLauncherImpl: Sending a stop request to the NM for 
ContainerId: container_1439946425329_0023_01_000006
{noformat}

> ATS History shutdown happens before the min-held containers are released
> ------------------------------------------------------------------------
>
>                 Key: TEZ-2687
>                 URL: https://issues.apache.org/jira/browse/TEZ-2687
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.2, 0.8.0, 0.7.1
>            Reporter: Gopal V
>            Assignee: Jeff Zhang
>         Attachments: TEZ-2687-1.patch
>
>
> When ATS goes into a GC pause under heavy loads and while it recovers, each 
> Tez AM holds onto a few containers even though it is shutting down and will 
> never accept any more DAGs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to