[
https://issues.apache.org/jira/browse/TEZ-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702420#comment-14702420
]
Jeff Zhang commented on TEZ-2687:
---------------------------------
Upload patch to fix it. [~hitesh] [~bikassaha] Please help review.
* Release the held containers before the DAGAppMaster#stopServices is called
* Release the container at once if new container is allocated when it is
shutting down.
Simulate the ATS hang behavior by adding Thread.sleep in
HistoryEventHandler#serviceStop
Here's the log without this patch. Container will only be released when it is
expired.
{noformat}
2015-08-19 12:05:01,645 INFO [IPC Server handler 0 on 49920] app.DAGAppMaster:
DAGAppMasterShutdownHandler invoked
2015-08-19 12:05:01,645 INFO [IPC Server handler 0 on 49920] app.DAGAppMaster:
Handling DAGAppMaster shutdown
2015-08-19 12:05:01,645 INFO [AMShutdownThread] app.DAGAppMaster: Sleeping for
5 seconds before shutting down
2015-08-19 12:05:06,646 INFO [AMShutdownThread] app.DAGAppMaster: Calling stop
for all the services
2015-08-19 12:05:06,647 INFO [AMShutdownThread] history.HistoryEventHandler:
Stopping HistoryEventHandler
2015-08-19 12:05:23,083 INFO [DelayedContainerManager]
rm.YarnTaskSchedulerService: No taskRequests. Container's idle timeout delay
expired or is new. Releasing container,
containerId=container_1439946425329_0022_01_000003,
containerExpiryTime=1439957123078, idleTimeout=20000, taskRequestsCount=0,
heldContainers=4, delayedContainers=3, isNew=false
2015-08-19 12:05:23,083 INFO [DelayedContainerManager]
rm.YarnTaskSchedulerService: Releasing unused container:
container_1439946425329_0022_01_000003
2015-08-19 12:05:23,083 INFO [Dispatcher thread: Central]
history.HistoryEventHandler:
[HISTORY][DAG:dag_1439946425329_0022_1][Event:CONTAINER_STOPPED]:
containerId=container_1439946425329_0022_01_000003, stoppedTime=1439957123083,
exitStatus=0
2015-08-19 12:05:23,083 INFO [Dispatcher thread: Central]
container.AMContainerImpl: AMContainer container_1439946425329_0022_01_000003
transitioned from IDLE to STOP_REQUESTED via event C_STOP_REQUEST
2015-08-19 12:05:23,083 INFO [ContainerLauncher #6]
launcher.ContainerLauncherImpl: Processing the event EventType:
CONTAINER_STOP_REQUEST
2015-08-19 12:05:23,084 INFO [ContainerLauncher #6]
launcher.ContainerLauncherImpl: Sending a stop request to the NM for
ContainerId: container_1439946425329_0022_01_000003
2015-08-19 12:05:23,084 INFO [ContainerLauncher #6]
impl.ContainerManagementProtocolProxy: Opening proxy : 192.168.3.3:50421
2015-08-19 12:05:23,090 INFO [Dispatcher thread: Central]
container.AMContainerImpl: AMContainer container_1439946425329_0022_01_000003
transitioned from STOP_REQUESTED to STOPPING via event C_NM_STOP_SENT
2015-08-19 12:05:23,222 INFO [IPC Server handler 1 on 49919]
app.TaskAttemptListenerImpTezDag: Container with id:
container_1439946425329_0022_01_000003 is valid, but no longer registered, and
will be killed
2015-08-19 12:05:23,373 INFO [AMRM Callback Handler Thread]
rm.YarnTaskSchedulerService: Released container
completed:container_1439946425329_0022_01_000003 last allocated to task:
attempt_1439946425329_0022_1_02_000003_0
2015-08-19 12:05:23,373 INFO [Dispatcher thread: Central]
container.AMContainerImpl: Container container_1439946425329_0022_01_000003
exited with diagnostics set to Container failed, exitCode=-100. Container
released by application
2015-08-19 12:05:23,373 INFO [Dispatcher thread: Central]
container.AMContainerImpl: AMContainer container_1439946425329_0022_01_000003
transitioned from STOPPING to COMPLETED via event C_COMPLETED
{noformat}
Here's the log with this patch, containers will be released explicitly when
shutdown is invoked.
{noformat}
2015-08-19 12:07:37,137 INFO [IPC Server handler 0 on 50138] app.DAGAppMaster:
DAGAppMasterShutdownHandler invoked
2015-08-19 12:07:37,137 INFO [IPC Server handler 0 on 50138] app.DAGAppMaster:
Handling DAGAppMaster shutdown
2015-08-19 12:07:37,138 INFO [AMShutdownThread] app.DAGAppMaster: Sleeping for
5 seconds before shutting down
2015-08-19 12:07:42,139 INFO [AMShutdownThread] app.DAGAppMaster: Calling stop
for all the services
2015-08-19 12:07:42,139 INFO [AMShutdownThread] rm.YarnTaskSchedulerService:
Realease held containers
2015-08-19 12:07:42,139 INFO [Dispatcher thread: Central]
history.HistoryEventHandler:
[HISTORY][DAG:dag_1439946425329_0023_1][Event:CONTAINER_STOPPED]:
containerId=container_1439946425329_0023_01_000005, stoppedTime=1439957262139,
exitStatus=0
2015-08-19 12:07:42,139 INFO [Dispatcher thread: Central]
container.AMContainerImpl: AMContainer container_1439946425329_0023_01_000005
transitioned from IDLE to STOP_REQUESTED via event C_STOP_REQUEST
2015-08-19 12:07:42,139 INFO [Dispatcher thread: Central]
history.HistoryEventHandler:
[HISTORY][DAG:dag_1439946425329_0023_1][Event:CONTAINER_STOPPED]:
containerId=container_1439946425329_0023_01_000006, stoppedTime=1439957262139,
exitStatus=0
2015-08-19 12:07:42,139 INFO [Dispatcher thread: Central]
container.AMContainerImpl: AMContainer container_1439946425329_0023_01_000006
transitioned from IDLE to STOP_REQUESTED via event C_STOP_REQUEST
2015-08-19 12:07:42,140 INFO [Dispatcher thread: Central]
history.HistoryEventHandler:
[HISTORY][DAG:dag_1439946425329_0023_1][Event:CONTAINER_STOPPED]:
containerId=container_1439946425329_0023_01_000003, stoppedTime=1439957262139,
exitStatus=0
2015-08-19 12:07:42,140 INFO [Dispatcher thread: Central]
container.AMContainerImpl: AMContainer container_1439946425329_0023_01_000003
transitioned from IDLE to STOP_REQUESTED via event C_STOP_REQUEST
2015-08-19 12:07:42,140 INFO [Dispatcher thread: Central]
history.HistoryEventHandler:
[HISTORY][DAG:dag_1439946425329_0023_1][Event:CONTAINER_STOPPED]:
containerId=container_1439946425329_0023_01_000004, stoppedTime=1439957262140,
exitStatus=0
2015-08-19 12:07:42,140 INFO [AMShutdownThread] history.HistoryEventHandler:
Stopping HistoryEventHandler
2015-08-19 12:07:42,140 INFO [Dispatcher thread: Central]
container.AMContainerImpl: AMContainer container_1439946425329_0023_01_000004
transitioned from IDLE to STOP_REQUESTED via event C_STOP_REQUEST
2015-08-19 12:07:42,140 INFO [ContainerLauncher #6]
launcher.ContainerLauncherImpl: Processing the event EventType:
CONTAINER_STOP_REQUEST
2015-08-19 12:07:42,140 INFO [ContainerLauncher #7]
launcher.ContainerLauncherImpl: Processing the event EventType:
CONTAINER_STOP_REQUEST
2015-08-19 12:07:42,141 INFO [ContainerLauncher #6]
launcher.ContainerLauncherImpl: Sending a stop request to the NM for
ContainerId: container_1439946425329_0023_01_000005
2015-08-19 12:07:42,141 INFO [ContainerLauncher #7]
launcher.ContainerLauncherImpl: Sending a stop request to the NM for
ContainerId: container_1439946425329_0023_01_000006
{noformat}
> ATS History shutdown happens before the min-held containers are released
> ------------------------------------------------------------------------
>
> Key: TEZ-2687
> URL: https://issues.apache.org/jira/browse/TEZ-2687
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.2, 0.8.0, 0.7.1
> Reporter: Gopal V
> Assignee: Jeff Zhang
> Attachments: TEZ-2687-1.patch
>
>
> When ATS goes into a GC pause under heavy loads and while it recovers, each
> Tez AM holds onto a few containers even though it is shutting down and will
> never accept any more DAGs.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)