[ 
https://issues.apache.org/jira/browse/TEZ-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor resolved TEZ-4277.
-------------------------------
    Resolution: Fixed

> AsyncDispatcher can hang on serviceStop if the eventhandling thread is in 
> BLOCKED state
> ---------------------------------------------------------------------------------------
>
>                 Key: TEZ-4277
>                 URL: https://issues.apache.org/jira/browse/TEZ-4277
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>             Fix For: 0.10.1
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> While further testing TEZ-4236, I discovered a strange hang in 
> AsyncDispatcher. The symptom was a BLOCKED state event-handling thread. This 
> block is most likely caused by the dispatcher itself, so instead of making 
> yet another sync hack in the dispatcher, or even DrainDispatcher (however 
> they have a common codebase for serviceStop at the moment), I'm about to put 
> a safety check, which solved my issue.
> jstack of a hang (it hangs 
> [here|https://github.com/apache/tez/blob/master/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L157])
> {code}
> "Thread-1212" #1581 prio=5 os_prio=31 tid=0x00007fdb4c628800 nid=0x1b84f in 
> Object.wait() [0x000070000997b000]
> java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1252)
> - locked <0x00000007ae967340> (a java.lang.Thread)
> at java.lang.Thread.join(Thread.java:1326)
> at org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:157)
> at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
> - locked <0x00000007ae642cc8> (a java.lang.Object)
> at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54)
> at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102)
> at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:67)
> at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1933)
> at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:2155)
> - locked <0x00000007ae53d3c0> (a org.apache.tez.dag.app.LocalDAGAppMaster)
> at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
> - locked <0x00000007ae53d5f0> (a java.lang.Object)
> at org.apache.tez.client.LocalClient.close(LocalClient.java:135)
> at org.apache.tez.client.TezClient.stop(TezClient.java:775)
> - locked <0x00000007ade485a0> (a org.apache.tez.client.TezClient)
> at 
> org.apache.tez.test.TestLocalMode.testMultipleClientsWithSession(TestLocalMode.java:147)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {code}
> during the investigation I added some log messages, that lead to this 
> solution:
> 1. there is not a single event in the queue
> 2. event handling thread is in BLOCKED state
> 3. AM is already shut down, but test hangs as LocalClient.stop waits to AM 
> services to be stopped
> {code}
> 2021-01-29 16:16:44,350 INFO  [Thread-2198] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(148)) - AsyncDispatcher is draining to 
> stop, ignoring any new events.
> ...
> 2021-01-29 16:16:45,354 INFO  [Thread-2198] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to 
> drain...current events count: 0, handler thread state: BLOCKED
> 2021-01-29 16:16:46,358 INFO  [Thread-2198] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to 
> drain...current events count: 0, handler thread state: BLOCKED
> 2021-01-29 16:16:47,362 INFO  [Thread-2198] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to 
> drain...current events count: 0, handler thread state: BLOCKED
> ...
> 2021-01-29 16:16:48,365 INFO  [Thread-2198] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to 
> drain...current events count: 0, handler thread state: BLOCKED
> 2021-01-29 16:16:49,349 INFO  [AMShutdownThread] app.DAGAppMaster 
> (DAGAppMaster.java:run(1002)) - Calling stop for all the services
> 2021-01-29 16:16:49,349 INFO  [AMShutdownThread] app.DAGAppMaster 
> (DAGAppMaster.java:run(1013)) - Exiting DAGAppMaster..GoodBye!
> 2021-01-29 16:16:49,370 INFO  [Thread-2198] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to 
> drain...current events count: 0, handler thread state: BLOCKED
> 2021-01-29 16:16:49,370 WARN  [Thread-2198] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(153)) - Cannot wait for drain events 
> anymore, logging current events if any...
> 2021-01-29 16:16:49,370 INFO  [Thread-2198] common.AsyncDispatcher 
> (AsyncDispatcher.java:logCurrentEvents(186)) - Current events in the queue: 0
> 2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(168)) - interrupting event handling thread, 
> state: BLOCKED
> 2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(170)) - interrupted event handling thread, 
> state: BLOCKED
> 2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(172)) - joining event handling thread, 
> state: BLOCKED
> {code}
> under normal circumstances the event handling thread is in WAITING state 
> (which I think is because of queue.take)
> + I also included the timeout logic of YARN-3999 here



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to