[ 
https://issues.apache.org/jira/browse/TEZ-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated TEZ-4277:
------------------------------
    Description: 
While further testing TEZ-4236, I discovered a strange hang in AsyncDispatcher. 
The symptom was a BLOCKED state event-handling thread. This block is most 
likely caused by the dispatcher itself, so instead of making yet another sync 
hack in the dispatcher, or even DrainDispatcher (however they have a common 
codebase for serviceStop at the moment), I'm about to put a safety check, which 
solved my issue.


jstack of a hang (it hangs 
[here|https://github.com/apache/tez/blob/master/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L157])
{code}
"Thread-1212" #1581 prio=5 os_prio=31 tid=0x00007fdb4c628800 nid=0x1b84f in 
Object.wait() [0x000070000997b000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1252)
- locked <0x00000007ae967340> (a java.lang.Thread)
at java.lang.Thread.join(Thread.java:1326)
at org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:157)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
- locked <0x00000007ae642cc8> (a java.lang.Object)
at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54)
at 
org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102)
at 
org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:67)
at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1933)
at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:2155)
- locked <0x00000007ae53d3c0> (a org.apache.tez.dag.app.LocalDAGAppMaster)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
- locked <0x00000007ae53d5f0> (a java.lang.Object)
at org.apache.tez.client.LocalClient.close(LocalClient.java:135)
at org.apache.tez.client.TezClient.stop(TezClient.java:775)
- locked <0x00000007ade485a0> (a org.apache.tez.client.TezClient)
at 
org.apache.tez.test.TestLocalMode.testMultipleClientsWithSession(TestLocalMode.java:147)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
{code}

during the investigation I added some log messages, that lead to this solution:

1. there is not a single event in the queue
2. event handling thread is in BLOCKED state
3. AM is already shut down, but test hangs as LocalClient.stop waits to AM 
services to be stopped

{code}
2021-01-29 16:16:44,350 INFO  [Thread-2198] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(148)) - AsyncDispatcher is draining to stop, 
ignoring any new events.
...
2021-01-29 16:16:45,354 INFO  [Thread-2198] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to 
drain...current events count: 0, handler thread state: BLOCKED
2021-01-29 16:16:46,358 INFO  [Thread-2198] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to 
drain...current events count: 0, handler thread state: BLOCKED
2021-01-29 16:16:47,362 INFO  [Thread-2198] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to 
drain...current events count: 0, handler thread state: BLOCKED
...
2021-01-29 16:16:48,365 INFO  [Thread-2198] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to 
drain...current events count: 0, handler thread state: BLOCKED
2021-01-29 16:16:49,349 INFO  [AMShutdownThread] app.DAGAppMaster 
(DAGAppMaster.java:run(1002)) - Calling stop for all the services
2021-01-29 16:16:49,349 INFO  [AMShutdownThread] app.DAGAppMaster 
(DAGAppMaster.java:run(1013)) - Exiting DAGAppMaster..GoodBye!
2021-01-29 16:16:49,370 INFO  [Thread-2198] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to 
drain...current events count: 0, handler thread state: BLOCKED
2021-01-29 16:16:49,370 WARN  [Thread-2198] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(153)) - Cannot wait for drain events anymore, 
logging current events if any...
2021-01-29 16:16:49,370 INFO  [Thread-2198] common.AsyncDispatcher 
(AsyncDispatcher.java:logCurrentEvents(186)) - Current events in the queue: 0
2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(168)) - interrupting event handling thread, 
state: BLOCKED
2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(170)) - interrupted event handling thread, 
state: BLOCKED
2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(172)) - joining event handling thread, state: 
BLOCKED
{code}

under normal circumstances the event handling thread is in WAITING state (which 
I think is because of queue.take)

  was:
While further testing TEZ-4236, I discovered a strange hang in AsyncDispatcher. 
The symptom was a BLOCKED state event-handling thread. This block is most 
likely caused by the dispatcher itself, so instead of making yet another sync 
hack in the dispatcher, or even DrainDispatcher (however they have a common 
codebase for serviceStop at the moment), I'm about to put a safety check, which 
solved my issue.


jstack of a hang (it hangs 
[here|https://github.com/apache/tez/blob/master/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L157])
{code}
"Thread-1212" #1581 prio=5 os_prio=31 tid=0x00007fdb4c628800 nid=0x1b84f in 
Object.wait() [0x000070000997b000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1252)
- locked <0x00000007ae967340> (a java.lang.Thread)
at java.lang.Thread.join(Thread.java:1326)
at org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:157)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
- locked <0x00000007ae642cc8> (a java.lang.Object)
at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54)
at 
org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102)
at 
org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:67)
at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1933)
at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:2155)
- locked <0x00000007ae53d3c0> (a org.apache.tez.dag.app.LocalDAGAppMaster)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
- locked <0x00000007ae53d5f0> (a java.lang.Object)
at org.apache.tez.client.LocalClient.close(LocalClient.java:135)
at org.apache.tez.client.TezClient.stop(TezClient.java:775)
- locked <0x00000007ade485a0> (a org.apache.tez.client.TezClient)
at 
org.apache.tez.test.TestLocalMode.testMultipleClientsWithSession(TestLocalMode.java:147)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
{code}

during the investigation I added some log messages, that lead to this solution:

1. there is not a single event in the queue
2. event handling thread is in BLOCKED state
3. AM is already shut down, but test hangs as LocalClient.stop waits to AM 
services to be stopped

{code}
2021-01-29 16:16:44,350 INFO  [Thread-2198] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(148)) - AsyncDispatcher is draining to stop, 
ignoring any new events.
...
2021-01-29 16:16:45,354 INFO  [Thread-2198] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to 
drain...current events count: 0, handler thread state: BLOCKED
2021-01-29 16:16:46,358 INFO  [Thread-2198] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to 
drain...current events count: 0, handler thread state: BLOCKED
2021-01-29 16:16:47,362 INFO  [Thread-2198] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to 
drain...current events count: 0, handler thread state: BLOCKED
...
2021-01-29 16:16:48,365 INFO  [Thread-2198] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to 
drain...current events count: 0, handler thread state: BLOCKED
2021-01-29 16:16:49,349 INFO  [AMShutdownThread] app.DAGAppMaster 
(DAGAppMaster.java:run(1002)) - Calling stop for all the services
2021-01-29 16:16:49,349 INFO  [AMShutdownThread] app.DAGAppMaster 
(DAGAppMaster.java:run(1013)) - Exiting DAGAppMaster..GoodBye!
2021-01-29 16:16:49,370 INFO  [Thread-2198] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to 
drain...current events count: 0, handler thread state: BLOCKED
2021-01-29 16:16:49,370 WARN  [Thread-2198] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(153)) - Cannot wait for drain events anymore, 
logging current events if any...
2021-01-29 16:16:49,370 INFO  [Thread-2198] common.AsyncDispatcher 
(AsyncDispatcher.java:logCurrentEvents(186)) - Current events in the queue: 0
2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(168)) - interrupting event handling thread, 
state: BLOCKED
2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(170)) - interrupted event handling thread, 
state: BLOCKED
2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(172)) - joining event handling thread, state: 
BLOCKED
{code}


> AsyncDispatcher can hang on serviceStop if the eventhandling thread is in 
> BLOCKED state
> ---------------------------------------------------------------------------------------
>
>                 Key: TEZ-4277
>                 URL: https://issues.apache.org/jira/browse/TEZ-4277
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>             Fix For: 0.10.1
>
>
> While further testing TEZ-4236, I discovered a strange hang in 
> AsyncDispatcher. The symptom was a BLOCKED state event-handling thread. This 
> block is most likely caused by the dispatcher itself, so instead of making 
> yet another sync hack in the dispatcher, or even DrainDispatcher (however 
> they have a common codebase for serviceStop at the moment), I'm about to put 
> a safety check, which solved my issue.
> jstack of a hang (it hangs 
> [here|https://github.com/apache/tez/blob/master/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L157])
> {code}
> "Thread-1212" #1581 prio=5 os_prio=31 tid=0x00007fdb4c628800 nid=0x1b84f in 
> Object.wait() [0x000070000997b000]
> java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1252)
> - locked <0x00000007ae967340> (a java.lang.Thread)
> at java.lang.Thread.join(Thread.java:1326)
> at org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:157)
> at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
> - locked <0x00000007ae642cc8> (a java.lang.Object)
> at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54)
> at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102)
> at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:67)
> at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1933)
> at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:2155)
> - locked <0x00000007ae53d3c0> (a org.apache.tez.dag.app.LocalDAGAppMaster)
> at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
> - locked <0x00000007ae53d5f0> (a java.lang.Object)
> at org.apache.tez.client.LocalClient.close(LocalClient.java:135)
> at org.apache.tez.client.TezClient.stop(TezClient.java:775)
> - locked <0x00000007ade485a0> (a org.apache.tez.client.TezClient)
> at 
> org.apache.tez.test.TestLocalMode.testMultipleClientsWithSession(TestLocalMode.java:147)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {code}
> during the investigation I added some log messages, that lead to this 
> solution:
> 1. there is not a single event in the queue
> 2. event handling thread is in BLOCKED state
> 3. AM is already shut down, but test hangs as LocalClient.stop waits to AM 
> services to be stopped
> {code}
> 2021-01-29 16:16:44,350 INFO  [Thread-2198] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(148)) - AsyncDispatcher is draining to 
> stop, ignoring any new events.
> ...
> 2021-01-29 16:16:45,354 INFO  [Thread-2198] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to 
> drain...current events count: 0, handler thread state: BLOCKED
> 2021-01-29 16:16:46,358 INFO  [Thread-2198] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to 
> drain...current events count: 0, handler thread state: BLOCKED
> 2021-01-29 16:16:47,362 INFO  [Thread-2198] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to 
> drain...current events count: 0, handler thread state: BLOCKED
> ...
> 2021-01-29 16:16:48,365 INFO  [Thread-2198] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to 
> drain...current events count: 0, handler thread state: BLOCKED
> 2021-01-29 16:16:49,349 INFO  [AMShutdownThread] app.DAGAppMaster 
> (DAGAppMaster.java:run(1002)) - Calling stop for all the services
> 2021-01-29 16:16:49,349 INFO  [AMShutdownThread] app.DAGAppMaster 
> (DAGAppMaster.java:run(1013)) - Exiting DAGAppMaster..GoodBye!
> 2021-01-29 16:16:49,370 INFO  [Thread-2198] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to 
> drain...current events count: 0, handler thread state: BLOCKED
> 2021-01-29 16:16:49,370 WARN  [Thread-2198] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(153)) - Cannot wait for drain events 
> anymore, logging current events if any...
> 2021-01-29 16:16:49,370 INFO  [Thread-2198] common.AsyncDispatcher 
> (AsyncDispatcher.java:logCurrentEvents(186)) - Current events in the queue: 0
> 2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(168)) - interrupting event handling thread, 
> state: BLOCKED
> 2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(170)) - interrupted event handling thread, 
> state: BLOCKED
> 2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(172)) - joining event handling thread, 
> state: BLOCKED
> {code}
> under normal circumstances the event handling thread is in WAITING state 
> (which I think is because of queue.take)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to