[ 
https://issues.apache.org/jira/browse/TEZ-4501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated TEZ-4501:
------------------------------
    Description: 
The relevant part from AsyncDispatcher is:
{code}
      synchronized (waitForDrained) {
        while (!drained && eventHandlingThread.isAlive() && 
System.currentTimeMillis() < endTime) {
          waitForDrained.wait(1000);
          LOG.info(
              "Waiting for AsyncDispatcher to drain. Current queue size: {}, 
handler thread state: {}",
              eventQueue.size(), eventHandlingThread.getState());
        }
      }
{code}

Here is a log snippet from TestLocalMode when looping for 
TEZ_AM_DISPATCHER_DRAIN_EVENTS_TIMEOUT simply caused a test timeout because 
drained was false.

{code}
2023-07-03 12:15:39,938 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(148)) - AsyncDispatcher is draining to stop, 
ignoring any new events.
2023-07-03 12:15:39,938 INFO  [CallbackExecutor] common.AsyncDispatcher 
(AsyncDispatcher.java:handle(369)) - AsyncDispatcher put event: EventType: 
C_COMPLETED
2023-07-03 12:15:39,938 WARN  [CallbackExecutor] common.AsyncDispatcher 
(AsyncDispatcher.java:handle(373)) - AsyncDispatcher thread interrupted (while 
putting event): EventType: C_COMPLETED
java.lang.InterruptedException
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223)
        at 
java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
        at 
java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
        at 
org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370)
        at 
org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97)
        at 
org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357)
        at 
com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Exception in thread "CallbackExecutor" 
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.lang.InterruptedException
        at 
org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:375)
        at 
org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97)
        at 
org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357)
        at 
com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223)
        at 
java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
        at 
java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
        at 
org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370)
        ... 6 more
2023-07-03 12:15:40,938 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:41,939 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:42,939 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:43,939 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:44,940 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:45,940 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:46,940 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:47,941 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:48,941 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:49,149 INFO  [Time-limited test] service.AbstractService 
(AbstractService.java:noteFailure(267)) - Service Central failed in state 
STOPPED
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at 
org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:155)
{code}

eventQueue.size == 0 and still looping, this is only possible if [drained is 
false: it was set in 
GenericEventHandler.handle|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L338]

> Fix TestLocalMode timeouts
> --------------------------
>
>                 Key: TEZ-4501
>                 URL: https://issues.apache.org/jira/browse/TEZ-4501
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The relevant part from AsyncDispatcher is:
> {code}
>       synchronized (waitForDrained) {
>         while (!drained && eventHandlingThread.isAlive() && 
> System.currentTimeMillis() < endTime) {
>           waitForDrained.wait(1000);
>           LOG.info(
>               "Waiting for AsyncDispatcher to drain. Current queue size: {}, 
> handler thread state: {}",
>               eventQueue.size(), eventHandlingThread.getState());
>         }
>       }
> {code}
> Here is a log snippet from TestLocalMode when looping for 
> TEZ_AM_DISPATCHER_DRAIN_EVENTS_TIMEOUT simply caused a test timeout because 
> drained was false.
> {code}
> 2023-07-03 12:15:39,938 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(148)) - AsyncDispatcher is draining to 
> stop, ignoring any new events.
> 2023-07-03 12:15:39,938 INFO  [CallbackExecutor] common.AsyncDispatcher 
> (AsyncDispatcher.java:handle(369)) - AsyncDispatcher put event: EventType: 
> C_COMPLETED
> 2023-07-03 12:15:39,938 WARN  [CallbackExecutor] common.AsyncDispatcher 
> (AsyncDispatcher.java:handle(373)) - AsyncDispatcher thread interrupted 
> (while putting event): EventType: C_COMPLETED
> java.lang.InterruptedException
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223)
>       at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>       at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
>       at 
> org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370)
>       at 
> org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97)
>       at 
> org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357)
>       at 
> com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> Exception in thread "CallbackExecutor" 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
>       at 
> org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:375)
>       at 
> org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97)
>       at 
> org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357)
>       at 
> com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.InterruptedException
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223)
>       at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>       at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
>       at 
> org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370)
>       ... 6 more
> 2023-07-03 12:15:40,938 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:41,939 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:42,939 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:43,939 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:44,940 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:45,940 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:46,940 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:47,941 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:48,941 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:49,149 INFO  [Time-limited test] service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service Central failed in state 
> STOPPED
> java.lang.InterruptedException
>       at java.lang.Object.wait(Native Method)
>       at 
> org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:155)
> {code}
> eventQueue.size == 0 and still looping, this is only possible if [drained is 
> false: it was set in 
> GenericEventHandler.handle|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L338]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to