[ 
https://issues.apache.org/jira/browse/TEZ-4501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated TEZ-4501:
------------------------------
    Description: 
The relevant part from AsyncDispatcher is:
{code}
      synchronized (waitForDrained) {
        while (!drained && eventHandlingThread.isAlive() && 
System.currentTimeMillis() < endTime) {
          waitForDrained.wait(1000);
          LOG.info(
              "Waiting for AsyncDispatcher to drain. Current queue size: {}, 
handler thread state: {}",
              eventQueue.size(), eventHandlingThread.getState());
        }
      }
{code}

Here is a log snippet from TestLocalMode when looping for 
TEZ_AM_DISPATCHER_DRAIN_EVENTS_TIMEOUT simply caused a test timeout because 
drained was false.

{code}
2023-07-03 12:15:39,938 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(148)) - AsyncDispatcher is draining to stop, 
ignoring any new events.
2023-07-03 12:15:39,938 INFO  [CallbackExecutor] common.AsyncDispatcher 
(AsyncDispatcher.java:handle(369)) - AsyncDispatcher put event: EventType: 
C_COMPLETED
2023-07-03 12:15:39,938 WARN  [CallbackExecutor] common.AsyncDispatcher 
(AsyncDispatcher.java:handle(373)) - AsyncDispatcher thread interrupted (while 
putting event): EventType: C_COMPLETED
java.lang.InterruptedException
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223)
        at 
java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
        at 
java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
        at 
org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370)
        at 
org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97)
        at 
org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357)
        at 
com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Exception in thread "CallbackExecutor" 
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.lang.InterruptedException
        at 
org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:375)
        at 
org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97)
        at 
org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357)
        at 
com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223)
        at 
java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
        at 
java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
        at 
org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370)
        ... 6 more
2023-07-03 12:15:40,938 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:41,939 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:42,939 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:43,939 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:44,940 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:45,940 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:46,940 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:47,941 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:48,941 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:49,149 INFO  [Time-limited test] service.AbstractService 
(AbstractService.java:noteFailure(267)) - Service Central failed in state 
STOPPED
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at 
org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:155)
{code}

what happened here is:
1. test is stopped (tezclient.stop)
2. eventQueue.put has been 
[interrupted|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L367],
 but drained is already set to false
3. eventQueue.size == 0 and [still 
looping|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L151],
 this is only possible if [drained is false: it was set in 
GenericEventHandler.handle|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L338]
4. hence test waits for TEZ_AM_DISPATCHER_DRAIN_EVENTS_TIMEOUT (10000ms 
default), but test timeout is also 10000ms: instead of increasing test timeout 
it would make sense to take care of drained variable in the event of an 
interrupt

  was:
The relevant part from AsyncDispatcher is:
{code}
      synchronized (waitForDrained) {
        while (!drained && eventHandlingThread.isAlive() && 
System.currentTimeMillis() < endTime) {
          waitForDrained.wait(1000);
          LOG.info(
              "Waiting for AsyncDispatcher to drain. Current queue size: {}, 
handler thread state: {}",
              eventQueue.size(), eventHandlingThread.getState());
        }
      }
{code}

Here is a log snippet from TestLocalMode when looping for 
TEZ_AM_DISPATCHER_DRAIN_EVENTS_TIMEOUT simply caused a test timeout because 
drained was false.

{code}
2023-07-03 12:15:39,938 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(148)) - AsyncDispatcher is draining to stop, 
ignoring any new events.
2023-07-03 12:15:39,938 INFO  [CallbackExecutor] common.AsyncDispatcher 
(AsyncDispatcher.java:handle(369)) - AsyncDispatcher put event: EventType: 
C_COMPLETED
2023-07-03 12:15:39,938 WARN  [CallbackExecutor] common.AsyncDispatcher 
(AsyncDispatcher.java:handle(373)) - AsyncDispatcher thread interrupted (while 
putting event): EventType: C_COMPLETED
java.lang.InterruptedException
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223)
        at 
java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
        at 
java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
        at 
org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370)
        at 
org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97)
        at 
org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357)
        at 
com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Exception in thread "CallbackExecutor" 
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.lang.InterruptedException
        at 
org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:375)
        at 
org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97)
        at 
org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357)
        at 
com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223)
        at 
java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
        at 
java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
        at 
org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370)
        ... 6 more
2023-07-03 12:15:40,938 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:41,939 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:42,939 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:43,939 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:44,940 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:45,940 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:46,940 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:47,941 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:48,941 INFO  [Time-limited test] common.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. 
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:49,149 INFO  [Time-limited test] service.AbstractService 
(AbstractService.java:noteFailure(267)) - Service Central failed in state 
STOPPED
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at 
org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:155)
{code}

what happened here is:
1. test is stopped (tezclient.stop)
2. eventQueue.put has been interrupted, but drained is already set to false
3. eventQueue.size == 0 and still looping, this is only possible if [drained is 
false: it was set in 
GenericEventHandler.handle|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L338]
4. hence test waits for TEZ_AM_DISPATCHER_DRAIN_EVENTS_TIMEOUT (10000ms 
default), but test timeout is also 10000ms: instead of increasing test timeout 
it would make sense to take care of drained variable in the event of an 
interrupt


> Fix TestLocalMode timeouts
> --------------------------
>
>                 Key: TEZ-4501
>                 URL: https://issues.apache.org/jira/browse/TEZ-4501
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The relevant part from AsyncDispatcher is:
> {code}
>       synchronized (waitForDrained) {
>         while (!drained && eventHandlingThread.isAlive() && 
> System.currentTimeMillis() < endTime) {
>           waitForDrained.wait(1000);
>           LOG.info(
>               "Waiting for AsyncDispatcher to drain. Current queue size: {}, 
> handler thread state: {}",
>               eventQueue.size(), eventHandlingThread.getState());
>         }
>       }
> {code}
> Here is a log snippet from TestLocalMode when looping for 
> TEZ_AM_DISPATCHER_DRAIN_EVENTS_TIMEOUT simply caused a test timeout because 
> drained was false.
> {code}
> 2023-07-03 12:15:39,938 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(148)) - AsyncDispatcher is draining to 
> stop, ignoring any new events.
> 2023-07-03 12:15:39,938 INFO  [CallbackExecutor] common.AsyncDispatcher 
> (AsyncDispatcher.java:handle(369)) - AsyncDispatcher put event: EventType: 
> C_COMPLETED
> 2023-07-03 12:15:39,938 WARN  [CallbackExecutor] common.AsyncDispatcher 
> (AsyncDispatcher.java:handle(373)) - AsyncDispatcher thread interrupted 
> (while putting event): EventType: C_COMPLETED
> java.lang.InterruptedException
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223)
>       at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>       at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
>       at 
> org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370)
>       at 
> org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97)
>       at 
> org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357)
>       at 
> com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> Exception in thread "CallbackExecutor" 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
>       at 
> org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:375)
>       at 
> org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97)
>       at 
> org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357)
>       at 
> com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.InterruptedException
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223)
>       at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>       at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
>       at 
> org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370)
>       ... 6 more
> 2023-07-03 12:15:40,938 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:41,939 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:42,939 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:43,939 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:44,940 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:45,940 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:46,940 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:47,941 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:48,941 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:49,149 INFO  [Time-limited test] service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service Central failed in state 
> STOPPED
> java.lang.InterruptedException
>       at java.lang.Object.wait(Native Method)
>       at 
> org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:155)
> {code}
> what happened here is:
> 1. test is stopped (tezclient.stop)
> 2. eventQueue.put has been 
> [interrupted|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L367],
>  but drained is already set to false
> 3. eventQueue.size == 0 and [still 
> looping|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L151],
>  this is only possible if [drained is false: it was set in 
> GenericEventHandler.handle|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L338]
> 4. hence test waits for TEZ_AM_DISPATCHER_DRAIN_EVENTS_TIMEOUT (10000ms 
> default), but test timeout is also 10000ms: instead of increasing test 
> timeout it would make sense to take care of drained variable in the event of 
> an interrupt



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to