[ 
https://issues.apache.org/jira/browse/TEZ-4501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena resolved TEZ-4501.
-------------------------------
    Fix Version/s: 0.10.4
       Resolution: Fixed

> Fix TestLocalMode timeouts - handle drained in case of InterruptedException
> ---------------------------------------------------------------------------
>
>                 Key: TEZ-4501
>                 URL: https://issues.apache.org/jira/browse/TEZ-4501
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>             Fix For: 0.10.4
>
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> The relevant part from AsyncDispatcher is:
> {code}
>       synchronized (waitForDrained) {
>         while (!drained && eventHandlingThread.isAlive() && 
> System.currentTimeMillis() < endTime) {
>           waitForDrained.wait(1000);
>           LOG.info(
>               "Waiting for AsyncDispatcher to drain. Current queue size: {}, 
> handler thread state: {}",
>               eventQueue.size(), eventHandlingThread.getState());
>         }
>       }
> {code}
> Here is a log snippet from TestLocalMode when looping for 
> TEZ_AM_DISPATCHER_DRAIN_EVENTS_TIMEOUT simply caused a test timeout because 
> drained was false.
> {code}
> 2023-07-03 12:15:39,938 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(148)) - AsyncDispatcher is draining to 
> stop, ignoring any new events.
> 2023-07-03 12:15:39,938 INFO  [CallbackExecutor] common.AsyncDispatcher 
> (AsyncDispatcher.java:handle(369)) - AsyncDispatcher put event: EventType: 
> C_COMPLETED
> 2023-07-03 12:15:39,938 WARN  [CallbackExecutor] common.AsyncDispatcher 
> (AsyncDispatcher.java:handle(373)) - AsyncDispatcher thread interrupted 
> (while putting event): EventType: C_COMPLETED
> java.lang.InterruptedException
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223)
>       at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>       at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
>       at 
> org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370)
>       at 
> org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97)
>       at 
> org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357)
>       at 
> com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> Exception in thread "CallbackExecutor" 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
>       at 
> org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:375)
>       at 
> org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97)
>       at 
> org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357)
>       at 
> com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.InterruptedException
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223)
>       at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>       at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
>       at 
> org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370)
>       ... 6 more
> 2023-07-03 12:15:40,938 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:41,939 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:42,939 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:43,939 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:44,940 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:45,940 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:46,940 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:47,941 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:48,941 INFO  [Time-limited test] common.AsyncDispatcher 
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to 
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:49,149 INFO  [Time-limited test] service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service Central failed in state 
> STOPPED
> java.lang.InterruptedException
>       at java.lang.Object.wait(Native Method)
>       at 
> org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:155)
> {code}
> what happened here is:
> 1. test is stopped (tezclient.stop)
> 2. eventQueue.put has been 
> [interrupted|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L367],
>  but drained is already [set to 
> false|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L338]
> 3. eventQueue.size == 0 and [still 
> looping|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L151],
>  this is only possible if [drained is false: it was set in 
> GenericEventHandler.handle|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L338]
> 4. hence test waits for TEZ_AM_DISPATCHER_DRAIN_EVENTS_TIMEOUT (10000ms 
> default), but test timeout is also 10000ms: instead of increasing test 
> timeout it would make sense to take care of drained variable in the event of 
> an interrupt



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to