[ https://issues.apache.org/jira/browse/TEZ-4501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ayush Saxena resolved TEZ-4501. ------------------------------- Fix Version/s: 0.10.4 Resolution: Fixed > Fix TestLocalMode timeouts - handle drained in case of InterruptedException > --------------------------------------------------------------------------- > > Key: TEZ-4501 > URL: https://issues.apache.org/jira/browse/TEZ-4501 > Project: Apache Tez > Issue Type: Bug > Reporter: László Bodor > Assignee: László Bodor > Priority: Major > Fix For: 0.10.4 > > Time Spent: 2h > Remaining Estimate: 0h > > The relevant part from AsyncDispatcher is: > {code} > synchronized (waitForDrained) { > while (!drained && eventHandlingThread.isAlive() && > System.currentTimeMillis() < endTime) { > waitForDrained.wait(1000); > LOG.info( > "Waiting for AsyncDispatcher to drain. Current queue size: {}, > handler thread state: {}", > eventQueue.size(), eventHandlingThread.getState()); > } > } > {code} > Here is a log snippet from TestLocalMode when looping for > TEZ_AM_DISPATCHER_DRAIN_EVENTS_TIMEOUT simply caused a test timeout because > drained was false. > {code} > 2023-07-03 12:15:39,938 INFO [Time-limited test] common.AsyncDispatcher > (AsyncDispatcher.java:serviceStop(148)) - AsyncDispatcher is draining to > stop, ignoring any new events. > 2023-07-03 12:15:39,938 INFO [CallbackExecutor] common.AsyncDispatcher > (AsyncDispatcher.java:handle(369)) - AsyncDispatcher put event: EventType: > C_COMPLETED > 2023-07-03 12:15:39,938 WARN [CallbackExecutor] common.AsyncDispatcher > (AsyncDispatcher.java:handle(373)) - AsyncDispatcher thread interrupted > (while putting event): EventType: C_COMPLETED > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339) > at > org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370) > at > org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97) > at > org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357) > at > com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Exception in thread "CallbackExecutor" > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.lang.InterruptedException > at > org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:375) > at > org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97) > at > org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357) > at > com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339) > at > org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370) > ... 6 more > 2023-07-03 12:15:40,938 INFO [Time-limited test] common.AsyncDispatcher > (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to > drain. Current queue size: 0, handler thread state: WAITING > 2023-07-03 12:15:41,939 INFO [Time-limited test] common.AsyncDispatcher > (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to > drain. Current queue size: 0, handler thread state: WAITING > 2023-07-03 12:15:42,939 INFO [Time-limited test] common.AsyncDispatcher > (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to > drain. Current queue size: 0, handler thread state: WAITING > 2023-07-03 12:15:43,939 INFO [Time-limited test] common.AsyncDispatcher > (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to > drain. Current queue size: 0, handler thread state: WAITING > 2023-07-03 12:15:44,940 INFO [Time-limited test] common.AsyncDispatcher > (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to > drain. Current queue size: 0, handler thread state: WAITING > 2023-07-03 12:15:45,940 INFO [Time-limited test] common.AsyncDispatcher > (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to > drain. Current queue size: 0, handler thread state: WAITING > 2023-07-03 12:15:46,940 INFO [Time-limited test] common.AsyncDispatcher > (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to > drain. Current queue size: 0, handler thread state: WAITING > 2023-07-03 12:15:47,941 INFO [Time-limited test] common.AsyncDispatcher > (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to > drain. Current queue size: 0, handler thread state: WAITING > 2023-07-03 12:15:48,941 INFO [Time-limited test] common.AsyncDispatcher > (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to > drain. Current queue size: 0, handler thread state: WAITING > 2023-07-03 12:15:49,149 INFO [Time-limited test] service.AbstractService > (AbstractService.java:noteFailure(267)) - Service Central failed in state > STOPPED > java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at > org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:155) > {code} > what happened here is: > 1. test is stopped (tezclient.stop) > 2. eventQueue.put has been > [interrupted|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L367], > but drained is already [set to > false|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L338] > 3. eventQueue.size == 0 and [still > looping|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L151], > this is only possible if [drained is false: it was set in > GenericEventHandler.handle|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L338] > 4. hence test waits for TEZ_AM_DISPATCHER_DRAIN_EVENTS_TIMEOUT (10000ms > default), but test timeout is also 10000ms: instead of increasing test > timeout it would make sense to take care of drained variable in the event of > an interrupt -- This message was sent by Atlassian Jira (v8.20.10#820010)