[ https://issues.apache.org/jira/browse/TEZ-4501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
László Bodor updated TEZ-4501: ------------------------------ Description: The relevant part from AsyncDispatcher is: {code} synchronized (waitForDrained) { while (!drained && eventHandlingThread.isAlive() && System.currentTimeMillis() < endTime) { waitForDrained.wait(1000); LOG.info( "Waiting for AsyncDispatcher to drain. Current queue size: {}, handler thread state: {}", eventQueue.size(), eventHandlingThread.getState()); } } {code} Here is a log snippet from TestLocalMode when looping for TEZ_AM_DISPATCHER_DRAIN_EVENTS_TIMEOUT simply caused a test timeout because drained was false. {code} 2023-07-03 12:15:39,938 INFO [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(148)) - AsyncDispatcher is draining to stop, ignoring any new events. 2023-07-03 12:15:39,938 INFO [CallbackExecutor] common.AsyncDispatcher (AsyncDispatcher.java:handle(369)) - AsyncDispatcher put event: EventType: C_COMPLETED 2023-07-03 12:15:39,938 WARN [CallbackExecutor] common.AsyncDispatcher (AsyncDispatcher.java:handle(373)) - AsyncDispatcher thread interrupted (while putting event): EventType: C_COMPLETED java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223) at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339) at org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370) at org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97) at org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357) at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Exception in thread "CallbackExecutor" org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.InterruptedException at org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:375) at org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97) at org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357) at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223) at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339) at org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370) ... 6 more 2023-07-03 12:15:40,938 INFO [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING 2023-07-03 12:15:41,939 INFO [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING 2023-07-03 12:15:42,939 INFO [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING 2023-07-03 12:15:43,939 INFO [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING 2023-07-03 12:15:44,940 INFO [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING 2023-07-03 12:15:45,940 INFO [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING 2023-07-03 12:15:46,940 INFO [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING 2023-07-03 12:15:47,941 INFO [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING 2023-07-03 12:15:48,941 INFO [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING 2023-07-03 12:15:49,149 INFO [Time-limited test] service.AbstractService (AbstractService.java:noteFailure(267)) - Service Central failed in state STOPPED java.lang.InterruptedException at java.lang.Object.wait(Native Method) at org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:155) {code} what happened here is: 1. test is stopped (tezclient.stop) 2. eventQueue.put has been [interrupted|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L367], but drained is already set to false 3. eventQueue.size == 0 and [still looping|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L151], this is only possible if [drained is false: it was set in GenericEventHandler.handle|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L338] 4. hence test waits for TEZ_AM_DISPATCHER_DRAIN_EVENTS_TIMEOUT (10000ms default), but test timeout is also 10000ms: instead of increasing test timeout it would make sense to take care of drained variable in the event of an interrupt was: The relevant part from AsyncDispatcher is: {code} synchronized (waitForDrained) { while (!drained && eventHandlingThread.isAlive() && System.currentTimeMillis() < endTime) { waitForDrained.wait(1000); LOG.info( "Waiting for AsyncDispatcher to drain. Current queue size: {}, handler thread state: {}", eventQueue.size(), eventHandlingThread.getState()); } } {code} Here is a log snippet from TestLocalMode when looping for TEZ_AM_DISPATCHER_DRAIN_EVENTS_TIMEOUT simply caused a test timeout because drained was false. {code} 2023-07-03 12:15:39,938 INFO [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(148)) - AsyncDispatcher is draining to stop, ignoring any new events. 2023-07-03 12:15:39,938 INFO [CallbackExecutor] common.AsyncDispatcher (AsyncDispatcher.java:handle(369)) - AsyncDispatcher put event: EventType: C_COMPLETED 2023-07-03 12:15:39,938 WARN [CallbackExecutor] common.AsyncDispatcher (AsyncDispatcher.java:handle(373)) - AsyncDispatcher thread interrupted (while putting event): EventType: C_COMPLETED java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223) at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339) at org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370) at org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97) at org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357) at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Exception in thread "CallbackExecutor" org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.InterruptedException at org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:375) at org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97) at org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357) at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223) at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339) at org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370) ... 6 more 2023-07-03 12:15:40,938 INFO [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING 2023-07-03 12:15:41,939 INFO [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING 2023-07-03 12:15:42,939 INFO [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING 2023-07-03 12:15:43,939 INFO [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING 2023-07-03 12:15:44,940 INFO [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING 2023-07-03 12:15:45,940 INFO [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING 2023-07-03 12:15:46,940 INFO [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING 2023-07-03 12:15:47,941 INFO [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING 2023-07-03 12:15:48,941 INFO [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING 2023-07-03 12:15:49,149 INFO [Time-limited test] service.AbstractService (AbstractService.java:noteFailure(267)) - Service Central failed in state STOPPED java.lang.InterruptedException at java.lang.Object.wait(Native Method) at org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:155) {code} what happened here is: 1. test is stopped (tezclient.stop) 2. eventQueue.put has been interrupted, but drained is already set to false 3. eventQueue.size == 0 and still looping, this is only possible if [drained is false: it was set in GenericEventHandler.handle|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L338] 4. hence test waits for TEZ_AM_DISPATCHER_DRAIN_EVENTS_TIMEOUT (10000ms default), but test timeout is also 10000ms: instead of increasing test timeout it would make sense to take care of drained variable in the event of an interrupt > Fix TestLocalMode timeouts > -------------------------- > > Key: TEZ-4501 > URL: https://issues.apache.org/jira/browse/TEZ-4501 > Project: Apache Tez > Issue Type: Bug > Reporter: László Bodor > Assignee: László Bodor > Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > The relevant part from AsyncDispatcher is: > {code} > synchronized (waitForDrained) { > while (!drained && eventHandlingThread.isAlive() && > System.currentTimeMillis() < endTime) { > waitForDrained.wait(1000); > LOG.info( > "Waiting for AsyncDispatcher to drain. Current queue size: {}, > handler thread state: {}", > eventQueue.size(), eventHandlingThread.getState()); > } > } > {code} > Here is a log snippet from TestLocalMode when looping for > TEZ_AM_DISPATCHER_DRAIN_EVENTS_TIMEOUT simply caused a test timeout because > drained was false. > {code} > 2023-07-03 12:15:39,938 INFO [Time-limited test] common.AsyncDispatcher > (AsyncDispatcher.java:serviceStop(148)) - AsyncDispatcher is draining to > stop, ignoring any new events. > 2023-07-03 12:15:39,938 INFO [CallbackExecutor] common.AsyncDispatcher > (AsyncDispatcher.java:handle(369)) - AsyncDispatcher put event: EventType: > C_COMPLETED > 2023-07-03 12:15:39,938 WARN [CallbackExecutor] common.AsyncDispatcher > (AsyncDispatcher.java:handle(373)) - AsyncDispatcher thread interrupted > (while putting event): EventType: C_COMPLETED > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339) > at > org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370) > at > org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97) > at > org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357) > at > com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Exception in thread "CallbackExecutor" > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.lang.InterruptedException > at > org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:375) > at > org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97) > at > org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357) > at > com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339) > at > org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370) > ... 6 more > 2023-07-03 12:15:40,938 INFO [Time-limited test] common.AsyncDispatcher > (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to > drain. Current queue size: 0, handler thread state: WAITING > 2023-07-03 12:15:41,939 INFO [Time-limited test] common.AsyncDispatcher > (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to > drain. Current queue size: 0, handler thread state: WAITING > 2023-07-03 12:15:42,939 INFO [Time-limited test] common.AsyncDispatcher > (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to > drain. Current queue size: 0, handler thread state: WAITING > 2023-07-03 12:15:43,939 INFO [Time-limited test] common.AsyncDispatcher > (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to > drain. Current queue size: 0, handler thread state: WAITING > 2023-07-03 12:15:44,940 INFO [Time-limited test] common.AsyncDispatcher > (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to > drain. Current queue size: 0, handler thread state: WAITING > 2023-07-03 12:15:45,940 INFO [Time-limited test] common.AsyncDispatcher > (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to > drain. Current queue size: 0, handler thread state: WAITING > 2023-07-03 12:15:46,940 INFO [Time-limited test] common.AsyncDispatcher > (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to > drain. Current queue size: 0, handler thread state: WAITING > 2023-07-03 12:15:47,941 INFO [Time-limited test] common.AsyncDispatcher > (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to > drain. Current queue size: 0, handler thread state: WAITING > 2023-07-03 12:15:48,941 INFO [Time-limited test] common.AsyncDispatcher > (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to > drain. Current queue size: 0, handler thread state: WAITING > 2023-07-03 12:15:49,149 INFO [Time-limited test] service.AbstractService > (AbstractService.java:noteFailure(267)) - Service Central failed in state > STOPPED > java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at > org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:155) > {code} > what happened here is: > 1. test is stopped (tezclient.stop) > 2. eventQueue.put has been > [interrupted|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L367], > but drained is already set to false > 3. eventQueue.size == 0 and [still > looping|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L151], > this is only possible if [drained is false: it was set in > GenericEventHandler.handle|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L338] > 4. hence test waits for TEZ_AM_DISPATCHER_DRAIN_EVENTS_TIMEOUT (10000ms > default), but test timeout is also 10000ms: instead of increasing test > timeout it would make sense to take care of drained variable in the event of > an interrupt -- This message was sent by Atlassian Jira (v8.20.10#820010)