clintropolis opened a new pull request, #13344:
URL: https://github.com/apache/druid/pull/13344
`RemoteTaskRunnerTest.testRunPendingTaskFailToAssignTask` fails pretty
consistently if run until failure in IntelliJ. After adding this `Thread.sleep`, I
let it run for over 2k iterations without failure.
I hate it, but it seems to significantly reduce the flakiness, and I wasn't
able to determine a "good" fix in a short amount of time, so let's do this for
now.
The underlying issue appears to be a race condition between the test zk server and
worker startup: if the timing is off, an `INITIALIZED` event that arrives
after the first pending task is added can cause the task runner to call
`runPendingTask` before the test is able to call `runPendingTask` itself,
which makes the test assertions no longer true.
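To make the ordering concrete, here is a minimal sketch of the race as I understand it. It is not the actual `RemoteTaskRunner` code: the class `PendingTaskRaceSketch`, the method `whoRunsTask`, and the latches are all hypothetical stand-ins, with a single background thread playing the role of the handler that reacts to the `INITIALIZED` event. The latches only force each interleaving so that both outcomes are reproducible.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model of the race; none of these names exist in Druid.
public class PendingTaskRaceSketch {

    // Wait on a latch, failing loudly instead of hanging forever.
    static void awaitOrFail(CountDownLatch latch) throws InterruptedException {
        if (!latch.await(5, TimeUnit.SECONDS)) {
            throw new IllegalStateException("timed out waiting for latch");
        }
    }

    // Returns "test" or "background", depending on who dequeued the pending task.
    static String whoRunsTask(boolean eventBeforeAdd) throws InterruptedException {
        Queue<String> pending = new ConcurrentLinkedQueue<>();
        AtomicReference<String> consumer = new AtomicReference<>("nobody");
        CountDownLatch taskAdded = new CountDownLatch(1);
        CountDownLatch eventHandled = new CountDownLatch(1);
        ExecutorService background = Executors.newSingleThreadExecutor();
        try {
            // Stand-in for the event handler that runs any pending task it finds.
            background.submit(() -> {
                try {
                    if (!eventBeforeAdd) {
                        awaitOrFail(taskAdded); // force the "late event" interleaving
                    }
                    if (pending.poll() != null) {
                        consumer.set("background");
                    }
                } finally {
                    eventHandled.countDown();
                }
                return null;
            });

            if (eventBeforeAdd) {
                awaitOrFail(eventHandled); // event already handled, queue was empty
                pending.add("task");
            } else {
                pending.add("task");
                taskAdded.countDown();     // event fires after the task is pending
                awaitOrFail(eventHandled);
            }

            // Stand-in for the test calling runPendingTask itself.
            if (pending.poll() != null) {
                consumer.set("test");
            }
            return consumer.get();
        } finally {
            background.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(whoRunsTask(true));  // prints "test"
        System.out.println(whoRunsTask(false)); // prints "background"
    }
}
```

The second case corresponds to the failing runs: the background thread has already taken the task, so the test's own call never reaches the code path that throws. The `Thread.sleep` in this PR only makes the first ordering much more likely; it does not enforce it the way the latches do in this sketch.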
In successful runs, the logs have a section like:
```
2022-11-03T01:20:05,937 INFO [Time-limited test] org.apache.druid.indexing.overlord.RemoteTaskRunner - Added pending task task id with spaces
2022-11-03T01:20:05,938 ERROR [Time-limited test] org.apache.druid.indexing.overlord.RemoteTaskRunner - Exception while trying to assign task: {class=org.apache.druid.indexing.overlord.RemoteTaskRunner, exceptionType=class java.lang.IllegalArgumentException, exceptionMessage=task id != workItem id, taskId=wrongId}
java.lang.IllegalArgumentException: task id != workItem id
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:125) ~[guava-16.0.1.jar:?]
	at org.apache.druid.indexing.overlord.RemoteTaskRunner.tryAssignTask(RemoteTaskRunner.java:847) ~[classes/:?]
	at org.apache.druid.indexing.overlord.RemoteTaskRunner.runPendingTask(RemoteTaskRunner.java:771) ~[classes/:?]
```
but in the failure, there is no exception; the task is instead assigned from the `rtr-pending-tasks-runner-0` thread:
```
2022-11-03T01:20:17,391 INFO [Time-limited test] org.apache.druid.indexing.overlord.RemoteTaskRunner - Added pending task task id with spaces
2022-11-03T01:20:17,399 INFO [rtr-pending-tasks-runner-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Assigning task [task id with spaces] to worker [worker]
2022-11-03T01:20:17,423 INFO [rtr-pending-tasks-runner-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Task [task id with spaces] started running on worker [worker]
2022-11-03T01:20:18,316 INFO [SessionTracker] org.apache.zookeeper.server.SessionTrackerImpl - SessionTrackerImpl exited loop!
2022-11-03T01:20:18,397 INFO [Time-limited test] org.apache.druid.indexing.overlord.RemoteTaskRunner - Stopping RemoteTaskRunner...
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]