pagrawal10 commented on issue #15944:
URL: https://github.com/apache/druid/issues/15944#issuecomment-1976847918
Hey Amatya,
I debugged the logs further and this is the flow of events:
Handoff started for the A2 task at 15:23. The task was waiting for the handoff
to complete.
At 15:26:19, A1 completed and sent the stop signal to A2. On receiving the
stop signal, A2 started shutting down immediately and dropped the segment
it was handling. The segment was dropped successfully. The task could not
have been waiting for a historical to pick up the segment, as the segment had
already been loaded onto historicals by the replica task, which had completed.
At 15:26:23, the discoverTasks() function ran and put A2 in a new Pending
Completion task group.
The task never finished shutting down and stayed stuck until the
timeout elapsed. Between 15:27 and 15:56 I see no logs from the task
except lines stating its current offset.
At 15:56, we see this exception:
java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.RuntimeException: org.apache.druid.java.util.common.RE: Current thread is interrupted after [0] tries
    at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.druid.indexing.overlord.ThreadingTaskRunner$2.call(ThreadingTaskRunner.java:323)
    at org.apache.druid.indexing.overlord.ThreadingTaskRunner$2.call(ThreadingTaskRunner.java:315)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: org.apache.druid.java.util.common.RE: Current thread is interrupted after [0] tries
    at org.apache.druid.indexing.overlord.ThreadingTaskRunner$1.call(ThreadingTaskRunner.java:232)
    at org.apache.druid.indexing.overlord.ThreadingTaskRunner$1.call(ThreadingTaskRunner.java:152)
    ... 4 more
Caused by: java.lang.RuntimeException: org.apache.druid.java.util.common.RE: Current thread is interrupted after [0] tries
    at org.apache.druid.storage.s3.S3TaskLogs.pushTaskFile(S3TaskLogs.java:156)
    at org.apache.druid.storage.s3.S3TaskLogs.pushTaskReports(S3TaskLogs.java:141)
    at org.apache.druid.indexing.overlord.ThreadingTaskRunner$1.call(ThreadingTaskRunner.java:223)
    ... 5 more
Caused by: org.apache.druid.java.util.common.RE: Current thread is interrupted after [0] tries
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:148)
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:81)
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:163)
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:153)
    at org.apache.druid.storage.s3.S3Utils.retryS3Operation(S3Utils.java:101)
    at org.apache.druid.storage.s3.S3TaskLogs.pushTaskFile(S3TaskLogs.java:147)
    ... 7 more
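For reference, the "after [0] tries" part of the innermost message suggests that the thread's interrupt flag was already set before the first attempt to push the task reports, so the upload was probably never tried. The sketch below shows how an interrupt-aware retry loop can produce that message. It is an illustration only: InterruptAwareRetry, retry and demo are hypothetical names, and this is not the actual RetryUtils code.

```java
import java.util.concurrent.Callable;

public class InterruptAwareRetry {
    // Hypothetical helper for illustration; not Druid's RetryUtils.
    // Checks the interrupt flag before every attempt and gives up at once if it is set.
    static <T> T retry(Callable<T> task, int maxTries) throws Exception {
        int tries = 0;
        while (true) {
            if (Thread.currentThread().isInterrupted()) {
                throw new RuntimeException(
                    "Current thread is interrupted after [" + tries + "] tries");
            }
            try {
                tries++;
                return task.call();
            } catch (Exception e) {
                if (tries >= maxTries) {
                    throw e; // out of attempts, surface the last failure
                }
            }
        }
    }

    // Simulates a task thread that was interrupted (e.g. by a shutdown)
    // before it reached the upload step.
    static String demo() {
        Thread.currentThread().interrupt();
        try {
            retry(() -> "pushed", 3);
            return "no exception";
        } catch (Exception e) {
            return e.getMessage();
        } finally {
            Thread.interrupted(); // clear the flag so the caller is unaffected
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If the real code behaves similarly, the exception at 15:56 would be a consequence of the task thread being interrupted at the timeout, not the cause of the hang itself.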
I have gone through the code but could not pinpoint where the task thread
was stuck or where the exception was swallowed. Can you please take a look?
It seems like discoverTasks() interfered with the graceful shutdown
process.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]