hardikbajaj opened a new issue, #16783: URL: https://github.com/apache/druid/issues/16783
Druid indexer tasks sometimes get stuck in the PUBLISHING state because executors are not shut down properly.

### Affected Version

Druid 25.0.0

### Description

We are running a Kafka supervisor ingestion task with task replication set to two.

- Tasks [A1, A2] are replica tasks of the same TaskGroup. After task rollover happens, a new set of tasks [B1, B2] is created to start ingesting, and TaskGroup [A1, A2] is added to the pending-completion task groups.
- Task A1 completed with SUCCESS, and a stop signal was sent to both A1 and A2. The async stop request sent to task A2 returned 200, and the indexer started stopping the task gracefully. But task A2 got stuck forever in the PUBLISHING state on the indexer.
- The Overlord detects A2 as an active task that is neither in an ActivelyReadingTaskGroup nor a PendingCompletionTaskGroup, so it creates a NEW PendingCompletionTaskGroup containing only task A2.
- Task A2 remained stuck in the PUBLISHING state on the indexer. After a timeout of `PendingCompletionTimeout` minutes, the task is forcefully killed. Since the Overlord sees that the task group's completion timeout has passed and the task did not succeed, it **KILLS THE ACTIVELY READING TASKS.**

### Why did task A2 get stuck in the PUBLISHING state?

I did some debugging; the following is the probable cause of the task getting stuck.

- When task A2 got the stop signal, the main thread was interrupted immediately. The A2 task was handing off segments at the time and had completed publishing.
- From the logs, we can see the indexer was handing off 3 segments, but it logged the "Dropped segment" message for only one of them:

```
2 Jul 2024 @ 03:05:55.583 UTC  Segment[S0] successfully handed off, dropping.                 indexer-pod  coordinator_handoff_scheduled_0
2 Jul 2024 @ 03:05:55.585 UTC  Segment[S1] successfully handed off, dropping.                 indexer-pod  coordinator_handoff_scheduled_0
2 Jul 2024 @ 03:05:55.587 UTC  Segment[S2] successfully handed off, dropping.                 indexer-pod  coordinator_handoff_scheduled_0
2 Jul 2024 @ 03:05:56.223 UTC  Stopping gracefully (status: [PUBLISHING])                     indexer-pod  thread
2 Jul 2024 @ 03:05:56.223 UTC  Closing reporter org.apache.kafka.common.metrics.JmxReporter  indexer-pod  [task_id]-threading-task-runner-executor-0
2 Jul 2024 @ 03:05:56.224 UTC  Shutting down immediately...                                   indexer-pod  main thread
2 Jul 2024 @ 03:05:56.258 UTC  Dropped segment[S0].                                           indexer-pod  [task_id]-appenderator-persist
```

- The task seems to be stuck waiting for the persist executor to shut down. As we can see, the last log line the main thread printed was `Shutting down immediately...`. After that, we [shut down the executors](https://github.com/apache/druid/blob/master/server/src/main/java/org/apache/druid/segment/realtime/appenderator/StreamAppenderator.java#L1081) and wait for them to terminate, **with a timeout of 365 days**:

```
Preconditions.checkState(
    persistExecutor == null || persistExecutor.awaitTermination(365, TimeUnit.DAYS),
    "persistExecutor not terminated"
);
```

- After `PendingCompletionTimeout` minutes, the task is forcefully killed and marked FAILED. It is then stopped again, which interrupts the main ingestion thread a second time.
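The blocking behavior described above can be reproduced in isolation. The sketch below (a minimal, hypothetical demo, not Druid code) shows why `awaitTermination` never returns when an in-flight executor task is itself blocked: `shutdown()` only stops the executor from accepting new tasks and does not interrupt the task that is already running, so if that task waits forever (like a drop/persist action waiting on a handoff that never finishes), `awaitTermination` simply times out.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class StuckExecutorDemo
{
    // Returns true only if the executor terminated within the timeout.
    public static boolean shutdownAndAwait(long timeout, TimeUnit unit) throws InterruptedException
    {
        // A latch that is never counted down, simulating work that never completes.
        CountDownLatch blocker = new CountDownLatch(1);
        ExecutorService persistExecutor = Executors.newSingleThreadExecutor();

        persistExecutor.submit(() -> {
            try {
                blocker.await(); // simulates a persist/drop task blocked forever
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // shutdown() only rejects new tasks; it does NOT interrupt the running one,
        // so awaitTermination blocks until its timeout elapses.
        persistExecutor.shutdown();
        boolean terminated = persistExecutor.awaitTermination(timeout, unit);
        persistExecutor.shutdownNow(); // interrupt the stuck task so the demo JVM can exit
        return terminated;
    }

    public static void main(String[] args) throws InterruptedException
    {
        // With a blocked in-flight task, this prints "false": the executor never terminates.
        System.out.println(shutdownAndAwait(500, TimeUnit.MILLISECONDS));
    }
}
```

With a 365-day timeout in place of the 500 ms above, the main thread would appear hung for all practical purposes, which matches the observed symptom.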
We then get this ERROR log line:

```
2 Jul 2024 @ 03:40:01.670 UTC Exception caught during execution indexer-pod threading-task-runner-executor-0
java.lang.RuntimeException: org.apache.druid.java.util.common.RE: Current thread is interrupted after [0] tries
    at org.apache.druid.storage.s3.S3TaskLogs.pushTaskFile(S3TaskLogs.java:156)
    at org.apache.druid.storage.s3.S3TaskLogs.pushTaskReports(S3TaskLogs.java:141)
    at org.apache.druid.indexing.overlord.ThreadingTaskRunner$1.call(ThreadingTaskRunner.java:223)
    at org.apache.druid.indexing.overlord.ThreadingTaskRunner$1.call(ThreadingTaskRunner.java:152)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.apache.druid.java.util.common.RE: Current thread is interrupted after [0] tries
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:148)
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:81)
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:163)
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:153)
    at org.apache.druid.storage.s3.S3Utils.retryS3Operation(S3Utils.java:101)
    at org.apache.druid.storage.s3.S3TaskLogs.pushTaskFile(S3TaskLogs.java:147)
    ... 7 more
```

This exception allowed the main thread to reach the [finally block](https://github.com/apache/druid/blob/master/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/SeekableStreamIndexTaskRunner.java#L910), remove the chat handlers, and stop the other task resources.
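The "interrupted after [0] tries" failure mode can be illustrated with a simplified retry loop. This is a hypothetical sketch, not the actual `org.apache.druid.java.util.common.RetryUtils` source: a retry helper that checks the thread's interrupt flag fails fast with zero attempts when the calling thread has already been interrupted, which is exactly what the second stop signal caused here.

```java
public class InterruptedRetryDemo
{
    // Simplified sketch: refuse to (re)try when the current thread is interrupted,
    // mirroring the "Current thread is interrupted after [0] tries" message in the
    // stack trace. NOT the real Druid RetryUtils implementation.
    public static <T> T retry(java.util.concurrent.Callable<T> op, int maxTries) throws Exception
    {
        int tries = 0;
        while (true) {
            if (Thread.currentThread().isInterrupted()) {
                // Fail fast: when the interrupt arrives before the first attempt,
                // tries is still 0, matching the observed log line.
                throw new RuntimeException("Current thread is interrupted after [" + tries + "] tries");
            }
            try {
                return op.call();
            }
            catch (Exception e) {
                if (++tries >= maxTries) {
                    throw e;
                }
            }
        }
    }

    public static String demo()
    {
        Thread.currentThread().interrupt(); // simulate the second stop interrupting the thread
        try {
            retry(() -> "pushed", 3);
            return "ok";
        }
        catch (Exception e) {
            return e.getMessage();
        }
        finally {
            Thread.interrupted(); // clear the flag so the demo thread stays usable
        }
    }

    public static void main(String[] args)
    {
        System.out.println(demo()); // prints: Current thread is interrupted after [0] tries
    }
}
```

In the real task, that exception propagated out of the S3 report push and let the main thread fall through to the `finally` block, which is why the task was cleaned up only after the forced kill rather than during the graceful stop.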