harinirajendran edited a comment on issue #11414: URL: https://github.com/apache/druid/issues/11414#issuecomment-1026275109
> I see. Thank you for confirming it. Your analysis seems correct to me. Now I'm curious what notices the supervisor was processing 🙂

@jihoonson @jasonk000 : I have some more updates on this issue. The supervisor is spending a lot of time processing `runNotices`, which makes the `checkpointNotice` wait in the notices queue for a long time. Tasks get stuck waiting for it, and that results in ingestion lag. In our case, we have seen run notices take ~7s, as shown in the graph below. Because each `runNotice` takes that long to process, the notices queue backs up when the number of tasks is large.

On further analysis, we realized that the bulk of the 7s in `run_notice` processing is spent in the [getAsyncStatus()](https://github.com/confluentinc/druid/blob/0.21.0-confluent/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java#L1442) call in the `discoverTasks` function. When a task boots up, it takes roughly 5s to start the JVM and the HTTP server. As a result, [these](https://github.com/confluentinc/druid/blob/0.21.0-confluent/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java#L1593) `Futures` take about 6s, including retries, to get the status of tasks that are still bootstrapping, which is why `runNotice` takes so long.

So the chain is: a task that is still bootstrapping cannot respond to the supervisor's `/status` call, so `run_notice` takes ~6s, so the notices queue backs up, so `checkpoint_notice` is starved, and that causes the ingestion lag.

Does this make sense? Have you seen something similar on your end? How long do Kafka real-time tasks take to bootstrap in your deployments? (Also, we use Middle Managers as of today instead of Indexers.)
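To make the queueing effect concrete, here is a minimal back-of-the-envelope sketch. It is not Druid code: the class, enum, and method names are hypothetical, and it assumes a single consumer draining a FIFO queue with a fixed cost per run notice. It only shows how a checkpoint notice's wait grows with the number of slow run notices queued ahead of it.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class NoticeQueueSketch {
    // Hypothetical notice kinds; the names mirror the discussion, not Druid's classes.
    public enum NoticeType { RUN, CHECKPOINT }

    // Builds a FIFO queue with the given number of run notices ahead of one checkpoint notice.
    public static Queue<NoticeType> queueWithRunNoticesAhead(int runNoticesAhead) {
        Queue<NoticeType> queue = new ArrayDeque<>();
        for (int i = 0; i < runNoticesAhead; i++) {
            queue.add(NoticeType.RUN);
        }
        queue.add(NoticeType.CHECKPOINT);
        return queue;
    }

    // Walks the queue in FIFO order as a single consumer would and returns how long
    // (in ms) the first checkpoint notice waits before it starts being processed.
    // Assumption: every run notice costs the same fixed runNoticeMillis.
    public static long checkpointWaitMillis(Queue<NoticeType> queue, long runNoticeMillis) {
        long elapsed = 0;
        for (NoticeType notice : queue) {
            if (notice == NoticeType.CHECKPOINT) {
                return elapsed;
            }
            elapsed += runNoticeMillis;
        }
        return -1; // no checkpoint notice in the queue
    }

    public static void main(String[] args) {
        // Illustrative numbers only: 10 queued run notices at ~7s each.
        long wait = checkpointWaitMillis(queueWithRunNoticesAhead(10), 7_000L);
        System.out.println("checkpoint notice waits " + wait + " ms");
    }
}
```

With these illustrative numbers, the checkpoint notice waits 70 seconds before the supervisor even looks at it, and the wait scales linearly with both the queue depth and the per-notice cost.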
Having said this, @jasonk000 : I don't think the PRs you listed earlier (#12096 #12097 #12099) would solve the issue we are encountering, right?
