harinirajendran edited a comment on issue #11414:
URL: https://github.com/apache/druid/issues/11414#issuecomment-1026275109


   > I see. Thank you for confirming it. Your analysis seems correct to me. Now 
I'm curious what notices the supervisor was processing 🙂
   
   @jihoonson @jasonk000 : I have some more updates on this issue. The 
supervisor is actually spending a lot of time processing `runNotices`, which 
causes the `checkpointNotice` to wait in the notices queue for a long time; the 
tasks then get stuck, resulting in ingestion lag.
   
   In our case, we have seen run notices take ~7s as shown in the graph below.
   ![Screen Shot 2022-01-31 at 4 13 50 
PM](https://user-images.githubusercontent.com/9054348/151882120-44685a65-4e6d-4ee1-bfbc-667f751eed8b.png)
   As a result, the notices queue gets backed up when the number of tasks is 
large, since each `runNotice` takes a long time to process.
   ![Screen Shot 2022-01-31 at 4 17 03 
PM](https://user-images.githubusercontent.com/9054348/151882291-6cea98e3-6494-464d-bb93-1af0559adbe6.png)
   
   On further analysis, we realized that the bulk of the ~7s of `run_notice` 
processing is spent in the 
[getAsyncStatus()](https://github.com/confluentinc/druid/blob/0.21.0-confluent/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java#L1442)
 call in the `discoverTasks` function. When a task boots up, it takes roughly 
~5s to start the JVM and the HTTP server. As a result, 
[these](https://github.com/confluentinc/druid/blob/0.21.0-confluent/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java#L1593)
 futures take about ~6s (including retries) to get the status of tasks that are 
still bootstrapping, which is why `runNotice` takes so long.
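   To make the retry effect concrete, here is a minimal sketch (names like `StatusPoller` and `getStatusWithRetry` are illustrative, not Druid's actual API): if a task's JVM and HTTP server need ~5s to come up, a status call made during that window fails and gets retried, so the future only completes once the task is reachable, and the caller pays the whole bootstrap delay.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical sketch of a status poll with retries. While the task is still
// booting, each call to statusCall throws; we retry until it succeeds, so the
// returned future resolves only after the task's HTTP server is up.
public class StatusPoller {
    static CompletableFuture<String> getStatusWithRetry(
            Supplier<String> statusCall, int maxRetries) {
        return CompletableFuture.supplyAsync(() -> {
            RuntimeException last = null;
            for (int attempt = 0; attempt < maxRetries; attempt++) {
                try {
                    return statusCall.get();   // e.g. HTTP GET /status
                } catch (RuntimeException e) {
                    last = e;                  // task not up yet; retry
                }
            }
            throw last;                        // exhausted retries
        });
    }
}
```

With many freshly launched tasks, `discoverTasks` effectively waits on the slowest of these futures, so the whole `runNotice` is gated on task bootstrap time.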
   
   So it is the tasks' bootstrap time, and hence their inability to respond to 
the supervisor's `/status` call, that causes `run_notice` to take ~6s; this 
backs up the notices queue, starving `checkpoint_notice` and producing 
ingestion lag. Does that make sense?
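   The starvation itself falls out of the queue being drained serially. A minimal sketch (not Druid code; the names are made up for illustration): a checkpoint notice enqueued behind N slow run notices waits roughly N times the per-notice processing time before it is even looked at.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical model of a single-threaded notices queue: each "run notice"
// sleeps to simulate slow processing, and we measure how long a checkpoint
// notice enqueued behind them waits before it starts being handled.
public class NoticeQueueDemo {
    interface Notice { void handle() throws InterruptedException; }

    static long checkpointDelayMs(int runNotices, long runNoticeMs)
            throws InterruptedException {
        BlockingQueue<Notice> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < runNotices; i++) {
            queue.add(() -> Thread.sleep(runNoticeMs)); // slow runNotice
        }
        long enqueued = System.nanoTime();
        long[] started = new long[1];
        queue.add(() -> started[0] = System.nanoTime()); // checkpoint notice
        while (!queue.isEmpty()) {
            queue.take().handle();                       // serial drain
        }
        return (started[0] - enqueued) / 1_000_000;
    }
}
```

Scaled to our numbers (dozens of tasks, ~7s per run notice), this is exactly the multi-minute checkpoint delay visible in the graphs above.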
   
   Have you seen something similar on your end? How long do Kafka real-time 
tasks take to bootstrap on your deployments? (Also, we currently use 
MiddleManagers rather than Indexers.)
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


