harinirajendran edited a comment on issue #11414:
URL: https://github.com/apache/druid/issues/11414#issuecomment-1026275109


   > I see. Thank you for confirming it. Your analysis seems correct to me. Now 
I'm curious what notices the supervisor was processing 🙂
   
   @jihoonson @jasonk000 : I have some more updates on this issue. The 
supervisor is actually spending a lot of time processing `runNotices`, which 
causes the `checkpointNotice` to wait in the notices queue for a long time; the 
tasks then get stuck, resulting in ingestion lag.
   
   In our case, we have seen run notices take ~7s as shown in the graph below.
   ![Screen Shot 2022-01-31 at 4 13 50 
PM](https://user-images.githubusercontent.com/9054348/151882120-44685a65-4e6d-4ee1-bfbc-667f751eed8b.png)
   As a result, the notices queue gets backed up when the number of tasks is 
large, since each `runNotice` takes a long time to process.
   ![Screen Shot 2022-01-31 at 4 17 03 
PM](https://user-images.githubusercontent.com/9054348/151882291-6cea98e3-6494-464d-bb93-1af0559adbe6.png)
   
   On further analysis, we realized that the bulk of the ~7s of `run_notice` 
processing is spent in the 
[getAsyncStatus()](https://github.com/confluentinc/druid/blob/0.21.0-confluent/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java#L1442)
 call in the `discoverTasks` function. When a task boots up, it takes roughly 
~5s to start the JVM and the HTTP server. As a result, 
[these](https://github.com/confluentinc/druid/blob/0.21.0-confluent/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java#L1593)
 futures take about ~6s (including retries) to get the status of tasks that are 
still bootstrapping, which is why `runNotice` takes so long.
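   To make the retry effect concrete, here is a minimal sketch (names like `StatusPoller` and `getStatusWithRetry` are illustrative, not Druid's actual API): if a task's JVM and HTTP server need ~5s to come up, a status call made during that window fails and gets retried, so the future only completes once the task is reachable, and the caller pays the whole bootstrap delay.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical sketch of a status poll with retries. While the task is still
// booting, each call to statusCall throws; we retry until it succeeds, so the
// returned future resolves only after the task's HTTP server is up.
public class StatusPoller {
    static CompletableFuture<String> getStatusWithRetry(
            Supplier<String> statusCall, int maxRetries) {
        return CompletableFuture.supplyAsync(() -> {
            RuntimeException last = null;
            for (int attempt = 0; attempt < maxRetries; attempt++) {
                try {
                    return statusCall.get();   // e.g. HTTP GET /status
                } catch (RuntimeException e) {
                    last = e;                  // task not up yet; retry
                }
            }
            throw last;                        // exhausted retries
        });
    }
}
```

With many freshly launched tasks, `discoverTasks` effectively waits on the slowest of these futures, so the whole `runNotice` is gated on task bootstrap time.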
   
   So it is the tasks' bootstrap time, and hence their inability to respond to 
the supervisor's `/status` call, that causes `run_notice` to take ~6s; this 
backs up the notices queue, starving `checkpoint_notice` and producing 
ingestion lag. Does that make sense?
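   The starvation itself falls out of the queue being drained serially. A minimal sketch (not Druid code; the names are made up for illustration): a checkpoint notice enqueued behind N slow run notices waits roughly N times the per-notice processing time before it is even looked at.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical model of a single-threaded notices queue: each "run notice"
// sleeps to simulate slow processing, and we measure how long a checkpoint
// notice enqueued behind them waits before it starts being handled.
public class NoticeQueueDemo {
    interface Notice { void handle() throws InterruptedException; }

    static long checkpointDelayMs(int runNotices, long runNoticeMs)
            throws InterruptedException {
        BlockingQueue<Notice> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < runNotices; i++) {
            queue.add(() -> Thread.sleep(runNoticeMs)); // slow runNotice
        }
        long enqueued = System.nanoTime();
        long[] started = new long[1];
        queue.add(() -> started[0] = System.nanoTime()); // checkpoint notice
        while (!queue.isEmpty()) {
            queue.take().handle();                       // serial drain
        }
        return (started[0] - enqueued) / 1_000_000;
    }
}
```

Scaled to our numbers (dozens of tasks, ~7s per run notice), this is exactly the multi-minute checkpoint delay visible in the graphs above.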
   
   Have you seen something similar on your end? How long do Kafka real-time 
tasks take to bootstrap on your deployments? (Also, we currently use 
MiddleManagers rather than Indexers.)
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


