maytasm commented on PR #14533:
URL: https://github.com/apache/druid/pull/14533#issuecomment-2910546639

   @kfaraz Our cluster has about 2000 Kafka indexing tasks and 3000
compaction-related indexing tasks running. When many tasks rolled over at the
same time, we saw the Overlord fail to process task status updates fast enough.
In some instances it was slow enough that we exceeded the completionTimeout on
the Kafka indexing tasks, causing them to fail. Before trying out this change,
we managed this by configuring stopTaskCount on each supervisor and using
taskDuration values that are prime numbers (to minimize the number of tasks
rolling over at the same time). The default value of 5 was inadequate for us.
We set the config to 1/3 of the number of CPUs we have (which is about 30), and
life has been great. `task/status/queue/count` has been stable and low (no
spikes), we haven't had any issues with completionTimeout, and we no longer
need to tune stopTaskCount and taskDuration.
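
To illustrate why prime taskDuration values helped, here is a rough sketch (not
our actual tooling). It assumes every supervisor starts at minute 0 and rolls
its tasks exactly at multiples of its taskDuration; the duration lists and the
function name are made up for the example.

```python
def coincident_rollovers(durations_min, horizon_min):
    """Count the minutes in (0, horizon_min] at which two or more
    supervisors roll their tasks at the same time.

    Assumption: all supervisors start at minute 0 and roll over exactly
    at multiples of their taskDuration (given here in minutes).
    """
    count = 0
    for t in range(1, horizon_min + 1):
        rolling = sum(1 for d in durations_min if t % d == 0)
        if rolling >= 2:
            count += 1
    return count


# Hypothetical taskDuration values, in minutes, for four supervisors.
round_durations = [30, 60, 60, 120]
prime_durations = [29, 59, 61, 127]

one_day = 24 * 60
print(coincident_rollovers(round_durations, one_day))
print(coincident_rollovers(prime_durations, one_day))
```

With the round durations, rollovers line up at every multiple of 60 minutes.
With the prime durations, the least common multiple of any pair is longer than
a day, so no two supervisors roll together within that window. In practice
supervisors do not start in lockstep, so this only shows the general effect.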


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

