maytasm commented on PR #14533: URL: https://github.com/apache/druid/pull/14533#issuecomment-2910546639
@kfaraz Our cluster has about 2000 Kafka indexing tasks and 3000 compaction-related indexing tasks running. When many tasks rolled over at the same time, the overlord did not process the task updates fast enough. In some instances it was slow enough that we exceeded the completionTimeout on the Kafka indexing tasks, causing them to fail. Before trying out this change, we managed it by configuring stopTaskCount on each supervisor and using taskDuration values that are prime numbers (to minimize tasks rolling over at the same time). The default value of 5 was inadequate for us. We set the config to 1/3 of the number of CPUs we have (which is about 30) and life has been great. `task/status/queue/count` has been stable and low (no spikes), we haven't had issues with completionTimeout, and we haven't needed to tune stopTaskCount and taskDuration.
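For readers wondering why prime taskDuration values reduce simultaneous rollovers, here is a minimal arithmetic sketch. It assumes, for illustration only, that all supervisors start their first tasks at the same moment and roll over exactly every taskDuration minutes; the function names and the example durations are hypothetical and not taken from our cluster. Under those assumptions, two supervisors roll over together every lcm(a, b) minutes, so coprime durations push coinciding rollovers far apart.

```python
from itertools import combinations
from math import lcm  # Python 3.9+


def minutes_between_joint_rollovers(durations_min):
    """Minutes until ALL supervisors roll over at the same instant again.

    Assumes every supervisor started at t=0 and rolls over exactly every
    `taskDuration` minutes (an idealization; real tasks drift).
    """
    return lcm(*durations_min)


def pairwise_coincidence(durations_min):
    """Map each pair of durations to the interval at which that pair coincides."""
    return {(a, b): lcm(a, b) for a, b in combinations(durations_min, 2)}


# Hypothetical example values, in minutes.
round_durations = [30, 60, 120]
prime_durations = [59, 61, 67]

if __name__ == "__main__":
    print(minutes_between_joint_rollovers(round_durations))  # 120
    print(minutes_between_joint_rollovers(prime_durations))  # 241133
    print(pairwise_coincidence(prime_durations))
```

With round durations, every supervisor in the example rolls over together every 2 hours; with the prime durations, even a single pair coincides only about every 2.5 days. This only spreads the load out. It does not make the overlord process each status update any faster.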
