umisan commented on issue #15628: URL: https://github.com/apache/druid/issues/15628#issuecomment-2000099306
I ran into the same problem on Druid 25.0. After reading the code and running several experiments, I found a solution. In my case, the cause was a coordinator duty, namely RunRules.

The Coordinator runs some of its duties in a single thread, e.g.:
- LogUsedSegments
- UpdateCoordinatorStateAndPrepareCluster
- RunRules
- UnloadUnusedSegments
- MarkAsUnusedOvershadowedSegments
- BalanceSegments

These duties are executed by ScheduledExecutors, so if one duty runs for too long, the duties after it have to wait for it to finish. UpdateCoordinatorStateAndPrepareCluster discovers new historical nodes and changes their status so that they can load new segments. But once historical nodes that hold many segments go down, RunRules tries to load too many non-primary replicas onto the other nodes. This makes RunRules run for a very long time, and UpdateCoordinatorStateAndPrepareCluster cannot run again until RunRules finishes, so newly added historical nodes are not picked up in the meantime.

Here are my solutions:
- Set maxNonPrimaryReplicantsToLoad to a small value (the default is too large).
- Use round-robin segment assignment.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
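For anyone who wants to try the two settings from the comment above, here is a minimal sketch rather than a definitive procedure. It assumes both settings are fields of the Coordinator dynamic configuration (`maxNonPrimaryReplicantsToLoad` and `useRoundRobinSegmentAssignment`) served at `/druid/coordinator/v1/config`. The helper names, the base URL argument, and the value `100` are hypothetical placeholders, since the comment only says "small". The sketch reads the current configuration first and changes only those two fields so that other settings are left alone.

```python
import json
import urllib.request

# Assumed path of the Coordinator dynamic configuration endpoint.
CONFIG_PATH = "/druid/coordinator/v1/config"


def build_updated_config(current, max_non_primary, round_robin=True):
    """Return a copy of `current` with only the two fields from the comment changed."""
    updated = dict(current)  # copy, so the caller's dict is not mutated
    updated["maxNonPrimaryReplicantsToLoad"] = max_non_primary
    updated["useRoundRobinSegmentAssignment"] = round_robin
    return updated


def apply_to_coordinator(base_url, max_non_primary):
    """Fetch the current dynamic config, patch it, and POST it back.

    `base_url` is a placeholder such as "http://coordinator-host:8081".
    Returns the HTTP status code of the POST.
    """
    with urllib.request.urlopen(base_url + CONFIG_PATH) as resp:
        current = json.load(resp)
    body = json.dumps(build_updated_config(current, max_non_primary)).encode("utf-8")
    request = urllib.request.Request(
        base_url + CONFIG_PATH,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return resp.status


if __name__ == "__main__":
    # Dry run: print the payload without contacting a cluster.
    # 100 is an illustrative value only, not a recommendation from the comment.
    print(json.dumps(build_updated_config({}, 100), indent=2))
```

Running the script as-is only prints the payload; calling `apply_to_coordinator` is what would actually change a cluster, so check the field names against the documentation for your Druid version before using it.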
