umisan commented on issue #15628:
URL: https://github.com/apache/druid/issues/15628#issuecomment-2000099306

   I had the same trouble on Druid 25.0.
   After reading the code and running several experiments, I found a solution to this 
phenomenon.
   In my case, it was caused by a coordinator duty, namely RunRules.
   The coordinator runs some of its duties in a single thread.
   For example:
   - LogUsedSegments
   - UpdateCoordinatorStateAndPrepareCluster
   - RunRules
   - UnloadUnusedSegments
   - MarkAsUnusedOvershadowedSegments
   - BalanceSegments
   
   These duties are executed by ScheduledExecutors.
   So, if one duty runs for too long, the duties that follow it have to wait for it to 
finish.
   UpdateCoordinatorStateAndPrepareCluster finds new historical nodes and 
changes their status so that they can load new segments.
   But once historical nodes that hold many segments go down, RunRules tries to 
load too many non-primary segment replicas onto other nodes.
   This makes RunRules run for a very long time, and 
UpdateCoordinatorStateAndPrepareCluster is not executed again until RunRules finishes.
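
   The blocking effect described above can be illustrated with a toy model. This is only a sketch, not Druid code: the `Duty` class, the `gap_between_runs` helper, the 60-second period and all run times are hypothetical values chosen for illustration. It assumes that all duties share one thread and that a new cycle cannot start before the previous one has ended.

```python
from dataclasses import dataclass


@dataclass
class Duty:
    name: str
    seconds: float  # hypothetical run time, not a measured value


def gap_between_runs(duties, period_seconds, duty_name):
    """Seconds between two consecutive starts of `duty_name` when all duties
    share one thread and a cycle starts only after the previous one ends."""
    if not any(d.name == duty_name for d in duties):
        raise ValueError(f"unknown duty: {duty_name}")
    cycle_seconds = sum(d.seconds for d in duties)
    # The next cycle starts at the scheduled period, or later if the
    # previous cycle is still running.
    return max(period_seconds, cycle_seconds)


healthy = [
    Duty("UpdateCoordinatorStateAndPrepareCluster", 1),
    Duty("RunRules", 5),
    Duty("BalanceSegments", 4),
]
# Same cycle, but RunRules is busy assigning a large number of replicas.
overloaded = [
    Duty("UpdateCoordinatorStateAndPrepareCluster", 1),
    Duty("RunRules", 1800),
    Duty("BalanceSegments", 4),
]

target = "UpdateCoordinatorStateAndPrepareCluster"
print(gap_between_runs(healthy, 60, target))     # 60
print(gap_between_runs(overloaded, 60, target))  # 1805
```

   In the second case, a historical node that joins right after the cycle starts is not picked up for about half an hour in this model, even though the nominal period is one minute.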
   
   Here are my solutions:
   - set maxNonPrimaryReplicantsToLoad to a small value (the default value is too 
large)
   - use round-robin segment assignment
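
   As a rough sketch of how the two settings above could be prepared, the snippet below merges them into a copy of the current coordinator dynamic configuration. Several things here are assumptions rather than facts from my setup: the field name `useRoundRobinSegmentAssignment` is what I believe the round-robin option is called in the dynamic configuration, the value 2000 is only illustrative, and `apply_overrides` is a hypothetical helper. I would expect the merged JSON to be submitted through the coordinator dynamic configuration API (`/druid/coordinator/v1/config`) or the web console, but please check the names against the documentation for your Druid version.

```python
import json

# Assumed dynamic configuration field names; verify them for your version.
OVERRIDES = {
    "maxNonPrimaryReplicantsToLoad": 2000,  # illustrative value only
    "useRoundRobinSegmentAssignment": True,
}


def apply_overrides(current_config, overrides=OVERRIDES):
    """Return a copy of the current dynamic config with the overrides applied.

    Merging into the existing config keeps the other fields as they are
    instead of dropping them from the submitted document.
    """
    merged = dict(current_config)
    merged.update(overrides)
    return merged


# Example input: a hypothetical excerpt of an existing dynamic config.
current = {
    "maxSegmentsToMove": 100,
    "maxNonPrimaryReplicantsToLoad": 2147483647,
}
print(json.dumps(apply_overrides(current), indent=2, sort_keys=True))
```

   The helper does not modify its input, so the original configuration can be kept around in case the change needs to be reverted.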


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

