pjain1 opened a new issue #10068: URL: https://github.com/apache/druid/issues/10068
Found while investigating https://github.com/apache/druid/issues/10067.

RandomBalancerStrategy gets stuck in a loop when the number of replicants is greater than the number of nodes.

### Affected Version

All

### Description

Setup - I start with two empty historicals, each with a server size large enough to load one segment of size 4,821,713. The replication factor is set to 3. The segment gets loaded, but when the `RunRules` duty tries to find a place to load the 3rd replica of the segment, it gets stuck in a loop and never comes out. The `RunRules` duty does not run after that.

Here's the relevant thread dump showing where it gets stuck -

```
"Coordinator-Exec--0" #217 daemon prio=5 os_prio=31 tid=0x00007fc6c023c800 nid=0x29a03 runnable [0x000070001aafc000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.druid.server.coordinator.RandomBalancerStrategy.findNewSegmentHomeReplicator(RandomBalancerStrategy.java:40)
    at org.apache.druid.server.coordinator.rules.LoadRule.assignReplicasForTier(LoadRule.java:298)
    at org.apache.druid.server.coordinator.rules.LoadRule.assignReplicas(LoadRule.java:243)
    at org.apache.druid.server.coordinator.rules.LoadRule.assign(LoadRule.java:105)
    at org.apache.druid.server.coordinator.rules.LoadRule.run(LoadRule.java:78)
    at org.apache.druid.server.coordinator.duty.RunRules.run(RunRules.java:113)
    at org.apache.druid.server.coordinator.DruidCoordinator$DutiesRunnable.run(DruidCoordinator.java:710)
    at org.apache.druid.server.coordinator.DruidCoordinator$2.call(DruidCoordinator.java:570)
    at org.apache.druid.server.coordinator.DruidCoordinator$2.call(DruidCoordinator.java:563)
    at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$2.run(ScheduledExecutors.java:92)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```

The line of code where it gets stuck is https://github.com/apache/druid/blob/master/server/src/main/java/org/apache/druid/server/coordinator/RandomBalancerStrategy.java#L41. This is line 40 in my local codebase, which is why the thread dump shows `RandomBalancerStrategy.java:40`.
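
The thread dump shows where the coordinator spins but not why. One plausible reading, assumed here rather than confirmed by the issue, is that the strategy keeps drawing random servers until it finds one that does not already serve the segment. With two historicals that both hold the segment and a third replica requested, no such server exists, so a retry loop of that shape would never exit. The sketch below shows a bounded alternative under that assumption: filter out servers that already serve the segment first, and return an empty result when nothing is left. `ReplicaPlacementSketch`, `Server`, and `pickServer` are hypothetical names for illustration and are not Druid classes or methods.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Random;
import java.util.Set;

// Hypothetical stand-ins for illustration only; these are not Druid classes.
class ReplicaPlacementSketch
{
  static final class Server
  {
    final String name;
    final Set<String> servedSegments = new HashSet<>();

    Server(String name, String... segmentIds)
    {
      this.name = name;
      this.servedSegments.addAll(Arrays.asList(segmentIds));
    }
  }

  /**
   * Picks a random server that does not already serve the segment.
   * Candidates are filtered up front, so the method always returns,
   * even when every server already holds the segment.
   */
  static Optional<Server> pickServer(String segmentId, List<Server> servers, Random random)
  {
    List<Server> candidates = new ArrayList<>();
    for (Server server : servers) {
      if (!server.servedSegments.contains(segmentId)) {
        candidates.add(server);
      }
    }
    if (candidates.isEmpty()) {
      // Nowhere left to place another replica; the caller can skip it.
      return Optional.empty();
    }
    return Optional.of(candidates.get(random.nextInt(candidates.size())));
  }

  public static void main(String[] args)
  {
    // Mirrors the setup in this issue: two historicals, both already serving the segment.
    List<Server> servers = Arrays.asList(
        new Server("historical-1", "segment-A"),
        new Server("historical-2", "segment-A")
    );
    // Asking for a third replica finds no candidate and returns instead of spinning.
    System.out.println(pickServer("segment-A", servers, new Random()).isPresent()); // prints "false"
  }
}
```

Filtering before the random draw trades a small allocation per call for a guaranteed exit, and it lets the caller tell "no eligible server" apart from a successful pick.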
