leventov opened a new issue #7159: On "balancing burst, then wait total 
loading" pattern and replication/loading cap in Coordinator balancing
URL: https://github.com/apache/incubator-druid/issues/7159
 
 
   Druid's segment balancing logic is split between two classes: 
`DruidCoordinatorBalancer` and `LoadRule`. Both of them skip segment loading and 
balancing until all their actions from the previous non-skipped run (which 
constituted a _balancing burst_) have completely settled down (per tier), i.e. 
until there are no loading segments on any of the historical nodes in the tier.
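   The per-tier gate described above can be sketched as follows (a minimal illustration; the class and method names are hypothetical, not Druid's actual API):

   ```java
   import java.util.Map;

   // Hypothetical sketch of the per-tier "burst, then wait" gate: a run is
   // skipped while any historical in the tier still has segments in its
   // loading queue.
   class BurstGate
   {
     /** serverLoadingCounts: currently loading segments per historical in one tier. */
     static boolean tierSettled(Map<String, Integer> serverLoadingCounts)
     {
       return serverLoadingCounts.values().stream().allMatch(n -> n == 0);
     }

     static boolean shouldRunBalancing(Map<String, Integer> serverLoadingCounts)
     {
       // Both DruidCoordinatorBalancer and LoadRule effectively apply this gate.
       return tierSettled(serverLoadingCounts);
     }
   }
   ```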
   
   The "hard" reason for obeying this pattern in `LoadRule` is that 
`SegmentReplicantLookup` is effectively immutable and recreated at the 
beginning of a Coordinator's cycle (see `CoordinatorHistoricalManagerRunnable`).
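   To illustrate the constraint, here is a hypothetical sketch of the snapshot pattern (the class name and single-count map are illustrative simplifications of `SegmentReplicantLookup`):

   ```java
   import java.util.Map;

   // Hypothetical sketch of the "hard" constraint: the lookup is an immutable
   // snapshot of replica counts built once at the start of a Coordinator cycle,
   // so decisions made mid-burst cannot be reflected back into it.
   class SnapshotReplicantLookup
   {
     private final Map<String, Integer> replicantsPerSegment;

     SnapshotReplicantLookup(Map<String, Integer> counts)
     {
       // Defensive immutable copy; the snapshot never changes after construction.
       this.replicantsPerSegment = Map.copyOf(counts);
     }

     int getReplicants(String segmentId)
     {
       return replicantsPerSegment.getOrDefault(segmentId, 0);
     }
   }
   ```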
   
   The reason for obeying this pattern in `DruidCoordinatorBalancer` is not 
documented anywhere, but as far as I can tell, it's done to make balancing 
decisions more optimal: otherwise we would quickly fill up the loading queues 
of the "universally best" historical nodes (e.g. because they've just entered 
the cluster and are still almost empty) and be forced to move segments to less 
optimal nodes. (This is also probably the primary criterion for choosing the 
`maxSegmentsToMove` configuration value, although the documentation is silent 
about this.) The same "soft" reason is implicitly in play in `LoadRule`, too.
   
   The problem is that `LoadRule` and `DruidCoordinatorBalancer` use 
independent mechanisms to implement the "burst - wait" pattern: `LoadRule` uses 
`ReplicationThrottler` (implicitly, in the 
[`canCreateReplicant()`](https://github.com/apache/incubator-druid/blob/e432965c13ef7799598d1ec2f88eb665b8fceaf5/server/src/main/java/org/apache/druid/server/coordinator/ReplicationThrottler.java#L88)
 method), while `DruidCoordinatorBalancer` uses its own 
`currentlyMovingSegments` field.
   
   As `ReplicationThrottler`'s name suggests, it's also responsible for capping 
the number of replicas that `LoadRule` can schedule for loading during a single 
burst, via the `replicationThrottleLimit` configuration. In other words, 
`replicationThrottleLimit` has exactly the same purpose as `maxSegmentsToMove`, 
yet the two limits don't mention each other and effectively add up.
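   The arithmetic of the two unreconciled caps can be made explicit (a trivial sketch; the method name is hypothetical):

   ```java
   // Hypothetical sketch: LoadRule caps replica creation via
   // replicationThrottleLimit, the balancer caps moves via maxSegmentsToMove,
   // and nothing reconciles them, so the total segment movement a single
   // burst can trigger is their sum.
   class BurstCaps
   {
     static int effectivePerBurstCap(int replicationThrottleLimit, int maxSegmentsToMove)
     {
       return replicationThrottleLimit + maxSegmentsToMove;
     }
   }
   ```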
   
   I see the solution in gathering all balancing logic into a single class and 
deprecating either `replicationThrottleLimit` or `maxSegmentsToMove` in favor 
of the other, making the remaining one the unified limit for balancing bursts.
   
   Also, there is a problem with waiting until exactly zero segments from the 
previous burst remain loading in a tier: struggling loaders may lengthen the 
pauses between bursts and thus lower the effective balancing and loading 
throughput. Instead, we could initiate the next burst when only 10-20% of the 
segments from the previous burst remain unloaded. We don't want to forgo the 
"burst - wait" pattern completely and simply make just enough balancing and 
loading decisions on each Coordinator run to add up to the unified limit, 
because that would keep the cluster permanently close to the threshold where 
balancing decisions become suboptimal, making them less optimal overall.
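   The relaxed gate could look roughly like this (a sketch under the assumption that we track the previous burst's size; names are illustrative):

   ```java
   // Hypothetical sketch of relaxing the wait: start the next burst once at
   // most some fraction (say 10-20%) of the previous burst is still loading,
   // instead of waiting for exactly zero remaining segments.
   class RelaxedBurstGate
   {
     static boolean mayStartNextBurst(
         int remainingFromPreviousBurst,
         int previousBurstSize,
         double remainingFraction
     )
     {
       if (previousBurstSize == 0) {
         return true; // nothing was in flight
       }
       return remainingFromPreviousBurst <= previousBurstSize * remainingFraction;
     }
   }
   ```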
   
   To be able to initiate the next burst before the previous one is complete, 
`SegmentReplicantLookup` should be made concurrently updatable: instead of 
being recreated before each Coordinator run, it becomes an ever-living 
concurrent data structure.
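   A minimal sketch of such a long-lived, concurrently updatable lookup, assuming replica counts keyed by (segment id, tier); the class and method names are hypothetical, not the existing `SegmentReplicantLookup` API:

   ```java
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.concurrent.ConcurrentMap;

   // Hypothetical sketch: replica counts are adjusted in place as loads and
   // drops are queued and complete, rather than being rebuilt from a snapshot
   // at the start of every Coordinator run.
   class ConcurrentSegmentReplicantLookup
   {
     private final ConcurrentMap<String, ConcurrentMap<String, Integer>> replicants =
         new ConcurrentHashMap<>();

     void addReplicant(String segmentId, String tier)
     {
       replicants
           .computeIfAbsent(segmentId, id -> new ConcurrentHashMap<>())
           .merge(tier, 1, Integer::sum);
     }

     void removeReplicant(String segmentId, String tier)
     {
       replicants.computeIfPresent(segmentId, (id, tiers) -> {
         // Decrement; remove the tier entry when the count drops to zero.
         tiers.computeIfPresent(tier, (t, n) -> n > 1 ? n - 1 : null);
         return tiers;
       });
     }

     int getReplicants(String segmentId, String tier)
     {
       ConcurrentMap<String, Integer> tiers = replicants.get(segmentId);
       return tiers == null ? 0 : tiers.getOrDefault(tier, 0);
     }
   }
   ```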
