leventov opened a new issue #7159: On "balancing burst, then wait total 
loading" pattern and replication/loading cap in Coordinator balancing
URL: https://github.com/apache/incubator-druid/issues/7159
 
 
   Druid's segment balancing logic is split between two classes: 
`DruidCoordinatorBalancer` and `LoadRule`. Both of them skip segment loading and 
balancing until all their actions from the previous non-skipped run (which 
constituted a _balancing burst_) have completely settled down (per tier), i.e. 
until there are no loading segments on any of the historical nodes in the tier.
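   The per-tier gate described above can be sketched as follows (a minimal illustration; the class and method names are hypothetical, not Druid's actual API):

   ```java
   import java.util.Map;

   // Hypothetical sketch of the per-tier "burst, then wait" gate: a run is
   // skipped while any historical in the tier still has segments in its
   // loading queue.
   class BurstGate
   {
     /** serverLoadingCounts: currently loading segments per historical in one tier. */
     static boolean tierSettled(Map<String, Integer> serverLoadingCounts)
     {
       return serverLoadingCounts.values().stream().allMatch(n -> n == 0);
     }

     static boolean shouldRunBalancing(Map<String, Integer> serverLoadingCounts)
     {
       // Both DruidCoordinatorBalancer and LoadRule effectively apply this gate.
       return tierSettled(serverLoadingCounts);
     }
   }
   ```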
   
   The "hard" reason for obeying this pattern in `LoadRule` is that 
`SegmentReplicantLookup` is effectively immutable and recreated at the 
beginning of a Coordinator's cycle (see `CoordinatorHistoricalManagerRunnable`).
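   To illustrate the constraint, here is a hypothetical sketch of the snapshot pattern (the class name and single-count map are illustrative simplifications of `SegmentReplicantLookup`):

   ```java
   import java.util.Map;

   // Hypothetical sketch of the "hard" constraint: the lookup is an immutable
   // snapshot of replica counts built once at the start of a Coordinator cycle,
   // so decisions made mid-burst cannot be reflected back into it.
   class SnapshotReplicantLookup
   {
     private final Map<String, Integer> replicantsPerSegment;

     SnapshotReplicantLookup(Map<String, Integer> counts)
     {
       // Defensive immutable copy; the snapshot never changes after construction.
       this.replicantsPerSegment = Map.copyOf(counts);
     }

     int getReplicants(String segmentId)
     {
       return replicantsPerSegment.getOrDefault(segmentId, 0);
     }
   }
   ```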
   
   The reason for obeying this pattern in `DruidCoordinatorBalancer` is not 
documented anywhere, but as far as I can tell, it's done to make balancing 
decisions more optimal: otherwise we would quickly fill up the loading queues 
of the "universally best" historical nodes (e.g. because they've just entered 
the cluster and are still almost empty) and be forced to move segments to less 
optimal nodes. (This is also probably the primary criterion for choosing the 
`maxSegmentsToMove` configuration value, although the documentation is silent 
about this.) The same "soft" reason is implicitly in play in `LoadRule`, too.
   
   The problem is that `LoadRule` and `DruidCoordinatorBalancer` use 
independent mechanisms to implement the "burst - wait" pattern: `LoadRule` uses 
`ReplicationThrottler` (implicitly, in the 
[`canCreateReplicant()`](https://github.com/apache/incubator-druid/blob/e432965c13ef7799598d1ec2f88eb665b8fceaf5/server/src/main/java/org/apache/druid/server/coordinator/ReplicationThrottler.java#L88)
 method), while `DruidCoordinatorBalancer` uses its own 
`currentlyMovingSegments` field.
   
   As `ReplicationThrottler`'s name suggests, it's also responsible for capping 
the number of replicas that `LoadRule` can schedule for loading during a single 
burst, via the `replicationThrottleLimit` configuration. In other words, 
`replicationThrottleLimit` has exactly the same purpose as `maxSegmentsToMove`, 
yet the two limits don't mention each other and effectively add up.
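   The arithmetic of the two unreconciled caps can be made explicit (a trivial sketch; the method name is hypothetical):

   ```java
   // Hypothetical sketch: LoadRule caps replica creation via
   // replicationThrottleLimit, the balancer caps moves via maxSegmentsToMove,
   // and nothing reconciles them, so the total segment movement a single
   // burst can trigger is their sum.
   class BurstCaps
   {
     static int effectivePerBurstCap(int replicationThrottleLimit, int maxSegmentsToMove)
     {
       return replicationThrottleLimit + maxSegmentsToMove;
     }
   }
   ```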
   
   I see the solution in gathering all balancing logic into a single class and 
deprecating either `replicationThrottleLimit` or `maxSegmentsToMove` in favor 
of the other, making the remaining one the unified limit for balancing bursts.
   
   Also, there is a problem with waiting until exactly zero segments from the 
previous burst remain loading in a tier: struggling loaders may lengthen the 
pauses between bursts and thus lower the effective balancing and loading 
throughput. Instead, we could initiate the next burst when only 10-20% of the 
segments from the previous burst remain unloaded. We don't want to forgo the 
"burst - wait" pattern completely and simply make just enough balancing and 
loading decisions on each Coordinator run to add up to the unified limit, 
because that would keep the cluster permanently close to the threshold where 
balancing decisions become suboptimal, making them less optimal overall.
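   The relaxed gate could look roughly like this (a sketch under the assumption that we track the previous burst's size; names are illustrative):

   ```java
   // Hypothetical sketch of relaxing the wait: start the next burst once at
   // most some fraction (say 10-20%) of the previous burst is still loading,
   // instead of waiting for exactly zero remaining segments.
   class RelaxedBurstGate
   {
     static boolean mayStartNextBurst(
         int remainingFromPreviousBurst,
         int previousBurstSize,
         double remainingFraction
     )
     {
       if (previousBurstSize == 0) {
         return true; // nothing was in flight
       }
       return remainingFromPreviousBurst <= previousBurstSize * remainingFraction;
     }
   }
   ```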
   
   To be able to initiate the next burst before the previous one is complete, 
`SegmentReplicantLookup` should be made concurrently updatable: instead of 
being recreated before each Coordinator run, it becomes an ever-living 
concurrent data structure.
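   A minimal sketch of such a long-lived, concurrently updatable lookup, assuming replica counts keyed by (segment id, tier); the class and method names are hypothetical, not the existing `SegmentReplicantLookup` API:

   ```java
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.concurrent.ConcurrentMap;

   // Hypothetical sketch: replica counts are adjusted in place as loads and
   // drops are queued and complete, rather than being rebuilt from a snapshot
   // at the start of every Coordinator run.
   class ConcurrentSegmentReplicantLookup
   {
     private final ConcurrentMap<String, ConcurrentMap<String, Integer>> replicants =
         new ConcurrentHashMap<>();

     void addReplicant(String segmentId, String tier)
     {
       replicants
           .computeIfAbsent(segmentId, id -> new ConcurrentHashMap<>())
           .merge(tier, 1, Integer::sum);
     }

     void removeReplicant(String segmentId, String tier)
     {
       replicants.computeIfPresent(segmentId, (id, tiers) -> {
         // Decrement; remove the tier entry when the count drops to zero.
         tiers.computeIfPresent(tier, (t, n) -> n > 1 ? n - 1 : null);
         return tiers;
       });
     }

     int getReplicants(String segmentId, String tier)
     {
       ConcurrentMap<String, Integer> tiers = replicants.get(segmentId);
       return tiers == null ? 0 : tiers.getOrDefault(tier, 0);
     }
   }
   ```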
