kfaraz opened a new pull request, #13197:
URL: https://github.com/apache/druid/pull/13197
Fixes multiple items in #12881
### Description
This PR lays the ground work to allow load queue to safely have an unlimited
number of items, and thus
eventually phase out `maxSegmentsInNodeLoadingQueue` and
`replicationThrottleLimit`.
Load queue is already allowed to have unlimited items (by setting
`maxSegmentsInNodeLoadingQueue`)
but this leads to
- each coordinator run taking a very long time
- poor assignments which take several more runs to be rectified
### Changes
#### Classes to review
- `SegmentLoader`
- `LoadRule`, `BroadcastDistributionRule`
- `SegmentStateManager`
- `LoadRuleTest`
- `LoadQueuePeon`: http and curator
#### Behavioral changes
Change | Motivation
--------|-------------
Both loaded and loading items count towards replication. | Allow coordinator
to take corrective action of removing superfluous replicas without waiting for
them to be fully loaded.
Load, drop or move operations can be cancelled. | - Allow move of loading
items from queue of one server to another. <br> - Allow coordinator to take
corrective actions quickly.
During tier shift, drop of segments from unneeded tiers waits only for basic
fault tolerance and not full replication of segment on new target tiers. | -
Prevent replication from blocking drop of unneeded segments, thus allowing
faster decommission of servers and freeing up disk space to load new segments.
<br> - Fault tolerance is handled as the drops still wait for a replication
factor of 2 on target tiers (unless rule specifies only 1 replica).
LoadQueuePeon can distinguish between load of a primary, a replica or a load
required for balancing. | - Allow prioritization of items, which becomes
important if load queue size is unlimited. <br> - Avoid considering balancing
items in load queue as over-replicated.
`replicationThrottleLimit` does not act on first replica in any tier |
Throttling first replica on a tier undermines the purpose of tiering. Tiering
is not meant for fault tolerance, rather serving different query needs. Thus
segments should be available on target tiers as soon as possible.
⚠️ `maxNonPrimaryReplicantsToLoad` does not act on first replica in any tier
| This was done keeping in line with the changes to `replicationThrottleLimit`
but it should probably be reverted to prevent unexpected behaviour for clusters
using this config.
`maxSegmentsInNodeLoadingQueue` to act per run rather than the current
number of items in queue | Currently, if the configured load queue size is
large enough to allow load of some segments while a coordinator run is in
progress, the load queue limit is violated as there is always some room in the
queue. This causes coordinator runs to get stuck cycling through all the
segments in spite of a limited load queue.
#### Structural changes
Change | Motivation
--------|-----------
Add `SegmentLoader` which handles all the load, move and drop operations.
The lifecycle of the loader is tied to a single coordinator run. In the changed
code, it is instantiated once in every run of `RunRules` and `BalanceSegments`.
| - Allow reuse of logic for loading, balancing and broadcasting. <br> - Single
place to maintain state of a single run thus allowing better metrics and
logging.
Load rules just specify their desired state and leave the actual decision
making to the `SegmentLoader`. | Simpler logic for load rules.
Add `SegmentStateManager` that maintains state across coordinator runs and
interacts with the load queues. | - Single place to interact with load queue
and maintain state of all in-flight segments - Allow reporting of metrics from
queue callbacks. - Prevent callbacks from holding references to items from the
previous coordinator run.
#### New metrics
- `segment/cancelLoad/count`
- `segment/cancelDrop/count`
- `segment/broadcastLoad/count`
- `segment/broadcastDrop/count`
- some more being added
### Pending changes in this PR
- Remove old dead code from `LoadRule`, `BroadcastDistributionRule`
- Fix failing tests
- Add more UTs and simulation tests for `SegmentLoader`
- Double-check logs and connect up new metrics
- Update metric and config documentation
- Complete self-review
### Further work
- Add prioritization strategy inside `LoadQueuePeon`s. Current
prioritization is based only on time.
Strategy could be something like:
- action: drop > load (already implemented)
- segment interval: new > old (already implemented)
- load type: primary > replica > broadcast > balancing move
- Fix balancing strategy to ensure that unlimited load queues do not cause
coordinator runs to take forever.
- Allow balancing strategy to pick moving segments (this has been allowed in
this PR but not enabled in any of the existing strategies)
- Add metrics to identify time spent by items in load queue
<hr>
This PR has:
- [ ] been self-reviewed.
- [ ] using the [concurrency
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
(Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [x] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]