kfaraz opened a new pull request, #13197:
URL: https://github.com/apache/druid/pull/13197
Fixes multiple items in #12881
### Description
This PR lays the ground work to allow load queue to safely have an unlimited
number of items, and thus
eventually phase out `maxSegmentsInNodeLoadingQueue` and
`replicationThrottleLimit`.
Load queue is already allowed to have unlimited items (by setting
`maxSegmentsInNodeLoadingQueue = 0`)
but this leads to
- each coordinator run taking a very long time
- poor assignments which take several more runs to be rectified
### Changes
#### Classes to review
- `SegmentLoader`
- `LoadRule`, `BroadcastDistributionRule`
- `SegmentStateManager`
- `LoadQueuePeon`: http and curator
- `SegmentHolder`
- `LoadRuleTest`
#### Behavioral changes
Change | Motivation
--------|-------------
Both loaded and loading items count towards replication. | Allow coordinator
to take corrective action of removing superfluous replicas without waiting for
them to be fully loaded.
Load, drop or move operations can be cancelled. | - Allow move of loading
items from queue of one server to another. <br> - Allow coordinator to take
corrective actions quickly.
During tier shift, always maintain the currently configured level of
replication, no matter which tier it happens to be on. | - Queue drop of
unneeded segments as soon as possible, thus allowing faster decommission of
servers and freeing up disk space to load new segments.<br>- Always maintain
target level of replication, thus ensuring that segment read concurrency does
not suffer
LoadQueuePeon prioritizes segment actions as DROP > LOAD > REPLICATE > MOVE
(i.e. balancing). | - Allow prioritization of items, which becomes important if
load queue size is unlimited. <br> - Avoid considering balancing items in load
queue as over-replicated.
`replicationThrottleLimit` does not act on a tier if the segment is not
loaded on that tier at all | Throttling first replica on a tier undermines the
purpose of tiering. Tiering is not meant for fault tolerance, rather serving
different query needs. Thus segments should be available on target tiers as
soon as possible.
⚠️ `maxNonPrimaryReplicantsToLoad` does not act on first replica in any tier
| This was done keeping in line with the changes to `replicationThrottleLimit`
but it should probably be reverted to prevent unexpected behaviour for clusters
using this config. cc: @capistrant
`maxSegmentsInNodeLoadingQueue` acts on the number of items assigned to the
load queue in the current run rather than the number of items present in the
queue at a given time. | Currently, if the configured load queue size is large
enough to allow load of some segments while a coordinator run is in progress,
the load queue limit is violated as there is always some room in the queue.
This causes coordinator runs to get stuck cycling through all the segments in
spite of a limited load queue.
#### Structural changes
Change | Motivation
--------|-----------
Add `SegmentLoader` which handles all the load, move and drop operations.
The lifecycle of the loader is tied to a single coordinator run. In the changed
code, it is instantiated once in every run of `RunRules` and `BalanceSegments`.
| - Allow reuse of logic for loading, balancing and broadcasting. <br> - Single
place to maintain state of a single run thus allowing better metrics and
logging.
Load rules just specify their desired state and leave the actual decision
making to the `SegmentLoader`. | Simpler logic for load rules.
Add `SegmentStateManager` that maintains state across coordinator runs and
interacts with the load queues. | - Single place to interact with load queue
and maintain state of all in-flight segments <br>- Allow reporting of metrics
from queue callbacks. <br>- Prevent callbacks from holding references to items
from the previous coordinator run.
#### New metrics
- `segment/cancelLoad/count`
- `segment/cancelDrop/count`
- `segment/broadcastLoad/count`
- `segment/broadcastDrop/count`
- some more being added
### Further work
- Fix timeout behaviour and corresponding tests
- Fix balancing strategy to ensure that unlimited load queues do not cause
coordinator runs to take forever.
- Allow balancing strategy to pick moving segments (this has been allowed in
this PR but not enabled in any of the existing strategies)
- Add metrics to identify time spent by items in load queue
<hr>
This PR has:
- [ ] been self-reviewed.
- [ ] using the [concurrency
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
(Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [x] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]