[GitHub] [druid] kfaraz opened a new pull request, #13197: Segment loading: Allow cancellation and prioritization of load queue items

GitBox Sun, 09 Oct 2022 19:48:46 -0700


kfaraz opened a new pull request, #13197:
URL: https://github.com/apache/druid/pull/13197


   Fixes multiple items in #12881
   
   ### Description
   
   This PR lays the ground work to allow load queue to safely have an unlimited 
number of items, and thus
   eventually phase out `maxSegmentsInNodeLoadingQueue` and 
`replicationThrottleLimit`.
   
   Load queue is already allowed to have unlimited items (by setting 
`maxSegmentsInNodeLoadingQueue`)
   but this leads to
   - each coordinator run taking a very long time
   - poor assignments which take several more runs to be rectified
   
   ### Changes
   
   #### Classes to review
   - `SegmentLoader`
   - `LoadRule`, `BroadcastDistributionRule`
   - `SegmentStateManager`
   - `LoadRuleTest`
   - `LoadQueuePeon`: http and curator
   
   #### Behavioral changes
   
   Change | Motivation
   --------|-------------
   Both loaded and loading items count towards replication. | Allow coordinator 
to take corrective action of removing superfluous replicas without waiting for 
them to be fully loaded.
   Load, drop or move operations can be cancelled. | - Allow move of loading 
items from queue of one server to another. <br> - Allow coordinator to take 
corrective actions quickly.
   During tier shift, drop of segments from unneeded tiers waits only for basic 
fault tolerance and not full replication of segment on new target tiers. | - 
Prevent replication from blocking drop of unneeded segments, thus allowing 
faster decommission of servers and freeing up disk space to load new segments. 
<br> - Fault tolerance is handled as the drops still wait for a replication 
factor of 2 on target tiers (unless rule specifies only 1 replica).
   LoadQueuePeon can distinguish between load of a primary, a replica or a load 
required for balancing. | - Allow prioritization of items, which becomes 
important if load queue size is unlimited. <br> - Avoid considering balancing 
items in load queue as over-replicated.
   `replicationThrottleLimit` does not act on first replica in any tier | 
Throttling first replica on a tier undermines the purpose of tiering. Tiering 
is not meant for fault tolerance, rather serving different query needs. Thus 
segments should be available on target tiers as soon as possible.
   ⚠️ `maxNonPrimaryReplicantsToLoad` does not act on first replica in any tier 
| This was done keeping in line with the changes to `replicationThrottleLimit` 
but it should probably be reverted to prevent unexpected behaviour for clusters 
using this config.
   `maxSegmentsInNodeLoadingQueue` to act per run rather than the current 
number of items in queue | Currently, if the configured load queue size is 
large enough to allow load of some segments while a coordinator run is in 
progress, the load queue limit is violated as there is always some room in the 
queue. This causes coordinator runs to get stuck cycling through all the 
segments in spite of a limited load queue.
   
   #### Structural changes
   
   Change | Motivation
   --------|-----------
   Add `SegmentLoader` which handles all the load, move and drop operations. 
The lifecycle of the loader is tied to a single coordinator run. In the changed 
code, it is instantiated once in every run of `RunRules` and `BalanceSegments`. 
| - Allow reuse of logic for loading, balancing and broadcasting. <br> - Single 
place to maintain state of a single run thus allowing better metrics and 
logging.
   Load rules just specify their desired state and leave the actual decision 
making to the `SegmentLoader`. | Simpler logic for load rules.
   Add `SegmentStateManager` that maintains state across coordinator runs and 
interacts with the load queues. | - Single place to interact with load queue 
and maintain state of all in-flight segments - Allow reporting of metrics from 
queue callbacks. - Prevent callbacks from holding references to items from the 
previous coordinator run.
   
   #### New metrics
   - `segment/cancelLoad/count`
   - `segment/cancelDrop/count`
   - `segment/broadcastLoad/count`
   - `segment/broadcastDrop/count`
   - some more being added
   
   ### Pending changes in this PR
   - Remove old dead code from `LoadRule`, `BroadcastDistributionRule`
   - Fix failing tests
   - Add more UTs and simulation tests for `SegmentLoader`
   - Double-check logs and connect up new metrics
   - Update metric and config documentation
   - Complete self-review
   
   ### Further work
   - Add prioritization strategy inside `LoadQueuePeon`s. Current 
prioritization is based only on time.
   Strategy could be something like:
     - action: drop > load (already implemented)
     - segment interval: new > old (already implemented)
     - load type: primary > replica > broadcast > balancing move
   - Fix balancing strategy to ensure that unlimited load queues do not cause 
coordinator runs to take forever.
   - Allow balancing strategy to pick moving segments (this has been allowed in 
this PR but not enabled in any of the existing strategies)
   - Add metrics to identify time spent by items in load queue
   
   <hr>
   
   This PR has:
   - [ ] been self-reviewed.
      - [ ] using the [concurrency 
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
 (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] kfaraz opened a new pull request, #13197: Segment loading: Allow cancellation and prioritization of load queue items

Reply via email to