kfaraz opened a new pull request, #13197:
URL: https://github.com/apache/druid/pull/13197

   Fixes multiple items in #12881
   
   ### Description
   
   This PR lays the ground work to allow load queue to safely have an unlimited 
number of items, and thus
   eventually phase out `maxSegmentsInNodeLoadingQueue` and 
`replicationThrottleLimit`.
   
   Load queue is already allowed to have unlimited items (by setting 
`maxSegmentsInNodeLoadingQueue = 0`)
   but this leads to
   - each coordinator run taking a very long time
   - poor assignments which take several more runs to be rectified
   
   ### Changes
   
   #### Classes to review
   - `SegmentLoader`
   - `LoadRule`, `BroadcastDistributionRule`
   - `SegmentStateManager`
   - `LoadQueuePeon`: http and curator
   - `SegmentHolder`
   - `LoadRuleTest`
   
   #### Behavioral changes
   
   Change | Motivation
   --------|-------------
   Both loaded and loading items count towards replication. | Allow coordinator 
to take corrective action of removing superfluous replicas without waiting for 
them to be fully loaded.
   Load, drop or move operations can be cancelled. | - Allow move of loading 
items from queue of one server to another. <br> - Allow coordinator to take 
corrective actions quickly.
   During tier shift, always maintain the currently configured level of 
replication, no matter which tier it happens to be on. | - Queue drop of 
unneeded segments as soon as possible, thus allowing faster decommission of 
servers and freeing up disk space to load new segments.<br>- Always maintain 
target level of replication, thus ensuring that segment read concurrency does 
not suffer
   LoadQueuePeon prioritizes segment actions as DROP > LOAD > REPLICATE > MOVE 
(i.e. balancing). | - Allow prioritization of items, which becomes important if 
load queue size is unlimited. <br> - Avoid considering balancing items in load 
queue as over-replicated.
   `replicationThrottleLimit` does not act on a tier if the segment is not 
loaded on that tier at all | Throttling first replica on a tier undermines the 
purpose of tiering. Tiering is not meant for fault tolerance, rather serving 
different query needs. Thus segments should be available on target tiers as 
soon as possible.
   ⚠️ `maxNonPrimaryReplicantsToLoad` does not act on first replica in any tier 
| This was done keeping in line with the changes to `replicationThrottleLimit` 
but it should probably be reverted to prevent unexpected behaviour for clusters 
using this config. cc: @capistrant 
   `maxSegmentsInNodeLoadingQueue` acts on the number of items assigned to the 
load queue in the current run rather than the number of items present in the 
queue at a given time. | Currently, if the configured load queue size is large 
enough to allow load of some segments while a coordinator run is in progress, 
the load queue limit is violated as there is always some room in the queue. 
This causes coordinator runs to get stuck cycling through all the segments in 
spite of a limited load queue.
   
   #### Structural changes
   
   Change | Motivation
   --------|-----------
   Add `SegmentLoader` which handles all the load, move and drop operations. 
The lifecycle of the loader is tied to a single coordinator run. In the changed 
code, it is instantiated once in every run of `RunRules` and `BalanceSegments`. 
| - Allow reuse of logic for loading, balancing and broadcasting. <br> - Single 
place to maintain state of a single run thus allowing better metrics and 
logging.
   Load rules just specify their desired state and leave the actual decision 
making to the `SegmentLoader`. | Simpler logic for load rules.
   Add `SegmentStateManager` that maintains state across coordinator runs and 
interacts with the load queues. | - Single place to interact with load queue 
and maintain state of all in-flight segments <br>- Allow reporting of metrics 
from queue callbacks. <br>- Prevent callbacks from holding references to items 
from the previous coordinator run.
   
   #### New metrics
   - `segment/cancelLoad/count`
   - `segment/cancelDrop/count`
   - `segment/broadcastLoad/count`
   - `segment/broadcastDrop/count`
   - some more being added
   
   ### Further work
   - Fix timeout behaviour and corresponding tests
   - Fix balancing strategy to ensure that unlimited load queues do not cause 
coordinator runs to take forever.
   - Allow balancing strategy to pick moving segments (this has been allowed in 
this PR but not enabled in any of the existing strategies)
   - Add metrics to identify time spent by items in load queue
   
   <hr>
   
   This PR has:
   - [ ] been self-reviewed.
      - [ ] using the [concurrency 
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
 (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to