kfaraz opened a new pull request, #13114:
URL: https://github.com/apache/druid/pull/13114

   Fixes item (1) in #12881 
   
   This PR must be merged after https://github.com/apache/druid/pull/13074
   
   ### Description
   
   The current implementation of segment balancing always often leads to 
over-replication.
   
   This can happen in the following (fairly common) scenario:
   - Segment is moving from server A to B for balancing
   - Load succeeds on B, callback is invoked to finish the move
   - If inventory view shows the segment loaded on B, drop on A is invoked
   - If inventory view for B is not updated yet, the drop on A is not invoked
   
   This is a frequent occurrence but under normal circumstances, it is not 
noticeable
   because load rules quickly drop the over-replicated segment from A. But if 
load rules get
   stuck for some reason (e.g. trying to reach target replication levels on a 
different tier),
   the number of over-replicated segments keeps increasing, thereby overloading 
historicals.
   
   
![over_replication](https://user-images.githubusercontent.com/18635897/183566987-6b3a2be9-05cc-4f70-9e4f-3da9796e2710.png)
   
   ### Changes
   1. For http loading, do not check the server inventory to determine if the 
drop should be invoked.
   The HTTP response status is enough to determine load success or failure.
   2. For zk-based loading,
       - continue to check the inventory because behaviour of zk loading is a 
little more error prone
       - remove path check in the callback as it is redundant given that the 
inventory is already being checked
       - some minor cleanup in the peon
   3. Fix the corresponding test in `SegmentLoadingNegativeTest`
   
   <hr>
   
   This PR has:
   - [ ] been self-reviewed.
      - [ ] using the [concurrency 
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
 (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [x] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to