kfaraz opened a new pull request, #13114: URL: https://github.com/apache/druid/pull/13114
Fixes item (1) in #12881 This PR must be merged after https://github.com/apache/druid/pull/13074 ### Description The current implementation of segment balancing always often leads to over-replication. This can happen in the following (fairly common) scenario: - Segment is moving from server A to B for balancing - Load succeeds on B, callback is invoked to finish the move - If inventory view shows the segment loaded on B, drop on A is invoked - If inventory view for B is not updated yet, the drop on A is not invoked This is a frequent occurrence but under normal circumstances, it is not noticeable because load rules quickly drop the over-replicated segment from A. But if load rules get stuck for some reason (e.g. trying to reach target replication levels on a different tier), the number of over-replicated segments keeps increasing, thereby overloading historicals.  ### Changes 1. For http loading, do not check the server inventory to determine if the drop should be invoked. The HTTP response status is enough to determine load success or failure. 2. For zk-based loading, - continue to check the inventory because behaviour of zk loading is a little more error prone - remove path check in the callback as it is redundant given that the inventory is already being checked - some minor cleanup in the peon 3. Fix the corresponding test in `SegmentLoadingNegativeTest` <hr> This PR has: - [ ] been self-reviewed. - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.) - [ ] added documentation for new or modified features or behaviors. - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links. - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md) - [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader. - [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met. - [ ] added integration tests. - [ ] been tested in a test Druid cluster. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
