kfaraz opened a new pull request, #14269:
URL: https://github.com/apache/druid/pull/14269
## Changes
The defaults of the following config values in the
`CoordinatorDynamicConfig` are being updated.
#### 1. `maxSegmentsInNodeLoadingQueue = 500` (previous = 100)
Rationale: With round-robin segment assignment now being the default
assignment technique, the Coordinator can assign a large number of
under-replicated/unavailable segments very quickly. Before round-robin, a large
queue size would cause the Coordinator to get stuck in the `RunRules` duty due to
very slow strategy-based cost computations.
#### 2. `replicationThrottleLimit = 500` (previous = 10)
Rationale: Along with the reasoning given for
`maxSegmentsInNodeLoadingQueue`, a very low `replicationThrottleLimit` can
cause clusters to be very slow in getting to full replication, even when there
are loading threads sitting idle.
Note: It is okay to keep this value equal to
`maxSegmentsInNodeLoadingQueue`. Even with equal values, load queues will not
get filled up with just replicas, and segments that are completely unavailable
will still get a fair chance. This is because `maxSegmentsInNodeLoadingQueue`
applies to a single server, whereas `replicationThrottleLimit` applies to each tier.
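
To make the note above concrete, here is a rough back-of-the-envelope sketch. It is a simplified model and not how the Coordinator actually schedules loads: it only compares the combined queue capacity of a tier against the per-tier replication cap. The function name and the server counts are hypothetical and chosen for illustration.

```python
def tier_queue_headroom(
    num_servers: int,
    max_segments_in_node_loading_queue: int = 500,
    replication_throttle_limit: int = 500,
) -> int:
    """Queue slots in a tier that replicas cannot occupy (simplified model).

    Assumption: every server in the tier may queue up to
    max_segments_in_node_loading_queue segments, while replicas across the
    whole tier are capped at replication_throttle_limit per run.
    """
    # Combined queue capacity across all servers of the tier.
    tier_capacity = num_servers * max_segments_in_node_loading_queue
    # Replicas can take at most the per-tier throttle limit of those slots.
    replica_slots = min(replication_throttle_limit, tier_capacity)
    return tier_capacity - replica_slots


# Hypothetical tier of 4 servers with the new defaults.
print(tier_queue_headroom(4))
```

Under this model, any tier with more than one server keeps queue slots that replicas cannot claim, which is consistent with the claim that equal values do not starve unavailable segments.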
#### 3. `maxSegmentsToMove = 100` (previous = 5)
Rationale: A very low value of this config (say 5) is ineffective at
balancing, especially when the cluster has a large number of segments
and/or a large skew between the usage of two Historical servers.
On the other hand, a very large value can cause excessive moves every
minute, which might have the following disadvantages:
- Load of moving segments competing with load of
unavailable/under-replicated segments
- Unnecessary network costs due to constant download and delete of segments
These defaults will be revisited after #13197 is merged.
## Testing
These values have been tried on different production cluster sizes, and have
been found to give satisfactory results.
#### Release note
Update default values of the following coordinator dynamic configs:
- `maxSegmentsInNodeLoadingQueue = 500`
- `maxSegmentsToMove = 100`
- `replicationThrottleLimit = 500`
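
For clusters that want these values before upgrading, or want to pin them explicitly, the following is a minimal sketch of how the payload could be built for the Coordinator dynamic config endpoint (`/druid/coordinator/v1/config`). The Coordinator address is a placeholder, and the request is only constructed, not sent. Check how your Druid version treats fields omitted from the payload before posting a partial config.

```python
import json
import urllib.request

# New default values listed in this PR.
NEW_DEFAULTS = {
    "maxSegmentsInNodeLoadingQueue": 500,
    "maxSegmentsToMove": 100,
    "replicationThrottleLimit": 500,
}


def build_dynamic_config_request(coordinator_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the dynamic config."""
    body = json.dumps(NEW_DEFAULTS).encode("utf-8")
    return urllib.request.Request(
        url=coordinator_url.rstrip("/") + "/druid/coordinator/v1/config",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "http://coordinator.example.com:8081" is a placeholder address.
request = build_dynamic_config_request("http://coordinator.example.com:8081")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To apply the config, pass `request` to urllib.request.urlopen(...).
```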
<hr>
This PR has:
- [ ] been self-reviewed.
- [ ] using the [concurrency
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
(Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] a release note entry in the PR description.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]