[GitHub] [druid] capistrant opened a new pull request #11135: Create dynamic config that can limit number of non-primary replicants loaded per coordination cycle

GitBox Mon, 19 Apr 2021 16:10:58 -0700


capistrant opened a new pull request #11135:
URL: https://github.com/apache/druid/pull/11135

<!-- If you are a committer, follow the PR action item checklist for
committers:

https://github.com/apache/druid/blob/master/dev/committer-instructions.md#pr-and-issue-action-item-checklist-for-committers.
-->

### Description

Add a new coordinator dynamic configuration that lets an operator set a hard
limit on the number of non-primary segment replicas loaded during a single
execution of `RunRules#run`. This bounds the amount of replica-loading work
that `RunRules` performs in one run. An operator might choose a non-default
value to ensure that major events, such as historical service(s) leaving the
cluster or large ingestion jobs, do not cause an abnormally long `RunRules`
execution compared to the cluster's baseline runtime.

**Example**

Cluster: 3 historical servers in `_default_tier` with 18k segments per server.
Each segment belongs to a datasource that has the load rule "LoadForever 2
replicas on `_default_tier`". The cluster load status is 100% loaded.

Event: 1 historical drops out of the cluster.

Today: The coordinator will load all 18k segments that are now
under-replicated in a single execution of `RunRules` (as long as throttling
limits are not hit and there is capacity).

My change: The coordinator loads only a limited number of these
under-replicated segments per run IF the operator has tuned the new dynamic
config down from its default. For instance, with a limit of 2k, it would take
at least 9 coordination cycles (18k / 2k) to fully replicate the segments
that were on the recently downed host.
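
The arithmetic behind the "at least 9 cycles" figure can be sketched as a
ceiling division. This is only an illustration: the class and method names
are hypothetical and do not come from the Druid code base, and
`Integer.MAX_VALUE` stands in for an assumed "no limit" default.

```java
// Hypothetical sketch; none of these names exist in Druid.
final class ReplicaCapSketch {

    /**
     * Minimum number of coordination cycles needed to load every pending
     * non-primary replica when at most maxPerCycle may be loaded per cycle.
     * It is a lower bound: other throttling limits or a lack of capacity
     * can only make recovery take longer.
     */
    static int minCyclesToFullyReplicate(int underReplicated, int maxPerCycle) {
        if (underReplicated < 0 || maxPerCycle <= 0) {
            throw new IllegalArgumentException(
                "underReplicated must be >= 0 and maxPerCycle must be > 0");
        }
        // Ceiling division in long arithmetic so a very large cap cannot overflow.
        return (int) (((long) underReplicated + maxPerCycle - 1) / maxPerCycle);
    }

    public static void main(String[] args) {
        // Scenario from the example: 18k under-replicated segments, cap of 2k.
        System.out.println(minCyclesToFullyReplicate(18_000, 2_000)); // prints 9

        // Assumed "no limit" default: everything fits in one cycle, as today.
        System.out.println(minCyclesToFullyReplicate(18_000, Integer.MAX_VALUE)); // prints 1
    }
}
```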

**Why**

Operators need to balance many competing needs. A fully replicated cluster is
great for HA. But if an event causes a coordination cycle to take 20 minutes
because it has to load thousands of replicas, we sacrifice the timeliness of
loading newly ingested segments that were inserted into the metastore after
that long cycle started. An operator who cares more about fresh-data
timeliness than replication status can set the new config to a value that
makes `RunRules` take less time per run, at the cost of more execution cycles
to bring the data back to full replication.

Really, what the change aims to do is give an operator more flexibility. As
written, the default gives the operator exactly the same behavior that they
see today.

**Design**

I folded this new configuration and feature into `ReplicationThrottler`. The
new limit is essentially replication throttling, just applied in a different
way from the current `ReplicationThrottler` functionality.
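
As a rough illustration of the idea (not the code in this PR), a per-run
budget for non-primary replicas could look like the sketch below. The class,
its methods, and the reset-at-start-of-run convention are all assumptions
made for the example; the real `ReplicationThrottler` API is not shown here.

```java
// Hypothetical sketch; this is not the actual ReplicationThrottler API.
final class NonPrimaryReplicaBudgetSketch {
    private final int maxNonPrimaryReplicasPerRun;
    private int loadedThisRun = 0;

    NonPrimaryReplicaBudgetSketch(int maxNonPrimaryReplicasPerRun) {
        if (maxNonPrimaryReplicasPerRun < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        this.maxNonPrimaryReplicasPerRun = maxNonPrimaryReplicasPerRun;
    }

    /** Assumed to be called once at the start of each rule-running cycle. */
    void startRun() {
        loadedThisRun = 0;
    }

    /**
     * Returns true and consumes one unit of budget if another non-primary
     * replica may be loaded in this run; returns false once the cap is hit.
     * Primary replica loads would not go through this check at all.
     */
    boolean tryAcquire() {
        if (loadedThisRun >= maxNonPrimaryReplicasPerRun) {
            return false;
        }
        loadedThisRun++;
        return true;
    }

    /** Number of non-primary replicas admitted so far in the current run. */
    int loadedThisRun() {
        return loadedThisRun;
    }
}
```

In this sketch, a load rule would call `tryAcquire()` before queuing each
non-primary replica and skip the replica for the current cycle when it returns
`false`, leaving it to be picked up in a later cycle.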

<hr>

##### Key changed/added classes in this PR
* `CoordinatorDynamicConfig`
* `ReplicationThrottler`
* `RunRules`
* `LoadRule`

<hr>

This PR has:
- [ ] been self-reviewed.
- [ ] using the [concurrency
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
(Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.

