capistrant commented on PR #11135: URL: https://github.com/apache/druid/pull/11135#issuecomment-1206851194
> @capistrant, I was taking a look at the `maxNonPrimaryReplicantsToLoad` config but I couldn't really distinguish it from `replicationThrottleLimit`.
>
> I see that you have made a similar observation here:
>
> > I folded this new configuration and feature into ReplicationThrottler. That is essentially what it is doing, just in a new way compared to the current ReplicationThrottler functionality.
>
> Could you please help me understand the difference between the two? In which case would we want to tune this config rather than tuning the `replicationThrottleLimit` itself?

My observation is that `maxNonPrimaryReplicantsToLoad` is a new way of throttling replication, not that it does the same thing as `replicationThrottleLimit`.

`replicationThrottleLimit` is a limit on the number of in-progress replica loads at any one time during RunRules. We track the in-progress loads in a list. Items are removed from that list when a `LoadQueuePeon` issues a callback to remove them on completion of the load.

`maxNonPrimaryReplicantsToLoad` is a hard limit on the number of replica loads during RunRules. Once it is hit, no more non-primary replicas are created for the rest of RunRules.

You'd want to tune `maxNonPrimaryReplicantsToLoad` if you want to put an upper bound on the work the coordinator does to load non-primary replicas per execution of RunRules. The reason we use it at my org is that we want the coordinator to avoid "putting its head in the sand" and loading replicas for an undesirable amount of time instead of finishing its duties and refreshing its metadata. An example of an "undesirable amount of work" is when a Historical drops out of the cluster momentarily while the Coordinator is refreshing its `SegmentReplicantLookup`. The Coordinator all of a sudden thinks X segments are under-replicated.
But if the Historical is coming back online (say, after a restart to deploy new configs), we don't want the Coordinator to spin and load those X segments when it could just finish its duties and notice that the segments are not under-replicated anymore.

I'm not aware of reasons for using `replicationThrottleLimit`. It didn't meet my org's needs for throttling replication, which is why I introduced the new config. I guess it is a way to avoid flooding the cluster with replica loads? My clusters have actually tuned that value up to avoid hitting the low default. We don't care about the number of in-flight loads; we just care about limiting the total number of replica loads per RunRules execution.

Let me know if that clarification is still not making sense.
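
To make the distinction above concrete, here is a minimal toy model in Python. It is only a sketch of the two behaviours as described in this comment, not Druid's actual `ReplicationThrottler` code: the class `ToyReplicationThrottler`, the function `simulate_run`, and the method names are all hypothetical, and details such as per-tier accounting are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class ToyReplicationThrottler:
    """Toy model of the two limits; NOT Druid's real implementation."""

    # Cap on replica loads that may be in flight at the same time.
    replication_throttle_limit: int
    # Hard cap on replica loads assigned during a single run.
    max_non_primary_replicants_to_load: int
    in_flight: set = field(default_factory=set)
    assigned_this_run: int = 0

    def start_run(self):
        # In-flight loads carry over between runs; the per-run counter resets.
        self.assigned_this_run = 0

    def try_assign(self, segment_id):
        # Per-run hard cap: once reached, nothing more is assigned this run.
        if self.assigned_this_run >= self.max_non_primary_replicants_to_load:
            return False
        # In-flight cap: only blocks while earlier loads are still pending.
        if len(self.in_flight) >= self.replication_throttle_limit:
            return False
        self.in_flight.add(segment_id)
        self.assigned_this_run += 1
        return True

    def on_load_complete(self, segment_id):
        # Stands in for the completion callback issued by a load queue peon.
        self.in_flight.discard(segment_id)


def simulate_run(throttler, segment_ids, loads_complete_immediately):
    """Simulate one run over under-replicated segments (hypothetical helper)."""
    throttler.start_run()
    assigned = []
    for segment_id in segment_ids:
        if throttler.try_assign(segment_id):
            assigned.append(segment_id)
            if loads_complete_immediately:
                throttler.on_load_complete(segment_id)
    return assigned


if __name__ == "__main__":
    segments = [f"seg-{i}" for i in range(100)]

    # Fast loads, generous per-run cap: the in-flight limit never binds.
    fast = ToyReplicationThrottler(10, 1000)
    print(len(simulate_run(fast, segments, True)))    # 100

    # Fast loads, per-run cap of 25: work per run is bounded.
    capped = ToyReplicationThrottler(10, 25)
    print(len(simulate_run(capped, segments, True)))  # 25
```

In this toy model, when loads finish quickly the in-flight limit alone does not bound how many replicas a single run assigns (all 100 are assigned), while the per-run cap stops at 25 regardless of how fast loads complete. That is the behaviour the comment describes wanting for the "Historical briefly drops out" case.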
