sodonnel opened a new pull request, #4636: URL: https://github.com/apache/ozone/pull/4636
## What changes were proposed in this pull request?

We should make it possible to configure a global replication limit that caps the number of inflight containers pending creation. A larger cluster can sustain more inflight replication than a smaller one, so the limit should be a function of the number of datanodes in the cluster, the number of commands that can be queued per datanode, and a weighting factor.

For example, if each datanode can queue 20 replication commands and there are 100 nodes in the cluster, the natural limit is 20 * 100 = 2000. However, that assumes commands are queued evenly across all datanodes, which is unlikely. With a global limit we would prefer that not all datanodes are fully loaded with replication commands simultaneously, so we may want to impose a limit of half that number with a factor of 0.5, e.g. 20 * 100 * 0.5 = 1000 pending replications. At one extreme this would result in every datanode in the cluster having half its maximum tasks queued, but in practice some DNs are likely to be at their limit while others have fewer than half queued, or none.

If the limits were perfectly defined, so that a datanode completes all its queued work exactly at the end of the heartbeat interval, then halving the number of queued commands would keep the datanode busy for only half of each heartbeat interval. As the datanodes all heartbeat at different times, the busy and idle periods across all datanodes would combine into a load profile in which some datanodes are idle at any given moment, reducing the overall load on the cluster.

This PR introduces a default limit factor of 0.75; the global limit can be disabled by setting the factor to 0. Only inflight replications are throttled by this mechanism - container deletes are throttled by the per-datanode limits introduced some time back.
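
As a rough illustration of the arithmetic described above (this is not the code in this PR), the limit calculation could be sketched as follows. The function names `inflight_limit` and `can_schedule`, the rounding down to an integer, and the use of `None` to represent a disabled limit are all assumptions made for the example.

```python
from typing import Optional


def inflight_limit(datanode_count: int,
                   per_datanode_limit: int,
                   factor: float) -> Optional[int]:
    """Hypothetical helper: global cap on inflight replications.

    Returns None when the factor is 0 (or negative), which this sketch
    treats as "global limit disabled".
    """
    if factor <= 0:
        return None
    # Natural limit is datanodes * per-datanode queue size; the factor
    # scales it down. Rounding down is an assumption of this sketch.
    return int(datanode_count * per_datanode_limit * factor)


def can_schedule(current_inflight: int, limit: Optional[int]) -> bool:
    """Hypothetical check: may one more replication be scheduled?"""
    return limit is None or current_inflight < limit


if __name__ == "__main__":
    # The example from the description: 20 commands per DN, 100 DNs.
    print(inflight_limit(100, 20, 0.5))   # factor of 0.5
    print(inflight_limit(100, 20, 0.75))  # default factor in this PR
    print(inflight_limit(100, 20, 0))     # disabled
```

With 100 datanodes and 20 commands per datanode, a factor of 0.5 gives a cap of 1000 and the default factor of 0.75 gives 1500.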
## What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-8505

## How was this patch tested?

New unit tests
