weimingdiit commented on PR #7010: URL: https://github.com/apache/ozone/pull/7010#issuecomment-2343179280
> For this to work, you probably also need to adjust the number of replication threads in the DNs, otherwise the requests will simply queue at the DN side.
>
> I am also not sure about the motivation for this - holding back replication during peak could end up with data loss if the problems are not repaired quickly enough. It also feels like the same goal could be achieved by making the replication parameter dynamically configurable and then adjusting them with an external command without needing to restart any services.
>
> A better solution, although much more difficult, is that the cluster can adjust the replication rate based on the load the cluster is under.
>
> Additionally, in many clusters, there could be full days that are off peak (eg Saturday and Sunday) plus during the night.
>
> I feel it would be better to give this some more thought about other ways of solving the problem.

@sodonnel Thank you for your comments and suggestions. I think the solution to this issue could be divided into two steps:

Step 1: [It also feels like the same goal could be achieved by making the replication parameter dynamically configurable and then adjusting them with an external command without needing to restart any services.]

I agree with this approach. In this way, the two newly added parameters in the aforementioned PR are unnecessary. We just need to ensure that the key parameters related to replication in SCM and DN are dynamically configurable, and then control them through external scripts. This method should solve most of the issues.

Step 2: [A better solution, although much more difficult, is that the cluster can adjust the replication rate based on the load the cluster is under.]

As you mentioned, it requires fully dynamic control of the entire replication process based on certain load data, finding a proper balance between replication speed and the read-write latency caused by IO. But what metrics should be collected from the DN in this case?
Memory, CPU, IO (perhaps the most important)? Based on these metrics, we could determine which nodes should handle the replication and at what speed. This method is indeed more elegant.
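
To make the Step 2 idea easier to discuss, here is a minimal sketch of how DN load metrics might be mapped to a replication thread count. Everything in it is an assumption for illustration: the class `ReplicationThrottleSketch`, the `DnLoad` fields, the weights, and the linear scaling are hypothetical and do not correspond to any existing Ozone class or configuration key.

```java
// Hypothetical sketch only: none of these names exist in Apache Ozone.
public final class ReplicationThrottleSketch {

    /** One load sample from a DN; each field is a utilization in [0.0, 1.0]. */
    public static final class DnLoad {
        final double cpu;
        final double memory;
        final double io;

        public DnLoad(double cpu, double memory, double io) {
            this.cpu = cpu;
            this.memory = memory;
            this.io = io;
        }
    }

    // Assumed weights: IO dominates, following the guess that IO matters most.
    private static final double W_CPU = 0.2;
    private static final double W_MEM = 0.1;
    private static final double W_IO = 0.7;

    private ReplicationThrottleSketch() {
    }

    /** Weighted load score in [0.0, 1.0]; higher means the DN is busier. */
    public static double loadScore(DnLoad load) {
        double score = W_CPU * clamp(load.cpu)
                + W_MEM * clamp(load.memory)
                + W_IO * clamp(load.io);
        return clamp(score);
    }

    /**
     * Scales the replication thread count linearly from maxThreads (idle DN)
     * down to minThreads (saturated DN). minThreads must be at least 1 so
     * that replication is slowed but never paused.
     */
    public static int targetThreads(DnLoad load, int minThreads, int maxThreads) {
        if (minThreads < 1 || maxThreads < minThreads) {
            throw new IllegalArgumentException("require 1 <= minThreads <= maxThreads");
        }
        double headroom = 1.0 - loadScore(load);
        return minThreads + (int) Math.round(headroom * (maxThreads - minThreads));
    }

    private static double clamp(double v) {
        return Math.max(0.0, Math.min(1.0, v));
    }
}
```

Requiring `minThreads` to be at least 1 means replication slows down under load but never stops completely, which is one way to address the data-loss concern raised in the quoted comment. Whether a simple weighted score is good enough, or per-disk IO latency would be a better signal, is still an open question.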
