weimingdiit commented on PR #7010:
URL: https://github.com/apache/ozone/pull/7010#issuecomment-2343179280

   > For this to work, you probably also need to adjust the number of 
replication threads in the DNs, otherwise the requests will simply queue at the 
DN side.
   > 
   > I am also not sure about the motivation for this - holding back 
replication during peak could end up with data loss if the problems are not 
repaired quickly enough. It also feels like the same goal could be achieved by 
making the replication parameter dynamically configurable and then adjusting 
them with an external command without needing to restart any services.
   > 
   > A better solution, although much more difficult, is that the cluster can 
adjust the replication rate based on the load the cluster is under.
   > 
   > Additionally, in many clusters, there could be full days that are off peak 
(eg Saturday and Sunday) plus during the night.
   > 
   > I feel it would be better to give this some more thought about other ways 
of solving the problem.
   
   
   @sodonnel Thank you for your comments and suggestions.
   
   I think the solution to this issue could be divided into two steps:
   
   Step 1: [It also feels like the same goal could be achieved by making the 
replication parameter dynamically configurable and then adjusting them with an 
external command without needing to restart any services.]
   
   I agree with this approach. With it, the two parameters added in this PR 
become unnecessary. We only need to make sure that the key replication-related 
parameters in SCM and the DNs are dynamically configurable, and then adjust 
them from external scripts. This method should solve most of the issues.
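
   As a rough sketch of what such an external script could decide, the 
snippet below picks a replication limit from the current day and time. The 
limit values, the 08:00-20:00 peak window, and all function names are 
assumptions for illustration only. They are not existing Ozone configuration 
keys or APIs, and pushing the chosen value to SCM and the DNs is left out.

```python
from datetime import datetime, time

# Hypothetical limits; real values depend on the cluster and its hardware.
PEAK_LIMIT = 2
OFF_PEAK_LIMIT = 10

# Assumed weekday peak window (local time).
PEAK_START = time(8, 0)
PEAK_END = time(20, 0)


def is_off_peak(now: datetime) -> bool:
    """Weekends are fully off-peak; weekdays are off-peak outside the window."""
    if now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (PEAK_START <= now.time() < PEAK_END)


def pick_replication_limit(now: datetime) -> int:
    """Return the replication limit the script would apply at this moment."""
    return OFF_PEAK_LIMIT if is_off_peak(now) else PEAK_LIMIT
```

   A scheduler (cron, for example) could call `pick_replication_limit` 
periodically and apply the result through whatever dynamic reconfiguration 
mechanism the services expose.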
   
   Step 2: [A better solution, although much more difficult, is that the 
cluster can adjust the replication rate based on the load the cluster is under.]
   
   As you mentioned, this requires fully dynamic control of the entire 
replication process based on load data, finding a proper balance between 
replication speed and the read/write latency caused by the extra IO. The open 
question is which metrics should be collected from the DNs: memory, CPU, IO 
(perhaps the most important)? Based on these metrics, we could determine which 
nodes should handle replication and at what speed. This method is indeed more 
elegant.
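
   To make the idea concrete, here is a minimal sketch of one possible way to 
turn per-DN metrics into a replication rate and a choice of nodes. The 
weights (IO counted most heavily), the rate range, and every name below are 
assumptions for discussion, not an existing Ozone mechanism.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeLoad:
    """Hypothetical per-DN load report; each value is a utilization in 0.0..1.0."""
    name: str
    cpu: float
    mem: float
    io: float


def load_score(node: NodeLoad) -> float:
    """Weighted load in 0.0..1.0, with IO weighted highest (an assumption)."""
    score = 0.2 * node.cpu + 0.2 * node.mem + 0.6 * node.io
    return min(max(score, 0.0), 1.0)


def replication_rate(node: NodeLoad, min_rate: int = 1, max_rate: int = 10) -> int:
    """Scale the rate linearly: idle nodes get max_rate, saturated ones min_rate."""
    return min_rate + round((max_rate - min_rate) * (1.0 - load_score(node)))


def pick_targets(nodes: List[NodeLoad], count: int) -> List[NodeLoad]:
    """Prefer the least loaded nodes for replication work."""
    return sorted(nodes, key=load_score)[:count]
```

   A linear mapping is the simplest choice here; a real implementation would 
probably need smoothing so that the rate does not oscillate with short load 
spikes.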
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

