slfan1989 commented on PR #7541:
URL: https://github.com/apache/ozone/pull/7541#issuecomment-2533409916

   @errose28 @nandakumar131 
   
   Thank you very much for your response! However, please allow me to provide 
some clarification regarding this PR. My intention is not to increase the 
complexity of SafeMode, but rather to address some issues that have arisen 
during its use:
   
   The reasons I added this feature are as follows:
   
   1. The Ozone SCM does not maintain a complete list of DataNodes, which can 
lead to a problem: in large clusters, when we restart the SCM, some DataNodes 
may fail to register, and we have no way to identify which DataNodes are 
missing (since they never appear in the registered DataNode list).
   
   Currently, my approach is to manually compare the DataNode lists from the 
two SCMs, identify the DataNodes that failed to register, track these 
DataNodes, and take appropriate actions (in most cases, this involves 
restarting the DataNodes).
   
   The purpose of retrieving the DataNode list from the Pipeline List is to 
identify any unregistered DataNodes, as shown in my screenshot.
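   The manual comparison described above can be sketched as a small script. This is only an illustration, not code from the PR; `expected` and `registered` are hypothetical inputs standing in for the full DataNode set (e.g. recovered from the Pipeline List) and the set currently registered with the restarted SCM:

   ```python
   # Sketch of diffing two DataNode lists to find unregistered nodes.
   # Input names are hypothetical; real lists would come from the SCM
   # admin CLI or the pipeline report.

   def find_unregistered(expected_nodes, registered_nodes):
       """Return DataNodes that are expected but have not registered."""
       return sorted(set(expected_nodes) - set(registered_nodes))

   expected = ["dn-001", "dn-002", "dn-003", "dn-004"]
   registered = ["dn-001", "dn-003"]

   print(find_unregistered(expected, registered))  # → ['dn-002', 'dn-004']
   ```

   The operator can then restart or investigate exactly the nodes in the diff instead of scanning dashboards by hand.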
   
   2. The reason for improving the DataNodeSafeModeRule is that this rule is 
difficult to apply effectively in real-world usage. Let me provide a specific 
example:
   
   - Adding DataNodes:  
   When our cluster reaches the 75% threshold, we need to expand the number of 
DataNodes, typically scaling from 100 machines to 120 or 130.
   
   - Reducing DataNodes:  
   Our cluster may include various types of machines, some of which have poor 
performance. In such cases, we may need to take some machines offline for 
replacement, which could result in a reduction in the number of DataNodes.
   
   So, what value should we set for the parameter 
`hdds.scm.safemode.min.datanode`? It is quite difficult to choose. The default 
value of 1 is certainly not ideal, but for a cluster with 100 machines, should 
we set it to 40, 50, or 60? That is equally hard to determine, and the value 
becomes stale every time the cluster is scaled.
   
   However, if we set a proportion instead, the rule becomes more flexible. For 
example, if we expect 60% of the DataNodes to register, the absolute threshold 
tracks the cluster size automatically as nodes are added or removed.
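   As a rough sketch of the idea (not the PR's actual implementation; the function and parameter names here are made up for illustration), a percentage-based rule derives the absolute registration threshold from the expected node count, so it stays correct after scaling:

   ```python
   # Sketch of a percentage-based DataNode safemode threshold.
   # "expected_total" and "percent" are hypothetical names; the real
   # configuration key and rule class in Ozone may differ.

   def required_datanodes(expected_total, percent):
       """Ceiling of expected_total * percent / 100, using integer
       arithmetic to avoid floating-point surprises."""
       return -(-expected_total * percent // 100)

   def rule_satisfied(registered, expected_total, percent=60):
       """Exit safemode once enough DataNodes have registered."""
       return registered >= required_datanodes(expected_total, percent)

   # A fixed min.datanode=60 is wrong after scaling 100 -> 130 nodes,
   # while a 60% ratio adjusts on its own:
   print(required_datanodes(100, 60))  # → 60
   print(required_datanodes(130, 60))  # → 78
   ```

   With the ratio, operators no longer need to re-tune the safemode configuration after every expansion or decommission.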
   
   3. Grafana is a good solution, but it also has issues in large-scale 
cluster environments: each DataNode exposes a very large number of metrics. 
Even with a 30-second collection interval, a single DataNode generates a large 
volume of metric data. We currently operate 5 clusters with over 3,000 
DataNodes, which puts significant pressure on our collection system. As a 
result, we have had to lengthen the DataNode collection interval to 5 minutes, 
so the dashboards lag behind the actual state of the cluster.
   
   I would really appreciate any suggestions you may have. Thank you once again!
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]