slfan1989 commented on PR #7541: URL: https://github.com/apache/ozone/pull/7541#issuecomment-2533409916
@errose28 @nandakumar131 Thank you very much for your response! Please allow me to clarify the motivation behind this PR. My intention is not to increase the complexity of SafeMode, but to address issues we have encountered in practice:

1. The Ozone SCM does not maintain a complete list of DataNodes, which can cause a problem: in large clusters, when we restart the SCM, some DataNodes may fail to register, and we have no way to locate them (since they are missing from the DataNode list). Currently, my approach is to manually compare the DataNode lists from the two SCMs, identify the DataNodes that failed to register, track them, and take appropriate action (in most cases, restarting those DataNodes). The purpose of retrieving the DataNode list from the Pipeline list is to identify any unregistered DataNodes, as shown in my screenshot.

2. The reason for improving the DataNodeSafeModeRule is that this rule is difficult to apply effectively in real-world usage. A concrete example:
   - Adding DataNodes: when our cluster reaches the 75% capacity threshold, we need to expand the number of DataNodes, typically scaling from 100 machines to 120 or 130.
   - Removing DataNodes: our cluster includes various types of machines, some of which have poor performance. We may need to take such machines offline for replacement, which reduces the number of DataNodes.

   So what value should we set for `hdds.scm.safemode.min.datanode`? It is quite difficult to assess. The default of 1 is certainly not ideal, but for a cluster of 100 machines, should it be 40, 50, or 60? That is also hard to determine. If we instead configure a proportion, the rule becomes more flexible: for example, if we expect 60% of the DataNodes to register, the absolute threshold adjusts automatically as the cluster grows or shrinks.

3. Grafana is a good solution, but it also has issues in large-scale cluster environments: each DataNode exposes a very large number of metrics. Even with a 30-second collection interval, a single DataNode generates a large volume of metric data. We currently have 5 clusters with over 3,000 DataNodes, which puts significant pressure on our collection system. At the moment we collect DataNode metrics every 5 minutes, so the collected data lags behind the actual state of the cluster.

I would really appreciate any suggestions you may have. Thank you once again!
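To make point 1 concrete, the comparison I do manually is essentially a set difference: DataNodes that appear in the pipeline list but are absent from the registered list are the ones that failed to register. A minimal sketch (the class name and the use of plain `String` identifiers are my own simplification; the real code would work with `DatanodeDetails`/UUIDs):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: find DataNodes that appear in the Pipeline list but never
// re-registered after an SCM restart. Identifiers are plain Strings
// here for illustration only.
public class UnregisteredDataNodeFinder {

    /** Returns the DataNodes present in pipelines but not registered. */
    public static Set<String> findUnregistered(Set<String> pipelineDataNodes,
                                               Set<String> registeredDataNodes) {
        Set<String> missing = new HashSet<>(pipelineDataNodes);
        missing.removeAll(registeredDataNodes);
        return missing;
    }
}
```

For example, if the pipelines reference `dn1`, `dn2`, `dn3` but only `dn1` and `dn3` have registered, the difference is `dn2`, which is exactly the node an operator needs to track and restart.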
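The ratio-based rule from point 2 could be sketched as follows. This is only an illustration of the idea with hypothetical names (`RatioDataNodeRule` is not the actual Ozone SafeModeRule API, and no such configuration key exists today):

```java
// Sketch of a percentage-based DataNode safemode threshold.
// The class name and constructor shape are hypothetical, not Ozone API.
public class RatioDataNodeRule {
    private final int expectedDataNodes; // total DataNodes the SCM expects
    private final double ratio;          // e.g. 0.6 => wait for 60% to register

    public RatioDataNodeRule(int expectedDataNodes, double ratio) {
        this.expectedDataNodes = expectedDataNodes;
        this.ratio = ratio;
    }

    /** Absolute number of registered DataNodes required to leave safemode. */
    public int requiredDataNodes() {
        return (int) Math.ceil(expectedDataNodes * ratio);
    }

    public boolean isSatisfied(int registeredDataNodes) {
        return registeredDataNodes >= requiredDataNodes();
    }
}
```

With 100 machines and a 60% ratio the threshold is 60; after scaling the cluster to 130 machines the same configuration yields 78, with no need to re-tune an absolute `hdds.scm.safemode.min.datanode` value.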
