slfan1989 commented on PR #7499: URL: https://github.com/apache/ozone/pull/7499#issuecomment-2513491086
Everyone's points are reasonable, and I understand each perspective. However, I agree more with @errose28's viewpoint. I believe setting this value to -1 is the better choice, because users should not face the risk of DN crashes. Our largest cluster has 1,500 machines, and some of them experience disk failures every day. If a DN crashes outright, my first reaction as an administrator is panic: it takes time to locate the relevant logs, and DN logs are often very large.

I think we should give this configuration a more detailed description. If it is set to -1, the DN will never shut down because of disk failures, no matter how many occur. If it is set to a specific number N, the DN tolerates up to N failed disks and shuts down once more than N have failed. Ultimately, users should be able to choose based on their own environment.

The situation @sodonnel described is also reasonable. We currently run a special type of machine with 60 data disks. When a disk fails, we use hot repair, which means replacing the disk without shutting the machine down, because a shutdown would trigger a large amount of container replication, potentially involving tens of thousands of containers. We currently shut a machine down for repair only when its CPU, memory, or system disk fails.

Our current strategy is to configure the system to tolerate a single disk failure and to run routine inspections every day. Once we identify a machine with a failed disk, we repair it quickly. Therefore, I personally believe we only need to improve the description of this configuration so that it clearly explains the risks involved.

The above is my understanding as a user, and I would also like to hear more thoughts from other members of the community.
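The tolerance semantics proposed for this setting (-1 meaning the DN never shuts down because of disk failures, N meaning it shuts down once more than N disks have failed) can be sketched in Java, the language Ozone is written in. This is a minimal illustration under stated assumptions: `VolumeFailurePolicy` and `shouldShutdown` are hypothetical names, not Ozone's actual classes, methods, or configuration keys.

```java
/**
 * Hypothetical sketch of the proposed tolerance semantics. The class and
 * method names are illustrative only and are not taken from Ozone.
 */
final class VolumeFailurePolicy {

    /** -1 means "tolerate any number of failed data volumes". */
    static final int UNLIMITED = -1;

    private final int toleratedFailures;

    VolumeFailurePolicy(int toleratedFailures) {
        // Only -1 or a non-negative count is meaningful; reject anything else.
        if (toleratedFailures < UNLIMITED) {
            throw new IllegalArgumentException(
                "tolerated failures must be -1 or >= 0, got " + toleratedFailures);
        }
        this.toleratedFailures = toleratedFailures;
    }

    /** Returns true if the DN should shut down given its failed-volume count. */
    boolean shouldShutdown(int failedVolumes) {
        if (toleratedFailures == UNLIMITED) {
            return false; // never shut down because of disk failures
        }
        // Up to N failures are tolerated; the (N + 1)th triggers shutdown.
        return failedVolumes > toleratedFailures;
    }

    public static void main(String[] args) {
        // The setup described in the comment: tolerate a single failed disk.
        VolumeFailurePolicy one = new VolumeFailurePolicy(1);
        System.out.println(one.shouldShutdown(1)); // false
        System.out.println(one.shouldShutdown(2)); // true

        // -1: a 60-disk machine keeps running however many disks fail.
        VolumeFailurePolicy unlimited = new VolumeFailurePolicy(UNLIMITED);
        System.out.println(unlimited.shouldShutdown(59)); // false
    }
}
```

Rejecting values below -1, rather than silently treating them as "unlimited", is one possible design choice: it keeps a mistyped value from quietly disabling shutdown altogether.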
