slfan1989 commented on PR #7499:
URL: https://github.com/apache/ozone/pull/7499#issuecomment-2513491086

   Everyone's points are reasonable, and I understand their perspectives. 
However, I agree more with @errose28's viewpoint. I believe setting this value 
to -1 is the better choice, as users should not face the risk of DN crashes.
   
   Our largest cluster has 1,500 machines, and some of them experience disk 
failures every day. If a DN crashes outright, my first reaction as an 
administrator is panic. It takes time to locate the relevant logs, and the DN 
logs are often quite large.
   
   I think we should provide a more detailed description for this 
configuration: if set to -1, the DN will never crash under any circumstances. 
If set to a specific number, it indicates the number of disk failures that can 
be tolerated before the DN crashes. Ultimately, users should be allowed to 
choose based on their specific environment.
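
   To make the proposed description concrete, here is a minimal sketch of the 
semantics as I understand them. The class and method names 
(`FailedVolumePolicy`, `shouldStop`) are hypothetical and only for 
illustration; this is not the actual Ozone code or configuration key, and it 
assumes "tolerated" means the DN stops only once the failure count exceeds the 
configured value.

```java
/**
 * Illustrative sketch only: hypothetical names, not Ozone's implementation.
 * Models the semantics described in this comment for a
 * "failed volumes tolerated" setting.
 */
final class FailedVolumePolicy {

    /** Sentinel meaning "never stop the DN because of disk failures". */
    static final int UNLIMITED = -1;

    private FailedVolumePolicy() {
    }

    /**
     * @param failedVolumes number of data volumes currently failed
     * @param tolerated     configured value: -1 for unlimited, otherwise the
     *                      number of failures the DN tolerates
     * @return true if the DN should stop under the assumed semantics
     */
    static boolean shouldStop(int failedVolumes, int tolerated) {
        if (tolerated < UNLIMITED) {
            throw new IllegalArgumentException(
                "tolerated must be -1 or a non-negative number: " + tolerated);
        }
        if (failedVolumes < 0) {
            throw new IllegalArgumentException(
                "failedVolumes must be non-negative: " + failedVolumes);
        }
        if (tolerated == UNLIMITED) {
            // -1: disk failures alone never stop the DN.
            return false;
        }
        // Otherwise the DN stops once failures exceed the tolerated count.
        return failedVolumes > tolerated;
    }

    public static void main(String[] args) {
        // Tolerating a single disk failure: one failure keeps the DN up,
        // a second failure stops it.
        System.out.println(shouldStop(1, 1));
        System.out.println(shouldStop(2, 1));
        // With -1, even many failed disks do not stop the DN.
        System.out.println(shouldStop(10, UNLIMITED));
    }
}
```

   With a value of 1, a single failed disk leaves the DN running, which 
matches the "tolerate one failure, then repair quickly" approach; with -1 the 
decision to stop is left entirely to the administrator.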
   
   Additionally, the situation mentioned by @sodonnel is also reasonable. We 
currently have a special type of machine with 60 data disks. When a disk on 
such a machine fails, we opt for a hot repair, which means replacing the disk 
without shutting the machine down. Shutting down would trigger a large amount 
of container replication, potentially involving tens of thousands of 
containers. Currently, we only shut a machine down for repairs in the case of 
CPU failure, memory failure, or system disk failure.
   
   Our current strategy is to configure the system to tolerate a single disk 
failure and to perform routine inspections daily. Once a machine with a disk 
failure is identified, we repair it quickly. Therefore, I personally believe 
that we only need to improve the description of this configuration so that it 
clearly states the potential risks involved.
   
   The above is my understanding as a user, and I would also like to hear more 
thoughts from other members of the community.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


