Re: [PR] HDDS-11770. Change default failed volume tolerated to 0 [ozone]

via GitHub Thu, 28 Nov 2024 06:31:32 -0800


sodonnel commented on PR #7499:
URL: https://github.com/apache/ozone/pull/7499#issuecomment-2506255215


   Shutting down a 20-disk DN because of a single disk failure is not a good 
default, in my view. The disruption caused by the immediate loss of a large 
node is significant, and will result in a lot of needless replication.
   
   Let's say the disk is a legitimate hardware failure and needs to be 
replaced. Many of these hosts have hot-swappable drives, so the node could 
probably run on for days or weeks until the admins decide to remedy all the 
failed drives in the cluster. Disk failures are expected to happen fairly 
regularly, so we should handle them much more gracefully.
   
   I'd probably suggest a default along the lines of "shut down the DN if 
50% - 75% of the configured drives fail", but there is also an argument for 
keeping it running with only a single disk left. The DNs should be monitored 
for disk failures, and those failures should be investigated. It should not 
take a DN shutdown to make that happen.
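
   To make the suggested default concrete, here is a minimal sketch of a 
fraction-based check. The class name `VolumeFailurePolicy`, the method 
`shouldShutDown`, and its parameters are hypothetical and used only for 
illustration; this is not Ozone's actual volume-failure handling or 
configuration.

```java
// Hypothetical sketch of a fraction-based shutdown rule.
// None of these names come from the Ozone code base.
final class VolumeFailurePolicy {

    private VolumeFailurePolicy() { }

    /**
     * Returns true when the number of failed volumes reaches the given
     * fraction of the configured volumes.
     *
     * @param configuredVolumes total volumes configured on the DN (assumed > 0)
     * @param failedVolumes     volumes currently marked as failed
     * @param maxFailedFraction assumed threshold in (0, 1], e.g. 0.5 or 0.75
     */
    static boolean shouldShutDown(int configuredVolumes,
                                  int failedVolumes,
                                  double maxFailedFraction) {
        if (configuredVolumes <= 0) {
            throw new IllegalArgumentException("configuredVolumes must be positive");
        }
        if (failedVolumes < 0 || failedVolumes > configuredVolumes) {
            throw new IllegalArgumentException("failedVolumes out of range");
        }
        if (maxFailedFraction <= 0.0 || maxFailedFraction > 1.0) {
            throw new IllegalArgumentException("maxFailedFraction must be in (0, 1]");
        }
        return failedVolumes >= maxFailedFraction * configuredVolumes;
    }

    public static void main(String[] args) {
        // One failed disk out of 20 with a 50% threshold: keep running.
        System.out.println(shouldShutDown(20, 1, 0.5));   // prints "false"
        // Ten failed disks out of 20 with a 50% threshold: shut down.
        System.out.println(shouldShutDown(20, 10, 0.5));  // prints "true"
    }
}
```

   With a threshold of 1.0 the same check only trips once every configured 
volume has failed, which is close to the "keep running with a single disk 
left" end of the argument.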




