sodonnel commented on PR #7499: URL: https://github.com/apache/ozone/pull/7499#issuecomment-2506255215
Shutting down a 20-disk DN because of a single disk failure is not a good default in my view. The immediate loss of a large node is very disruptive and will trigger a lot of needless replication. Let's say the disk is a legitimate hardware failure and needs to be replaced. Many of these hosts have hot-swappable drives, so the node could probably keep running for days or weeks until the admins decide to remedy all the failed drives in the cluster. Disk failures are expected to happen fairly regularly, so we should handle them much more gracefully. I'd probably suggest a default along the lines of "shut down the DN if 50% - 75% of the configured drives fail", but there is also an argument for keeping it running with only a single disk left. The DNs should be monitored for disk failures, and those failures should be investigated. It should not take a DN shutdown to make that happen.
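
To make the suggested default concrete, here is a minimal sketch of a fraction-based shutdown check. It is only an illustration of the policy described above, not Ozone's implementation: the function name `should_shut_down` and the parameter `max_failed_fraction` are hypothetical, and the 0.5 default is simply the lower end of the 50% - 75% range mentioned in the comment.

```python
def should_shut_down(configured_volumes: int,
                     failed_volumes: int,
                     max_failed_fraction: float = 0.5) -> bool:
    """Return True when enough volumes have failed to justify stopping the DN.

    Hypothetical helper: the names and the default threshold are assumptions
    made for illustration, not part of any real datanode API.
    """
    if configured_volumes <= 0:
        raise ValueError("configured_volumes must be positive")
    if not 0 <= failed_volumes <= configured_volumes:
        raise ValueError("failed_volumes must be between 0 and configured_volumes")
    if not 0 < max_failed_fraction <= 1:
        raise ValueError("max_failed_fraction must be in (0, 1]")

    # Shut down only once the failed share reaches the configured fraction.
    return failed_volumes / configured_volumes >= max_failed_fraction


# One failed disk out of 20 keeps the node running.
one_failure = should_shut_down(20, 1)
# Half of the disks failed: the 50% threshold is reached.
half_failed = should_shut_down(20, 10)
# A fraction of 1.0 keeps the node up even with a single healthy disk left.
single_disk_left = should_shut_down(20, 19, max_failed_fraction=1.0)
```

Under these assumptions, a threshold of 1.0 corresponds to the "keep running with only a single disk left" option, while 0.5 or 0.75 corresponds to the percentage-based default.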
