errose28 commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2110024650


##########
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##########
@@ -0,0 +1,212 @@
+---
+title: Improved Storage Volume Handling for Ozone Datanodes
+summary: Proposal to add a degraded storage volume health state in datanodes.
+date: 2025-05-06
+jira: HDDS-8387
+status: draft
+author: Ethan Rose, Rishabh Patel
+---
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+# Improved Storage Volume Handling for Ozone Datanodes
+
+## Background
+
+Currently Ozone uses two health states for storage volumes: **healthy** and 
**failed**. A volume scanner runs on each datanode to determine whether a 
volume should be moved from a **healthy** to a **failed** state. Once a volume 
is failed, all container replicas on that volume are removed from tracking by 
the datanode and considered lost. Volumes cannot return to a healthy state 
after failure without a datanode restart.
+
+This model only works for hard failures in volumes, but in practice most 
volume failures are soft failures. Disk issues manifest in a variety of ways 
and minor problems usually appear before a drive fails completely. The current 
approach to volume scanning and health classification does not account for 
this. If a volume is starting to exhibit signs of failure, the datanode only 
has two options:
+- Fail the volume
+    - In many cases the volume may still be mostly or partially readable. 
Containers on this volume that were still readable would be removed by the 
system, reducing their redundancy unnecessarily. This is not a safe 
operation.
+- Keep the volume healthy
+    - Containers on this volume will not have extra copies made until the 
container scanner finds corruption and marks them unhealthy, after which we 
have already lost redundancy.
+
+For the common case of soft volume failures, neither of these is a good 
option. This document outlines a proposal to classify and handle soft volume 
failures in datanodes.

Review Comment:
   > So here is a situation: I hit a bad sector, and an IO error is reported, 
which triggers an on-demand scan: the value of X is incremented. Now, in the 
current behavior, RM replicates the good replicas from other sources 
immediately. So, full durability is restored by the system.
   > With the proposed model, I have compromised durability because until my 
window length of (x-y) is hit, my container has only 2 good copies elsewhere. 
   
   This would still happen in the proposed model. There are no proposed changes 
to the replication manager or container states in this document. I think there 
is also some confusion here between the on-demand container scanner and the 
on-demand volume scanner. The on-demand container scanner is triggered when a 
bad sector is read within a container; if that read fails, it marks the 
container unhealthy, triggering the normal replication process. There is no 
sliding window for the on-demand container scanner.
   
   What this doc proposes is that if the on-demand container scanner marks a 
container unhealthy, it should also trigger an on-demand volume scan. Each 
on-demand volume scan request would increment a counter that feeds into the 
degraded-state sliding window of that volume.
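   To make the mechanic concrete, the counter-in-a-sliding-window idea could be sketched roughly as below. This is a minimal illustration only: the class name `DegradedVolumeWindow`, the threshold, and the window length are hypothetical and not taken from the Ozone codebase or this design doc.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch of a degraded-state sliding window for a volume.
 * Each on-demand volume scan request adds one timestamped entry; the
 * volume is considered degraded once enough requests fall inside the
 * window. Names and parameters are illustrative, not from Ozone.
 */
public class DegradedVolumeWindow {
  private final Deque<Long> scanTimestamps = new ArrayDeque<>();
  private final int failureThreshold; // scan requests needed to mark degraded
  private final long windowMillis;    // length of the sliding window

  public DegradedVolumeWindow(int failureThreshold, long windowMillis) {
    this.failureThreshold = failureThreshold;
    this.windowMillis = windowMillis;
  }

  /** Record one on-demand volume scan request (e.g. after a container is marked unhealthy). */
  public synchronized void recordScanRequest(long nowMillis) {
    scanTimestamps.addLast(nowMillis);
    evictExpired(nowMillis);
  }

  /** True if enough scan requests landed inside the current window. */
  public synchronized boolean isDegraded(long nowMillis) {
    evictExpired(nowMillis);
    return scanTimestamps.size() >= failureThreshold;
  }

  private void evictExpired(long nowMillis) {
    // Drop entries older than the window so old incidents age out.
    while (!scanTimestamps.isEmpty()
        && nowMillis - scanTimestamps.peekFirst() > windowMillis) {
      scanTimestamps.removeFirst();
    }
  }
}
```

   With an assumed threshold of 3 requests per 1000 ms window, for example, the volume would only be treated as degraded once three container-scan-triggered volume scan requests land inside the same window, and it would age back out of that state as the entries expire.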
   
   > Instead, a more desirable situation is if X = 1, degraded volume has the 
last copy of the container, RM replicated from this as the source, rest of the 
behavior is left identical.
   
   If there is only one copy of a container, then it is already under-replicated 
and RM will copy from this volume as long as the volume is not failed. This doc 
does not propose any changes here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
