slfan1989 commented on code in PR #8405:
URL: https://github.com/apache/ozone/pull/8405#discussion_r2114806961


##########
hadoop-hdds/docs/content/design/degraded-storage-volumes.md:
##########
@@ -0,0 +1,275 @@
+---
+title: Improved Storage Volume Handling for Ozone Datanodes
+summary: Proposal to add a degraded storage volume health state in datanodes.
+date: 2025-05-06
+jira: HDDS-8387
+status: draft
+author: Ethan Rose, Rishabh Patel
+---
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+# Improved Storage Volume Handling for Ozone Datanodes
+
+## Background
+
+Currently Ozone uses two health states for storage volumes: **healthy** and 
**failed**. A volume scanner runs on each datanode to determine whether a 
volume should be moved from a **healthy** to a **failed** state. Once a volume 
is failed, all container replicas on that volume are removed from tracking by 
the datanode and considered lost. Volumes cannot return to a healthy state 
after failure without a datanode restart.
+
+This model only works for hard failures in volumes, but in practice most 
volume failures are soft failures. Disk issues manifest in a variety of ways 
and minor problems usually appear before a drive fails completely. The current 
approach to volume scanning and health classification does not account for 
this. If a volume is starting to exhibit signs of failure, the datanode only 
has two options:
+- Fail the volume
+    - In many cases the volume may still be mostly or partially readable. 
Containers on this volume that are still readable would be removed by the 
system, unnecessarily reducing their redundancy. This is not a safe operation.
+- Keep the volume healthy
+    - Containers on this volume will not have extra copies made until the 
container scanner finds corruption and marks them unhealthy, after which we 
have already lost redundancy.
+
+For the common case of soft volume failures, neither of these is a good 
option. This document outlines a proposal to classify and handle soft volume 
failures in datanodes.
+
+## Proposal
+
+This document proposes adding a new volume state called **degraded**, which 
will correspond to partially failed volumes. Handling degraded volumes can be 
broken into two problems:
+- **Identification**: Detecting degraded volumes and alerting via metrics and 
reports to SCM and Recon
+- **Remediation**: Proactively making copies of data on degraded volumes and 
preventing new writes before the volume completely fails
+
+This document is primarily focused on identification, and proposes handling 
remediation with a volume decommissioning feature that can be implemented 
independently of volume health state. 
+
+### Tools to Identify Volume Health State
+
+Ozone has access to the following checks from the volume scanner to determine 
volume health. Most of these checks are already present.
+
+#### Directory Check
+
+This check verifies that a directory exists at the specified location for the 
volume, and that the datanode has read, write, and execute permissions on the 
directory.
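+
+For illustration, this check could look roughly like the sketch below. The 
class and method names are hypothetical and are not taken from the current 
datanode code.
+
+```java
+import java.io.File;
+
+/** Illustrative sketch only; not the actual Ozone directory check. */
+public final class DirectoryCheckSketch {
+
+  /** True if the volume root exists, is a directory, and is readable,
+   *  writable, and executable by the datanode process. */
+  public static boolean check(File volumeRoot) {
+    return volumeRoot.isDirectory()
+        && volumeRoot.canRead()
+        && volumeRoot.canWrite()
+        && volumeRoot.canExecute();
+  }
+}
+```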
+
+#### Database Check
+
+This check only applies to container data volumes (called `HddsVolumes` in the 
code). It checks that a new read handle can be acquired for the RocksDB 
instance on that volume, in addition to the write handle the process is 
currently holding. It does not use any RocksDB APIs that do individual SST file 
checksum validation, like paranoid checks. Corruption within individual SST 
files will only affect the keys in those files, and RocksDB verifies checksums 
for individual keys on each read. This isolates SST file checksum errors to a 
per-container level; they will be detected by the container scanner, which will 
mark the affected container unhealthy.
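+
+For illustration, acquiring such a read handle could look roughly like the 
sketch below, using the RocksDB Java API. The names are hypothetical, and 
column family handling is omitted for brevity (the real container DBs use 
column families).
+
+```java
+import org.rocksdb.Options;
+import org.rocksdb.RocksDB;
+import org.rocksdb.RocksDBException;
+
+/** Illustrative sketch only; not the actual Ozone database check. */
+public final class DatabaseCheckSketch {
+
+  static {
+    RocksDB.loadLibrary();
+  }
+
+  /** True if a new read-only handle can be opened for the volume's DB,
+   *  independent of the write handle the datanode already holds. */
+  public static boolean check(String dbPath) {
+    try (Options options = new Options().setCreateIfMissing(false);
+         RocksDB db = RocksDB.openReadOnly(options, dbPath)) {
+      return true;
+    } catch (RocksDBException e) {
+      // Failure to open a read handle counts against the volume's health.
+      return false;
+    }
+  }
+}
+```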
+
+#### File Check
+
+This check runs the following steps:
+1. Generates a fixed amount of data and keeps it in memory
+2. Writes the data to a file on the disk
+3. Syncs the file to the disk to touch the hardware
+4. Reads the file back to ensure the contents match what was in memory
+5. Deletes the file
+
+Of these, the file sync is the most important check, because it ensures that 
the disk is still reachable. This detects a dangerous condition where the disk 
is no longer present, but data remains readable and even writeable (if sync is 
not used) due to in-memory caching by the OS and file system. The cached data 
may cease to be reachable at any time, and should not be counted as valid 
replicas of the data.
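+
+For illustration, the check could be sketched roughly as follows; the class 
name, probe file name, and data size are placeholders, not the actual datanode 
implementation.
+
+```java
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.StandardOpenOption;
+import java.util.Arrays;
+import java.util.concurrent.ThreadLocalRandom;
+
+/** Illustrative sketch only; not the actual Ozone file check. */
+public final class FileCheckSketch {
+
+  public static boolean check(Path volumeDir) {
+    byte[] expected = new byte[64 * 1024];              // 1. data kept in memory
+    ThreadLocalRandom.current().nextBytes(expected);
+    Path probe = volumeDir.resolve("volume-check.tmp"); // hypothetical file name
+    try {
+      try (FileChannel ch = FileChannel.open(probe,
+          StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
+        ch.write(ByteBuffer.wrap(expected));            // 2. write to a file
+        ch.force(true);                                 // 3. sync to reach the hardware
+      }
+      byte[] actual = Files.readAllBytes(probe);        // 4. read back and compare
+      return Arrays.equals(expected, actual);
+    } catch (IOException e) {
+      return false;
+    } finally {
+      try {
+        Files.deleteIfExists(probe);                    // 5. delete the probe file
+      } catch (IOException ignored) {
+        // A failed delete is itself worth reporting as an IO error.
+      }
+    }
+  }
+}
+```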
+
+#### IO Error Count
+
+This would be a new check that can be used as part of this feature. Currently 
each time datanode IO encounters an error, we request an on-demand volume scan. 
This should include every time the container scanner marks a container 
unhealthy. We can keep a counter of how many IO errors have been reported on a 
volume over a given time frame, regardless of whether the corresponding volume 
scan passed or failed. This accounts for cases that show up on the main IO 
path but may otherwise not be detected by the volume scanner. For example, 
many sectors holding existing container data may be unreadable. The volume 
scanner's **File Check** only touches new disk sectors, so it will still pass 
with these errors present, but the container scanner may be hitting many bad 
sectors across containers, which this check will account for.
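+
+As a rough sketch, the hook on the IO path could look like the following. The 
names are hypothetical; evaluating the recorded errors against a time frame is 
covered by the sliding window sketch in the next section.
+
+```java
+/** Illustrative sketch of an IO-error hook; not the actual Ozone code. */
+public final class OnDemandScanTrigger {
+
+  private final Runnable recordDegradedError; // e.g. a sliding window's record method
+  private final Runnable requestVolumeScan;   // schedules an on-demand volume scan
+
+  public OnDemandScanTrigger(Runnable recordDegradedError,
+                             Runnable requestVolumeScan) {
+    this.recordDegradedError = recordDegradedError;
+    this.requestVolumeScan = requestVolumeScan;
+  }
+
+  /** Called from IO paths, including when the container scanner marks a
+   *  container unhealthy. The error is recorded regardless of whether the
+   *  resulting volume scan passes or fails. */
+  public void onIoError() {
+    recordDegradedError.run();
+    requestVolumeScan.run();
+  }
+}
+```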
+
+#### Sliding Window
+
+Most checks will encounter intermittent issues, even on overall healthy 
drives, so we should not downgrade volume health state after just one error. 
The current volume scanner uses a counter-based sliding window for intermittent 
failures, meaning the volume will be failed if `x` out of the last `y` checks 
failed, regardless of when they occurred. This approach works for background 
volume scans, because `y` is the number of times the check ran, and `x` is the 
number of times it failed. It does not work if we want to apply a sliding 
window to on-demand checks like the IO error count, which do not care whether 
the corresponding volume scan passed or failed.
+
+To handle this, we can switch to time-based sliding windows to determine when 
a threshold of tolerable errors is crossed. For example, if a check has failed 
`x` times in the last `y` minutes, we should consider the volume degraded.
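+
+A minimal sketch of such a window is shown below; the class name and the use 
of wall-clock timestamps are illustrative, not a prescribed implementation.
+
+```java
+import java.time.Duration;
+import java.time.Instant;
+import java.util.ArrayDeque;
+import java.util.Deque;
+
+/** Illustrative time-based sliding window; not the actual Ozone code. */
+public final class TimeBasedSlidingWindow {
+
+  private final int maxFailures;   // "x" in the text above
+  private final Duration window;   // "y" minutes in the text above
+  private final Deque<Instant> failures = new ArrayDeque<>();
+
+  public TimeBasedSlidingWindow(int maxFailures, Duration window) {
+    this.maxFailures = maxFailures;
+    this.window = window;
+  }
+
+  /** Record one failed check at the current time. */
+  public synchronized void recordFailure() {
+    failures.addLast(Instant.now());
+    prune();
+  }
+
+  /** True once the tolerable error threshold has been crossed. */
+  public synchronized boolean isThresholdExceeded() {
+    prune();
+    return failures.size() >= maxFailures;
+  }
+
+  /** Drop entries older than the trailing window. */
+  private void prune() {
+    Instant cutoff = Instant.now().minus(window);
+    while (!failures.isEmpty() && failures.peekFirst().isBefore(cutoff)) {
+      failures.removeFirst();
+    }
+  }
+}
+```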
+
+We can use one time-based sliding window to track errors that would cause a 
volume to be degraded, and a second one for errors that would cause a volume to 
be failed. When a check fails, it adds the result to whichever sliding window 
it corresponds to. We can create the following assignments of checks, as 
sketched after this list:
+
+- **Directory Check**: No sliding window required. If the volume is not 
present based on filesystem metadata it should be failed immediately.
+- **Database Check**: On failure, add an entry to the **failed health sliding 
window**
+- **File Check**:
+    - If the sync portion of the check fails, add an entry to the **failed 
health sliding window**
+    - If any other part of this check fails, add an entry to the **degraded 
health sliding window**
+- **IO Error Count**: When an on-demand volume scan is requested, add an entry 
to the **degraded health sliding window**
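+
+Building on the sliding window sketch above, the routing could look roughly 
like the following. All names are hypothetical and the thresholds are 
placeholders, not proposed values.
+
+```java
+import java.time.Duration;
+
+/** Illustrative routing of check failures to the two windows. */
+public final class VolumeHealthTracker {
+
+  public enum VolumeHealth { HEALTHY, DEGRADED, FAILED }
+
+  private final TimeBasedSlidingWindow degradedWindow =
+      new TimeBasedSlidingWindow(3, Duration.ofMinutes(10));  // placeholder values
+  private final TimeBasedSlidingWindow failedWindow =
+      new TimeBasedSlidingWindow(3, Duration.ofMinutes(10));  // placeholder values
+
+  private volatile boolean failed = false;  // failed volumes never return to healthy
+
+  public void onDirectoryCheckFailed()  { failed = true; }  // no window: fail immediately
+  public void onDatabaseCheckFailed()   { failedWindow.recordFailure(); }
+  public void onFileSyncFailed()        { failedWindow.recordFailure(); }
+  public void onOtherFileCheckFailed()  { degradedWindow.recordFailure(); }
+  public void onOnDemandScanRequested() { degradedWindow.recordFailure(); }
+
+  public VolumeHealth currentHealth() {
+    if (failed || failedWindow.isThresholdExceeded()) {
+      failed = true;  // latch: only degraded may move back to healthy
+      return VolumeHealth.FAILED;
+    }
+    if (degradedWindow.isThresholdExceeded()) {
+      return VolumeHealth.DEGRADED;
+    }
+    return VolumeHealth.HEALTHY;
+  }
+}
+```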

Review Comment:
   Thank you very much for the detailed response — it gave me a much clearer 
understanding of the design logic behind this feature. Overall, it seems that 
your considerations are already quite thorough, and I'm looking forward to 
seeing it implemented.
   
   I'd also like to add one more thought from my side:
   
   >> These disks often exhibit intermittent failures—functioning normally 
during certain time windows and frequently failing during others. Should we 
consider further optimizing the configuration of the sliding window mechanism 
to avoid repeatedly triggering the degraded state due to error fluctuations 
that have not yet escalated, thereby preventing unnecessary data replication or 
alerts?
   
   > The sliding window of the degraded state is intended to deal with this 
exact situation: intermittent errors that are not enough to escalate to full 
volume failure. Degraded is the only state that can move back to healthy, so 
there would be no fluctuation of volumes from failed to healthy triggering 
re-replication, only possible fluctuation from degraded to healthy. In this 
case it just provides more monitoring options. The current system provides no 
optics into intermittent volume errors, so it is as if all of these types of 
alerts are ignored. If the concern is with spurious alerts, then alerting can 
be ignored for degraded volume metrics, which puts it on par with the current 
system. The sliding windows can also be tuned to adjust how sensitive the disk 
is to health state changes.
   
   Regarding point 1 on optimizing the sliding window mechanism — I fully agree 
with your explanation, and it's clear that the current design addresses 
intermittent errors.
   
   However, I do have a follow-up question: how exactly does a volume 
transition from degraded to failed? Is there a clearly defined threshold or set 
of criteria for this transition? 
   
   >> In addition, I have another concern: I believe that certain 
administrative operations themselves can contribute to performance degradation 
in Datanodes. Tasks such as disk scanning and data recovery introduce 
additional I/O overhead, especially when the disk is already under stress.
   
   > This is a good point. Right now such situations may cause the volume to be 
marked as degraded for alerting purposes, but should not fail the volume. 
Container import/export and container scanner can have their bandwidth 
throttled with configs if those operations themselves are burdening the node to 
the point where it is unhealthy.
   
   Regarding point 2, your explanation has largely addressed my concerns. 
However, I’m wondering if we could take it a step further by supporting dynamic 
configuration of bandwidth limits for these operations. In real-world 
scenarios, we’ve observed cases where disk scanning introduced I/O pressure 
that affected normal read/write performance. Allowing bandwidth limits to be 
adjusted at runtime based on node load could help better balance stability and 
performance.
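
   To make the idea concrete, a runtime-adjustable throttle could look roughly 
like the sketch below. This is purely illustrative, using Guava's RateLimiter 
as a stand-in; it is not how Ozone's scanner bandwidth configs are currently 
wired, and the names are hypothetical.

   ```java
   import com.google.common.util.concurrent.RateLimiter;
   
   /** Illustrative only: a scanner throttle adjustable at runtime. */
   public final class AdjustableScanThrottle {
   
     private final RateLimiter bytesPerSecond;
   
     public AdjustableScanThrottle(long initialBytesPerSecond) {
       this.bytesPerSecond = RateLimiter.create(initialBytesPerSecond);
     }
   
     /** Could be driven by a reconfiguration command when the node is under load. */
     public void setBytesPerSecond(long newLimit) {
       bytesPerSecond.setRate(newLimit);
     }
   
     /** Called before each chunk the scanner reads or imports. */
     public void acquire(int numBytes) {
       bytesPerSecond.acquire(numBytes);
     }
   }
   ```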
   
   >> What is the current time interval configured for the sliding window? If 
the interval is too short, it may lead to frequent state changes due to 
temporary fluctuations. If it's too long, it might delay fault detection and 
cause us to miss the optimal window for intervention.
   
   > Yes, I will add a proposal for specific values in the document, although it 
will be tricky to pick a "best" value. I'm still working on this area and will 
update the doc soon.
   
   Regarding point 3, I fully understand that it’s difficult to define a value, 
as disk usage patterns and environments can vary significantly across 
deployments. That said, I believe it would be helpful to include a clear 
explanation in the documentation. Speaking from personal experience, when I 
come across a critical configuration parameter, I really appreciate seeing a 
detailed description — for example, how increasing or decreasing the value 
would affect system behavior. This kind of guidance makes it much easier for 
users to understand the design rationale and make informed tuning decisions.
   
   >> Would it be possible to introduce a pre-warning mechanism that can 
proactively detect potential disk degradation based on performance trends, 
before actual failure thresholds are reached? For example, if a disk's 
read/write latency or throughput is significantly worse than other disks on the 
same node, could the system flag it as "performance abnormal" or "under 
observation" and trigger an alert?
   
   > This would be a good detection mechanism, but I'm not sure it needs to be 
handled within Ozone. Ozone can and should report issues it sees while 
operating, but IO wait can be detected by other systems like smartctl, iostat, 
and prometheus node exporter. We don't need to re-invent the wheel within Ozone 
when we have these dedicated tools available.
   
   Regarding point 4, I think you raised a great point, and I generally agree 
with your approach. However, I’d like to offer an additional perspective.
   
   While there are indeed many external tools available for monitoring I/O 
performance, relying entirely on them can lead to a fragmented view of system 
health. Monitoring data becomes scattered across multiple sources, and I 
personally believe that it would be more effective if Ozone could provide some 
built-in, conclusive metrics to help assess disk health directly — rather than 
requiring SREs to piece together information from various systems to make a 
judgment.
   
   I’ve experienced this challenge firsthand. When users report performance 
issues in Ozone — especially in scenarios where performance is critical — I 
often find myself digging through different metrics and dashboards to locate 
the root cause. This process is time-consuming and mentally taxing. If Ozone 
could consolidate key signals and present them in a unified way, it would 
significantly improve troubleshooting efficiency and reliability.
   
   Take I/O performance as an example — we can retrieve read/write latency or 
throughput data simply by reading certain system files. This doesn’t require 
much effort or any complex tooling. In fact, I’ve already made some progress on 
this in #7273, where I exposed some of these metrics directly through Ozone’s 
built-in metrics system. This kind of integration is much more intuitive, 
centralized, and operationally helpful.
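
   As a rough illustration of how lightweight this can be, the sketch below 
reads per-device counters from /proc/diskstats (Linux only; field positions 
follow the kernel's iostats documentation). It is illustrative only and is not 
a description of what #7273 actually implements.

   ```java
   import java.io.IOException;
   import java.nio.file.Files;
   import java.nio.file.Paths;
   import java.util.List;
   
   /** Illustrative /proc/diskstats reader; not taken from #7273. */
   public final class DiskStatsReader {
   
     /** Prints cumulative read/write counters for the given block device. */
     public static void printStats(String deviceName) throws IOException {
       List<String> lines = Files.readAllLines(Paths.get("/proc/diskstats"));
       for (String line : lines) {
         String[] f = line.trim().split("\\s+");
         if (f.length > 12 && f[2].equals(deviceName)) {
           // f[3]=reads, f[6]=ms reading, f[7]=writes, f[10]=ms writing, f[12]=ms doing IO
           System.out.printf("%s reads=%s read_ms=%s writes=%s write_ms=%s io_ms=%s%n",
               deviceName, f[3], f[6], f[7], f[10], f[12]);
         }
       }
     }
     // Deltas between successive samples give latency and throughput trends
     // that could be surfaced as datanode metrics.
   }
   ```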
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

