[jira] [Created] (HDDS-11943) Fail storage volume after numerous reported IO errors

Ethan Rose (Jira) Mon, 16 Dec 2024 15:15:17 -0800

Ethan Rose created HDDS-11943:
---------------------------------

             Summary: Fail storage volume after numerous reported IO errors
                 Key: HDDS-11943
                 URL: https://issues.apache.org/jira/browse/HDDS-11943
             Project: Apache Ozone
          Issue Type: Sub-task
            Reporter: Ethan Rose



Currently, on-demand volume scanning is triggered for IO errors encountered 
while the cluster is running, but the volume can only be failed by a 
configurable number of volume scan failures.

The volume scanner syncs a file to the disk and reads it back. This check 
alone does not catch some types of volume failures. For example, if older sectors of 
a disk that have already been written to are failing for reads, the container 
scanner will keep raising errors and marking containers unhealthy, but the 
corresponding volume scans will always write their file to new sectors that 
don't have errors.

To fix this, we can keep a counter of how many IO errors have been reported 
from on-demand scan requests for a volume. If that number crosses a 
configurable count, we can fail the volume even if volume scans are passing.
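
A minimal sketch of the proposed counter, to make the idea concrete. It is 
not Ozone's actual API: the class name VolumeIoErrorTracker, its methods, and 
the maxReportedErrors parameter are all hypothetical. It also assumes that 
reaching the configured count is what "crosses" means, and that the counter 
is never reset; the issue does not specify either.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical per-volume tracker of IO errors reported through on-demand
 * scan requests. Illustrative only; not part of Ozone.
 */
class VolumeIoErrorTracker {
    // Assumed configurable threshold; a real config key is not named in the issue.
    private final int maxReportedErrors;
    private final AtomicInteger reportedErrors = new AtomicInteger();
    private volatile boolean failed = false;

    VolumeIoErrorTracker(int maxReportedErrors) {
        if (maxReportedErrors <= 0) {
            throw new IllegalArgumentException("threshold must be positive");
        }
        this.maxReportedErrors = maxReportedErrors;
    }

    /**
     * Records one IO error that triggered an on-demand scan request.
     * Returns true once the volume should be failed, regardless of
     * whether the volume scans themselves are passing.
     */
    boolean reportIoError() {
        int count = reportedErrors.incrementAndGet();
        if (count >= maxReportedErrors) {
            failed = true;
        }
        return failed;
    }

    boolean isFailed() {
        return failed;
    }

    int getReportedErrors() {
        return reportedErrors.get();
    }
}
```

With a threshold of 3, the first two calls to reportIoError() return false 
and the third returns true, at which point the caller would fail the volume.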



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

