Ethan Rose created HDDS-11943:
---------------------------------
Summary: Fail storage volume after numerous reported IO errors
Key: HDDS-11943
URL: https://issues.apache.org/jira/browse/HDDS-11943
Project: Apache Ozone
Issue Type: Sub-task
Reporter: Ethan Rose
Currently, on-demand volume scanning is triggered for IO errors encountered
while the cluster is running, but the volume can only be failed by a
configurable number of volume scan failures.
The volume scanner syncs a file to the disk and reads it back. This check
alone does not catch some types of volume failures. For example, if older sectors of
a disk that have already been written to are failing for reads, the container
scanner will keep raising errors and marking containers unhealthy, but the
corresponding volume scans will always write their file to new sectors that
don't have errors.
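
For illustration, the sync-and-read-back style of check described above could
look roughly like the sketch below. This is not the actual Ozone volume
scanner code; the class name VolumeSyncReadCheck, the method check, and the
probe file naming are all hypothetical. It shows why such a check only ever
exercises freshly written data: the probe file is new on every run.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.UUID;

// Hypothetical sketch, not an existing Ozone class.
final class VolumeSyncReadCheck {
  private VolumeSyncReadCheck() { }

  /**
   * Writes a new probe file under volumeDir, syncs it, reads it back and
   * compares the bytes. Returns false on any IO error or mismatch.
   */
  static boolean check(Path volumeDir) {
    // A fresh file name each run, so the check never touches old data.
    Path probe = volumeDir.resolve("probe-" + UUID.randomUUID() + ".tmp");
    byte[] expected =
        UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8);
    try {
      try (FileChannel ch = FileChannel.open(probe,
          StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
        ByteBuffer buf = ByteBuffer.wrap(expected);
        while (buf.hasRemaining()) {
          ch.write(buf);
        }
        ch.force(true); // ask the OS to flush data and metadata to the device
      }
      return Arrays.equals(expected, Files.readAllBytes(probe));
    } catch (IOException e) {
      return false;
    } finally {
      try {
        Files.deleteIfExists(probe);
      } catch (IOException ignored) {
        // Best-effort cleanup only.
      }
    }
  }
}
```

Because the probe file is created anew each time, a check like this says
nothing about whether data written long ago is still readable.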
To fix this, we can keep a counter of how many IO errors have been reported
from on-demand scan requests for a volume. If that number crosses a
configurable count, we can fail the volume even if volume scans are passing.
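
A minimal sketch of the proposed counter follows, assuming one tracker
instance per volume. The class name VolumeIoErrorTracker, the method
reportIoError, and the maxReportedErrors parameter are hypothetical and do
not correspond to existing Ozone classes or configuration keys. "Crosses" is
interpreted here as strictly exceeding the configured count.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not an existing Ozone class.
final class VolumeIoErrorTracker {
  // Hypothetical configurable threshold for reported IO errors.
  private final int maxReportedErrors;
  private final AtomicInteger reportedErrors = new AtomicInteger();

  VolumeIoErrorTracker(int maxReportedErrors) {
    if (maxReportedErrors < 1) {
      throw new IllegalArgumentException("maxReportedErrors must be >= 1");
    }
    this.maxReportedErrors = maxReportedErrors;
  }

  /**
   * Records one IO error reported through an on-demand scan request.
   * Returns true once the count exceeds the threshold, meaning the
   * volume should be failed regardless of volume scan results.
   */
  boolean reportIoError() {
    return reportedErrors.incrementAndGet() > maxReportedErrors;
  }

  int getReportedErrors() {
    return reportedErrors.get();
  }
}
```

The caller that requests an on-demand scan would also call reportIoError()
and fail the volume when it returns true. Whether the count should ever be
reset, or limited to a time window, is a design choice this sketch leaves
open.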