[ https://issues.apache.org/jira/browse/HDDS-14871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devesh Kumar Singh updated HDDS-14871:
--------------------------------------
    Description: 
## Problem

`StorageVolumeChecker.checkAllVolumes()` waits on a single `CountDownLatch` for 
all volume health checks to complete. If the latch expires before every volume 
has finished, for example due to a transient stall, **every still-pending 
volume is immediately marked FAILED** with zero tolerance, producing 
false-positive volume failures.

The existing per-volume IO-failure sliding window in `StorageVolume.check()` 
does not address this because it only applies when a check **completes**, not 
when the latch times out.
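The problematic pattern can be reduced to a small sketch. This is an illustrative reconstruction, not the actual `StorageVolumeChecker` code; the class and method names here are hypothetical:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical reduction of the current behavior: one latch guards all
// volume checks, and on timeout every volume that has not counted down
// is failed in a single shot.
class AllOrNothingCheck {
  static Set<String> failedOnTimeout(List<String> volumes,
      Set<String> reported, long timeoutMs) throws InterruptedException {
    CountDownLatch latch = new CountDownLatch(volumes.size());
    // Volumes whose checks already completed count down; the rest stall.
    for (String v : volumes) {
      if (reported.contains(v)) {
        latch.countDown();
      }
    }
    Set<String> failed = new HashSet<>();
    if (!latch.await(timeoutMs, TimeUnit.MILLISECONDS)) {
      // Zero tolerance: every volume that has not reported is failed,
      // regardless of whether the stall was transient.
      for (String v : volumes) {
        if (!reported.contains(v)) {
          failed.add(v);
        }
      }
    }
    return failed;
  }
}
```

A single slow check is indistinguishable here from a genuinely dead disk, which is exactly the false-positive this issue targets.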

## Solution

Add a per-volume consecutive latch-timeout counter (`consecutiveTimeoutCount`) 
to `StorageVolume`. When the `checkAllVolumes()` latch expires and a volume has 
not yet reported a result, its counter is incremented. The volume is added to 
the failed set only if `count > hdds.datanode.disk.check.timeout.tolerated`. A 
successful check resets the counter to 0.

Volumes that explicitly return `FAILED` from `check()` (genuine IO failures, 
missing directory, bad permissions) are unaffected and continue to fail 
immediately.
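The tolerance logic described above can be sketched as follows. The class and method names (`Volume`, `recordTimeout`, `resetTimeouts`, `volumesToFail`) are illustrative assumptions, not the actual Ozone API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical per-volume counter mirroring the proposed
// consecutiveTimeoutCount on StorageVolume.
class Volume {
  final String name;
  private final AtomicInteger consecutiveTimeoutCount = new AtomicInteger();

  Volume(String name) { this.name = name; }

  // Called when the global latch expired before this volume reported.
  // Returns the new consecutive-timeout count.
  int recordTimeout() { return consecutiveTimeoutCount.incrementAndGet(); }

  // Called when a check completes successfully in a finished round.
  void resetTimeouts() { consecutiveTimeoutCount.set(0); }
}

class TimeoutTolerance {
  // tolerated mirrors hdds.datanode.disk.check.timeout.tolerated (default 1).
  static List<Volume> volumesToFail(List<Volume> pending, int tolerated) {
    List<Volume> failed = new ArrayList<>();
    for (Volume v : pending) {
      // Fail only when the count *exceeds* the tolerated number of rounds,
      // so the default of 1 tolerates the first timeout and fails on the second.
      if (v.recordTimeout() > tolerated) {
        failed.add(v);
      }
    }
    return failed;
  }
}
```

With `tolerated = 0` this degenerates to the current fail-on-first-timeout behavior, which is why `0` restores today's semantics.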

## New Configuration

| Key | Default | Description |
|-----|---------|-------------|
| `hdds.datanode.disk.check.timeout.tolerated` | `1` | Number of consecutive latch timeouts allowed per volume before marking it FAILED. Set to `0` to restore current behavior. |
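Assuming the key is read through the usual Hadoop-style XML configuration (e.g. `ozone-site.xml`), setting it would look like:

```xml
<!-- Tolerate one latch-timeout round per volume before failing it
     (1 is the proposed default; 0 restores current behavior). -->
<property>
  <name>hdds.datanode.disk.check.timeout.tolerated</name>
  <value>1</value>
</property>
```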

  was:
h2. Summary

When {{StorageVolumeChecker.checkAllVolumes()}} hits the global latch timeout 
({{hdds.datanode.disk.check.timeout}}), the implementation currently treats 
*every* volume that has not yet reported as FAILED in one shot. This JIRA 
implements per-volume *consecutive* timeout tolerance so the first timeout 
round can be tolerated and the volume is only failed after the tolerance is 
exceeded.
h2. Proposed solution
 * Add config: {{hdds.datanode.disk.check.timeout.tolerated}} (default: 1).
   *Meaning:* allow up to *N* consecutive timeout rounds per volume before 
marking it failed; fail once timeouts exceed the tolerance (e.g. default 1 → 
fail on the second consecutive timeout for that volume).
 * Per {{StorageVolume}} (or appropriate type): {{timeoutCount}} plus 
{{recordCheckTimeout()}} / {{resetTimeoutCount()}} (reset when a volume 
completes a healthy check in a finished round).
 * {{StorageVolumeChecker.checkAllVolumes()}}: on latch timeout, for pending 
volumes, only add to the returned failed set if {{recordCheckTimeout()}} 
indicates the tolerance is exceeded; always include volumes already in the 
explicit {{failedVolumes}} set from {{VolumeCheckResult.FAILED}}.

 


> DataNode: tolerate per-volume health-check latch timeouts before marking 
> volumes failed
> ---------------------------------------------------------------------------------------
>
>                 Key: HDDS-14871
>                 URL: https://issues.apache.org/jira/browse/HDDS-14871
>             Project: Apache Ozone
>          Issue Type: Task
>          Components: Ozone Datanode
>            Reporter: Devesh Kumar Singh
>            Assignee: Devesh Kumar Singh
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
