[ 
https://issues.apache.org/jira/browse/HDDS-14871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devesh Kumar Singh updated HDDS-14871:
--------------------------------------
    Description: 
## Problem

`StorageVolumeChecker.checkAllVolumes()` waits on a single `CountDownLatch` for 
all volume health checks to complete. If the latch expires while any check is 
still pending (due to any transient stall), **every pending volume is 
immediately marked FAILED** with zero tolerance, producing false-positive 
volume failures.

The existing per-volume IO-failure sliding window in `StorageVolume.check()` 
does not address this because it only applies when a check **completes**, not 
when the latch times out.
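
The failure mode can be reproduced in isolation. The sketch below is an illustration only: `LatchTimeoutDemo`, `pendingAfterTimeout`, and the string volume names are hypothetical and are not taken from `StorageVolumeChecker`. It shows how one stalled check leaves a volume pending when a shared latch expires, which is the set that would be marked FAILED under the behavior described above.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Illustration only; hypothetical names, not the actual StorageVolumeChecker code. */
class LatchTimeoutDemo {

  /** Returns the volumes that had not reported a result when the shared latch expired. */
  static Set<String> pendingAfterTimeout(Set<String> volumes, Set<String> stalled, long timeoutMs) {
    CountDownLatch latch = new CountDownLatch(volumes.size());
    CountDownLatch release = new CountDownLatch(1); // holds the simulated stall
    Set<String> reported = ConcurrentHashMap.newKeySet();

    for (String volume : volumes) {
      Thread worker = new Thread(() -> {
        try {
          if (stalled.contains(volume)) {
            release.await(); // simulate a check that is stuck on a transient stall
          }
          reported.add(volume);
          latch.countDown();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
      worker.setDaemon(true);
      worker.start();
    }

    try {
      // One latch for all volumes: a single slow check makes the whole wait time out.
      latch.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }

    // Everything that has not reported by now is "pending".
    Set<String> pending = new HashSet<>(volumes);
    pending.removeAll(reported);
    release.countDown(); // let the stalled workers finish
    return pending;
  }
}
```

In this sketch the stalled volume is healthy; it is merely slow, yet it ends up in the pending set.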

## Solution

Add a per-volume consecutive latch-timeout counter (`consecutiveTimeoutCount`) 
to `StorageVolume`. When the `checkAllVolumes()` latch expires and a volume has 
not yet reported a result, its counter is incremented. The volume is added to 
the failed set only if `consecutiveTimeoutCount` exceeds 
`hdds.datanode.disk.check.timeout.tolerated`. A successful check resets the 
counter to 0.

Volumes that explicitly return `FAILED` from `check()` (genuine IO failures, 
missing directory, bad permissions) are unaffected and continue to fail 
immediately.
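
The counter logic can be sketched as follows. This is a minimal stand-alone sketch, not the patch itself: the class `TimeoutTolerance`, its method names, and the use of a map keyed by volume name are assumptions made for illustration (the proposal keeps the counter on `StorageVolume`).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the per-volume counter; not the actual Ozone code. */
class TimeoutTolerance {
  // Assumed to hold the value of hdds.datanode.disk.check.timeout.tolerated.
  private final int tolerated;
  private final Map<String, Integer> consecutiveTimeoutCount = new HashMap<>();

  TimeoutTolerance(int tolerated) {
    this.tolerated = tolerated;
  }

  /** A check that completes successfully resets the volume's counter to 0. */
  void onCheckSucceeded(String volume) {
    consecutiveTimeoutCount.put(volume, 0);
  }

  /**
   * Called when the latch expires. Increments the counter of every pending
   * volume and returns only those whose count now exceeds the tolerated value.
   */
  Set<String> onLatchTimeout(Set<String> pendingVolumes) {
    Set<String> failed = new HashSet<>();
    for (String volume : pendingVolumes) {
      int count = consecutiveTimeoutCount.merge(volume, 1, Integer::sum);
      if (count > tolerated) {
        failed.add(volume);
      }
    }
    return failed;
  }
}
```

With a tolerated value of `1`, the first timeout for a volume is absorbed and the second consecutive one fails it. With `0`, the first timeout fails the volume, matching the zero-tolerance behavior described under Problem.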

  was:
## Problem

`StorageVolumeChecker.checkAllVolumes()` waits on a single `CountDownLatch` for 
all volume health checks to complete. If the latch expires while any check is 
still pending (due to any transient stall), **every pending volume is 
immediately marked FAILED** with zero tolerance, producing false-positive 
volume failures.

The existing per-volume IO-failure sliding window in `StorageVolume.check()` 
does not address this because it only applies when a check **completes**, not 
when the latch times out.

## Solution

Add a per-volume consecutive latch-timeout counter (`consecutiveTimeoutCount`) 
to `StorageVolume`. When the `checkAllVolumes()` latch expires and a volume has 
not yet reported a result, its counter is incremented. The volume is added to 
the failed set only if `consecutiveTimeoutCount` exceeds 
`hdds.datanode.disk.check.timeout.tolerated`. A successful check resets the 
counter to 0.

Volumes that explicitly return `FAILED` from `check()` (genuine IO failures, 
missing directory, bad permissions) are unaffected and continue to fail 
immediately.

## New Configuration

| Key | Default | Description |
|-----|---------|-------------|
| `hdds.datanode.disk.check.timeout.tolerated` | `1` | Number of consecutive latch timeouts allowed per volume before marking it FAILED. Set to `0` to restore current behavior. |


> DataNode: tolerate per-volume health-check latch timeouts before marking 
> volumes failed
> ---------------------------------------------------------------------------------------
>
>                 Key: HDDS-14871
>                 URL: https://issues.apache.org/jira/browse/HDDS-14871
>             Project: Apache Ozone
>          Issue Type: Task
>          Components: Ozone Datanode
>            Reporter: Devesh Kumar Singh
>            Assignee: Devesh Kumar Singh
>            Priority: Major


