[ 
https://issues.apache.org/jira/browse/HDDS-14871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devesh Kumar Singh updated HDDS-14871:
--------------------------------------
    Description: 
h2. Summary

When {{StorageVolumeChecker.checkAllVolumes()}} hits the global latch timeout 
({{hdds.datanode.disk.check.timeout}}), the implementation currently treats 
*every* volume that has not yet reported as FAILED in one shot. This JIRA 
implements per-volume *consecutive* timeout tolerance so that the first 
timeout round can be tolerated and a volume is failed only after its 
tolerance is exceeded.
h2. Proposed solution
 * Add config: {{hdds.datanode.disk.check.timeout.tolerated}} (default: 1).
   *Meaning:* allow up to *N* consecutive timeout rounds per volume before 
marking it failed; fail when consecutive timeouts exceed the tolerance 
(e.g. default 1 → fail on the second consecutive timeout for that volume).
 * Per {{StorageVolume}} (or appropriate type): {{timeoutCount}} + 
{{recordCheckTimeout()}} / {{resetTimeoutCount()}} (reset when a volume 
completes a healthy check in a finished round).
 * {{StorageVolumeChecker.checkAllVolumes()}}: on latch timeout, only add a 
pending volume to the returned failed set if {{recordCheckTimeout()}} 
indicates its tolerance is exceeded; always include volumes already in the 
explicit {{failedVolumes}} set from {{VolumeCheckResult.FAILED}}.
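A minimal sketch of the proposed per-volume state, assuming the 
{{timeoutCount}} / {{recordCheckTimeout()}} / {{resetTimeoutCount()}} shape 
described above (the class name and constructor are hypothetical, not 
existing Ozone code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the proposed per-volume consecutive-timeout state;
// only the method names come from this description, not from Ozone source.
class VolumeTimeoutState {
  // Value of hdds.datanode.disk.check.timeout.tolerated (default 1).
  private final int toleratedTimeouts;
  private final AtomicInteger timeoutCount = new AtomicInteger();

  VolumeTimeoutState(int toleratedTimeouts) {
    this.toleratedTimeouts = toleratedTimeouts;
  }

  /** Record that this volume was still pending when the latch expired.
   *  Returns true when consecutive timeouts exceed the tolerance, i.e.
   *  the volume should now be added to the failed set. */
  boolean recordCheckTimeout() {
    return timeoutCount.incrementAndGet() > toleratedTimeouts;
  }

  /** Reset after the volume completes a healthy check in a finished round. */
  void resetTimeoutCount() {
    timeoutCount.set(0);
  }
}
```

With the default tolerance of 1, the first timeout round returns false 
(tolerated) and the second returns true; a tolerance of 0 fails on the 
first timeout, matching current behavior.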

 

  was:
h2. Summary
When {{StorageVolumeChecker.checkAllVolumes()}} hits the global latch timeout 
({{hdds.datanode.disk.check.timeout}}), the implementation currently treats 
*every* volume that has not yet reported as FAILED in one shot. Under transient 
conditions (e.g. kernel I/O saturation causing {{fsync}} in {{DiskCheckUtil}} 
to block for the full timeout on multiple volumes), this produces false volume 
failures and a burst of "Volume failure" log lines at the same timestamp. This 
JIRA implements *Option C*: per-volume *consecutive* timeout tolerance so the 
first timeout round can be tolerated and the volume is only failed after 
tolerance is exceeded.

h2. Problem
* {{checkAllVolumes()}} uses a single {{CountDownLatch}}; if 
{{latch.await(maxAllowedTimeForCheckMs)}} returns false, the code returns 
{{Sets.difference(allVolumes, healthyVolumes)}} — all pending volumes are 
failed immediately.
* Per-volume IO failure sliding window in {{StorageVolume.check()}} does *not* 
apply to volumes whose async check never completes before the latch expires.
* JVM-level distinction of "timeout due to GC" vs "timeout due to slow 
disk/fsync" is unreliable; we do not attempt it — we use the same *tolerance* 
philosophy as IO checks.
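For illustration, the current timeout path described above amounts to the 
following simplified, self-contained sketch (plain string sets stand in for 
the actual Ozone volume types; this is not the real source):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Simplified stand-in for the current behavior: one latch covers every
// volume's async check, and on expiry every still-pending volume is
// returned as failed in a single shot.
class LatchTimeoutSketch {
  static Set<String> failedAfterTimeout(Set<String> allVolumes,
      Set<String> healthyVolumes, CountDownLatch latch,
      long maxAllowedTimeForCheckMs) throws InterruptedException {
    if (!latch.await(maxAllowedTimeForCheckMs, TimeUnit.MILLISECONDS)) {
      // Equivalent of Sets.difference(allVolumes, healthyVolumes):
      Set<String> failed = new HashSet<>(allVolumes);
      failed.removeAll(healthyVolumes);
      return failed;
    }
    return new HashSet<>(); // every check reported before the deadline
  }
}
```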

h2. Proposed solution (Option C only)
* Add config: {{hdds.datanode.disk.check.timeout.tolerated}} (default: 1).
  *Meaning:* allow up to *N* consecutive timeout rounds per volume before 
marking failed; fail when timeouts exceed tolerance (e.g. default 1 → fail on 
second consecutive timeout for that volume).
* Per {{StorageVolume}} (or appropriate type): {{timeoutCount}} + 
{{recordCheckTimeout()}} / {{resetTimeoutCount()}} (reset when a volume 
completes a healthy check in a finished round).
* {{StorageVolumeChecker.checkAllVolumes()}}: on latch timeout, for pending 
volumes, only add to returned failed set if {{recordCheckTimeout()}} indicates 
tolerance exceeded; always include volumes already in explicit 
{{failedVolumes}} from {{VolumeCheckResult.FAILED}}.

h2. Out of scope (explicitly *not* this JIRA)
* Replacing count-based IO sliding window with time-based {{SlidingWindow}} 
(HDDS-13108).
* Derived minimum sliding window duration / reviewer formula for 
{{W_effective}}.
* Changes to {{hdds.datanode.disk.check.io.test.count}} / IO failure semantics 
beyond what is required for compilation/coexistence.

h2. Acceptance criteria
* With defaults, first {{checkAllVolumes}} timeout for a volume does not mark 
that volume failed; second consecutive timeout without an intervening 
successful check marks it failed.
* {{hdds.datanode.disk.check.timeout.tolerated=0}} restores current behavior 
(immediate fail on timeout for all pending volumes).
* Volumes that return {{VolumeCheckResult.FAILED}} from {{check()}} are still 
failed immediately (no change).
* Unit/integration tests cover: tolerate-first-timeout; fail-on-second; reset 
after successful check.

h2. References
* Internal design: Option C in GC-Aware-Volume-Checker-Design.md (latch timeout 
only).
* Related incident context: BofA DN volume failure bursts (same-ms timestamps).


> DataNode: tolerate per-volume health-check latch timeouts before marking 
> volumes failed
> ---------------------------------------------------------------------------------------
>
>                 Key: HDDS-14871
>                 URL: https://issues.apache.org/jira/browse/HDDS-14871
>             Project: Apache Ozone
>          Issue Type: Task
>          Components: Ozone Datanode
>            Reporter: Devesh Kumar Singh
>            Assignee: Devesh Kumar Singh
>            Priority: Major
>
> h2. Summary
> When {{StorageVolumeChecker.checkAllVolumes()}} hits the global latch timeout 
> ({{hdds.datanode.disk.check.timeout}}), the implementation currently 
> treats *every* volume that has not yet reported as FAILED in one shot. This 
> JIRA implements per-volume *consecutive* timeout tolerance so the first 
> timeout round can be tolerated and the volume is only failed after tolerance 
> is exceeded.
> h2. Proposed solution
>  * Add config: {{hdds.datanode.disk.check.timeout.tolerated}} (default: 1).
>    *Meaning:* allow up to *N* consecutive timeout rounds per volume 
> before marking failed; fail when timeouts exceed tolerance (e.g. default 1 → 
> fail on second consecutive timeout for that volume).
>  * Per {{StorageVolume}} (or appropriate type): {{timeoutCount}} + 
> {{recordCheckTimeout()}} / {{resetTimeoutCount()}} (reset when a volume 
> completes a healthy check in a finished round).
>  * {{StorageVolumeChecker.checkAllVolumes()}}: on latch timeout, for 
> pending volumes, only add to returned failed set if {{recordCheckTimeout()}} 
> indicates tolerance exceeded; always include volumes already in explicit 
> {{failedVolumes}} from {{VolumeCheckResult.FAILED}}.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
