Devesh Kumar Singh created HDDS-14871:
-----------------------------------------

             Summary: DataNode: tolerate per-volume health-check latch timeouts 
before marking volumes failed
                 Key: HDDS-14871
                 URL: https://issues.apache.org/jira/browse/HDDS-14871
             Project: Apache Ozone
          Issue Type: Task
          Components: Ozone Datanode
            Reporter: Devesh Kumar Singh
            Assignee: Devesh Kumar Singh


h2. Summary
When {{StorageVolumeChecker.checkAllVolumes()}} hits the global latch timeout 
({{hdds.datanode.disk.check.timeout}}), the implementation currently treats 
*every* volume that has not yet reported as FAILED in one shot. Under transient 
conditions (e.g. kernel I/O saturation causing {{fsync}} in {{DiskCheckUtil}} 
to block for the full timeout on multiple volumes), this produces false volume 
failures and a burst of "Volume failure" log lines at the same timestamp. This 
JIRA implements *Option C*: per-volume *consecutive* timeout tolerance, so the 
first timeout round is tolerated and a volume is only failed after its 
tolerance is exceeded.

h2. Problem
* {{checkAllVolumes()}} uses a single {{CountDownLatch}}; if 
{{latch.await(maxAllowedTimeForCheckMs)}} returns false, the code returns 
{{Sets.difference(allVolumes, healthyVolumes)}}, so all pending volumes are 
failed immediately.
* The per-volume IO-failure sliding window in {{StorageVolume.check()}} does 
*not* apply to volumes whose async check never completes before the latch 
expires.
* Distinguishing "timeout due to GC" from "timeout due to slow disk/fsync" at 
the JVM level is unreliable, so we do not attempt it; instead we apply the same 
*tolerance* philosophy already used for IO checks.
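
The failure mode above can be reduced to a minimal sketch (illustrative only, not the actual Ozone source; plain strings stand in for {{StorageVolume}} objects, and the method/variable names mirror the ones cited in this issue):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch of the current behavior: one shared latch guards all per-volume
// checks, so a single latch timeout fails every volume that has not yet
// reported, regardless of why each one is slow.
public class LatchTimeoutSketch {

  static Set<String> checkAllVolumes(Set<String> allVolumes,
      Set<String> healthyVolumes, CountDownLatch latch, long timeoutMs) {
    boolean finished;
    try {
      finished = latch.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      finished = false;
    }
    if (!finished) {
      // Current behavior: all pending volumes are failed in one shot,
      // producing the same-timestamp "Volume failure" burst.
      Set<String> failed = new HashSet<>(allVolumes);
      failed.removeAll(healthyVolumes);
      return failed;
    }
    return new HashSet<>();
  }

  public static void main(String[] args) {
    Set<String> all = new HashSet<>(Set.of("/data1", "/data2", "/data3"));
    Set<String> healthy = new HashSet<>(Set.of("/data1"));
    // Latch never counted down: simulates checks blocked in fsync.
    Set<String> failed = checkAllVolumes(all, healthy, new CountDownLatch(2), 10);
    System.out.println(failed.size()); // prints 2: both pending volumes failed at once
  }
}
```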

h2. Proposed solution (Option C only)
* Add config: {{hdds.datanode.disk.check.timeout.tolerated}} (default: 1).
  *Meaning:* allow up to *N* consecutive timeout rounds per volume before 
marking it failed; fail only when consecutive timeouts exceed the tolerance 
(e.g. with the default of 1, a volume fails on its second consecutive timeout).
* Per {{StorageVolume}} (or appropriate type): a {{timeoutCount}} plus 
{{recordCheckTimeout()}} / {{resetTimeoutCount()}} (reset when the volume 
completes a healthy check in a finished round).
* {{StorageVolumeChecker.checkAllVolumes()}}: on latch timeout, add a pending 
volume to the returned failed set only if {{recordCheckTimeout()}} indicates 
its tolerance is exceeded; always include volumes already in the explicit 
{{failedVolumes}} set from {{VolumeCheckResult.FAILED}}.
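
The per-volume state could look like the sketch below. The method names ({{recordCheckTimeout}}, {{resetTimeoutCount}}) and the {{timeoutCount}} field come from this issue; the class shape, constructor, and {{AtomicInteger}} choice are assumptions, not the final implementation:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical per-volume tolerance state for Option C. An AtomicInteger is
// used because the checker's async callbacks and the latch-timeout path may
// touch the counter from different threads.
public class VolumeTimeoutTolerance {

  // Value of hdds.datanode.disk.check.timeout.tolerated (default: 1).
  private final int toleratedTimeouts;
  private final AtomicInteger timeoutCount = new AtomicInteger();

  public VolumeTimeoutTolerance(int toleratedTimeouts) {
    this.toleratedTimeouts = toleratedTimeouts;
  }

  /**
   * Records one latch-timeout round for this volume.
   * @return true once consecutive timeouts exceed the tolerance,
   *         i.e. the volume should now be marked failed.
   */
  public boolean recordCheckTimeout() {
    return timeoutCount.incrementAndGet() > toleratedTimeouts;
  }

  /** Called when the volume completes a healthy check in a finished round. */
  public void resetTimeoutCount() {
    timeoutCount.set(0);
  }
}
```

With the default tolerance of 1, the first {{recordCheckTimeout()}} returns false (tolerated) and the second returns true; a tolerance of 0 returns true on the first call, matching the current immediate-fail behavior.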

h2. Out of scope (explicitly *not* this JIRA)
* Replacing the count-based IO sliding window with a time-based 
{{SlidingWindow}} (HDDS-13108).
* Derived minimum sliding-window duration / reviewer formula for 
{{W_effective}}.
* Changes to {{hdds.datanode.disk.check.io.test.count}} / IO failure semantics 
beyond what is required for compilation/coexistence.

h2. Acceptance criteria
* With defaults, the first {{checkAllVolumes}} timeout for a volume does not 
mark that volume failed; a second consecutive timeout without an intervening 
successful check marks it failed.
* {{hdds.datanode.disk.check.timeout.tolerated=0}} restores current behavior 
(immediate fail on timeout for all pending volumes).
* Volumes that return {{VolumeCheckResult.FAILED}} from {{check()}} are still 
failed immediately (no change).
* Unit/integration tests cover: tolerate-first-timeout; fail-on-second; reset 
after a successful check.
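
The three test scenarios above could be exercised along these lines; {{Counter}} is a hypothetical stand-in for the eventual per-volume state, not the real class:

```java
// Self-contained sketch of the acceptance-criteria tests. Uses explicit
// throws rather than `assert` so it fails even without the -ea flag.
public class TimeoutToleranceTest {

  // Minimal stand-in for the proposed per-volume timeout state.
  static final class Counter {
    final int tolerated;
    int count;
    Counter(int tolerated) { this.tolerated = tolerated; }
    boolean recordCheckTimeout() { return ++count > tolerated; }
    void resetTimeoutCount() { count = 0; }
  }

  static void check(boolean cond, String msg) {
    if (!cond) { throw new AssertionError(msg); }
  }

  public static void main(String[] args) {
    // Default tolerated=1: first timeout tolerated, second consecutive fails.
    Counter c = new Counter(1);
    check(!c.recordCheckTimeout(), "first timeout must be tolerated");
    check(c.recordCheckTimeout(), "second consecutive timeout must fail");

    // A successful check in a finished round resets the streak.
    Counter r = new Counter(1);
    r.recordCheckTimeout();
    r.resetTimeoutCount();
    check(!r.recordCheckTimeout(), "reset must clear the timeout streak");

    // tolerated=0 restores current behavior: immediate fail on timeout.
    Counter zero = new Counter(0);
    check(zero.recordCheckTimeout(), "tolerated=0 must fail on first timeout");

    System.out.println("ok");
  }
}
```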

h2. References
* Internal design: Option C in GC-Aware-Volume-Checker-Design.md (latch timeout 
only).
* Related incident context: BofA DN volume failure bursts (same-ms timestamps).


