Devesh Kumar Singh created HDDS-14871:
-----------------------------------------
Summary: DataNode: tolerate per-volume health-check latch timeouts
before marking volumes failed
Key: HDDS-14871
URL: https://issues.apache.org/jira/browse/HDDS-14871
Project: Apache Ozone
Issue Type: Task
Components: Ozone Datanode
Reporter: Devesh Kumar Singh
Assignee: Devesh Kumar Singh
h2. Summary
When \{{StorageVolumeChecker.checkAllVolumes()}} hits the global latch timeout
(\{{hdds.datanode.disk.check.timeout}}), the implementation currently marks
*every* volume that has not yet reported a result as FAILED in one shot. Under transient
conditions (e.g. kernel I/O saturation causing \{{fsync}} in \{{DiskCheckUtil}}
to block for the full timeout on multiple volumes), this produces false volume
failures and a burst of "Volume failure" log lines at the same timestamp. This
JIRA implements *Option C*: per-volume *consecutive* timeout tolerance so the
first timeout round can be tolerated and the volume is only failed after
tolerance is exceeded.
h2. Problem
* \{{checkAllVolumes()}} uses a single \{{CountDownLatch}}; if
\{{latch.await(maxAllowedTimeForCheckMs)}} returns false, the code returns
\{{Sets.difference(allVolumes, healthyVolumes)}} — all pending volumes are
failed immediately.
* Per-volume IO failure sliding window in \{{StorageVolume.check()}} does *not*
apply to volumes whose async check never completes before the latch expires.
* Distinguishing a timeout caused by a JVM GC pause from one caused by a slow
disk/\{{fsync}} is unreliable at the JVM level; we do not attempt it and instead
apply the same *tolerance* philosophy as the IO checks.
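For illustration, the current latch handling reduces to the simplified Java sketch below: every volume still pending when the latch expires is failed, regardless of why it is slow. The \{{String}}-keyed sets, class name, and latch wiring are illustrative stand-ins, not the actual \{{StorageVolumeChecker}} code.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Simplified model of the current behavior: one latch guards the whole
// round, and on latch timeout the set difference (all - healthy) is
// returned as failed in one shot.
public class LatchTimeoutDemo {
  static Set<String> checkAllVolumes(Set<String> allVolumes,
                                     Set<String> healthyVolumes,
                                     CountDownLatch latch,
                                     long timeoutMs) throws InterruptedException {
    if (!latch.await(timeoutMs, TimeUnit.MILLISECONDS)) {
      // Current behavior: every volume that has not yet reported healthy
      // is failed immediately, even if its check is merely slow.
      Set<String> failed = new HashSet<>(allVolumes);
      failed.removeAll(healthyVolumes);
      return failed;
    }
    return new HashSet<>();
  }

  public static void main(String[] args) throws InterruptedException {
    Set<String> all = Set.of("vol1", "vol2", "vol3");
    Set<String> healthy = Set.of("vol1");
    // Latch never counted down: simulates two checks blocked in fsync.
    CountDownLatch latch = new CountDownLatch(2);
    Set<String> failed = checkAllVolumes(all, healthy, latch, 50);
    System.out.println(failed.size() + " volumes failed in one shot");
  }
}
```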
h2. Proposed solution (Option C only)
* Add config: \{{hdds.datanode.disk.check.timeout.tolerated}} (default: 1).
*Meaning:* allow up to *N* consecutive timeout rounds per volume before
marking failed; fail when timeouts exceed tolerance (e.g. default 1 → fail on
second consecutive timeout for that volume).
* Per \{{StorageVolume}} (or appropriate type): \{{timeoutCount}} +
\{{recordCheckTimeout()}} / \{{resetTimeoutCount()}} (reset when a volume
completes a healthy check in a finished round).
* \{{StorageVolumeChecker.checkAllVolumes()}}: on latch timeout, for pending
volumes, only add to returned failed set if \{{recordCheckTimeout()}} indicates
tolerance exceeded; always include volumes already in explicit
\{{failedVolumes}} from \{{VolumeCheckResult.FAILED}}.
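A minimal Java sketch of the proposed latch-timeout handling, under stated assumptions: the \{{Map}}-based counter stands in for a per-\{{StorageVolume}} field, and the class/method shapes are illustrative, not the final implementation. Pending volumes are failed only once their consecutive timeouts exceed tolerance; volumes that explicitly returned \{{VolumeCheckResult.FAILED}} are always included.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical Option C decision logic for the latch-timeout path.
public class ToleratedTimeoutDemo {
  // Per-volume consecutive timeout counters (stand-in for a field on
  // StorageVolume in the real change).
  private final Map<String, Integer> timeoutCounts = new HashMap<>();
  private final int tolerated; // hdds.datanode.disk.check.timeout.tolerated

  public ToleratedTimeoutDemo(int tolerated) {
    this.tolerated = tolerated;
  }

  /** Returns true when this volume's consecutive timeouts exceed tolerance. */
  boolean recordCheckTimeout(String volume) {
    return timeoutCounts.merge(volume, 1, Integer::sum) > tolerated;
  }

  /** Reset when the volume completes a healthy check in a finished round. */
  void resetTimeoutCount(String volume) {
    timeoutCounts.remove(volume);
  }

  /** On latch timeout: decide which volumes to report as failed. */
  Set<String> onLatchTimeout(Set<String> pending, Set<String> explicitlyFailed) {
    // Explicit VolumeCheckResult.FAILED results are always failed.
    Set<String> failed = new HashSet<>(explicitlyFailed);
    for (String v : pending) {
      if (recordCheckTimeout(v)) {
        failed.add(v); // tolerance exceeded for this volume
      }
    }
    return failed;
  }

  public static void main(String[] args) {
    ToleratedTimeoutDemo checker = new ToleratedTimeoutDemo(1); // default
    Set<String> pending = Set.of("vol1");
    Set<String> explicit = Set.of("vol2"); // returned FAILED from check()
    System.out.println("round 1 fails vol1: "
        + checker.onLatchTimeout(pending, explicit).contains("vol1"));
    System.out.println("round 2 fails vol1: "
        + checker.onLatchTimeout(pending, explicit).contains("vol1"));
  }
}
```

With \{{tolerated=0}} the same logic fails every pending volume on the first timeout, which is how the config restores current behavior.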
h2. Out of scope (explicitly *not* this JIRA)
* Replacing count-based IO sliding window with time-based \{{SlidingWindow}}
(HDDS-13108).
* Derived minimum sliding window duration / reviewer formula for
\{{W_effective}}.
* Changes to \{{hdds.datanode.disk.check.io.test.count}} / IO failure semantics
beyond what is required for compilation/coexistence.
h2. Acceptance criteria
* With defaults, first \{{checkAllVolumes}} timeout for a volume does not mark
that volume failed; second consecutive timeout without an intervening
successful check marks it failed.
* \{{hdds.datanode.disk.check.timeout.tolerated=0}} restores current behavior
(immediate fail on timeout for all pending volumes).
* Volumes that return \{{VolumeCheckResult.FAILED}} from \{{check()}} are still
failed immediately (no change).
* Unit/integration tests cover: tolerate-first-timeout; fail-on-second; reset
after successful check.
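The acceptance criteria above can be sketched as a self-contained, unit-test-style check against a hypothetical consecutive-timeout counter; real tests would target \{{StorageVolumeChecker}} and \{{StorageVolume}} directly, and the names here are assumptions.

```java
// Minimal sketch of the three counter-level acceptance checks.
public class AcceptanceCriteriaSketch {
  static final class ConsecutiveTimeouts {
    private final int tolerated;
    private int count;
    ConsecutiveTimeouts(int tolerated) { this.tolerated = tolerated; }
    /** true means the volume should now be failed. */
    boolean recordTimeout() { return ++count > tolerated; }
    /** Called when a healthy check completes in a finished round. */
    void resetOnHealthyCheck() { count = 0; }
  }

  public static void main(String[] args) {
    // Default tolerated=1: first timeout tolerated, second fails.
    ConsecutiveTimeouts dflt = new ConsecutiveTimeouts(1);
    System.out.println("fail on 1st timeout: " + dflt.recordTimeout());
    System.out.println("fail on 2nd timeout: " + dflt.recordTimeout());

    // tolerated=0 restores current behavior: immediate fail.
    ConsecutiveTimeouts legacy = new ConsecutiveTimeouts(0);
    System.out.println("tolerated=0, 1st fails: " + legacy.recordTimeout());

    // A successful check resets the streak; the next timeout is tolerated again.
    ConsecutiveTimeouts reset = new ConsecutiveTimeouts(1);
    reset.recordTimeout();
    reset.resetOnHealthyCheck();
    System.out.println("fail after reset + 1 timeout: " + reset.recordTimeout());
  }
}
```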
h2. References
* Internal design: Option C in GC-Aware-Volume-Checker-Design.md (latch timeout
only).
* Related incident context: BofA DN volume failure bursts (same-ms timestamps).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)