[
https://issues.apache.org/jira/browse/HDDS-14871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Devesh Kumar Singh updated HDDS-14871:
--------------------------------------
Description:
h2. Summary
When {{StorageVolumeChecker.checkAllVolumes()}} hits the global latch timeout
({{hdds.datanode.disk.check.timeout}}), the implementation currently treats
*every* volume that has not yet reported as FAILED in one shot. This JIRA
implements per-volume *consecutive* timeout tolerance so the first timeout
round can be tolerated and the volume is only failed after tolerance is
exceeded.
h2. Proposed solution
* Add config: {{hdds.datanode.disk.check.timeout.tolerated}} (default: 1).
*Meaning:* allow up to *N* consecutive timeout rounds per volume before
marking failed; fail when timeouts exceed tolerance (e.g. default 1 → fail on
the second consecutive timeout for that volume).
* Per {{StorageVolume}} (or appropriate type): {{timeoutCount}} +
{{recordCheckTimeout()}} / {{resetTimeoutCount()}} (reset when a volume
completes a healthy check in a finished round).
* {{StorageVolumeChecker.checkAllVolumes()}}: on latch timeout, for
pending volumes, only add to returned failed set if {{recordCheckTimeout()}}
indicates tolerance exceeded; always include volumes already in explicit
{{failedVolumes}} from {{VolumeCheckResult.FAILED}}.
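The per-volume counter proposed above could be sketched as follows. The names {{timeoutCount}}, {{recordCheckTimeout()}}, and {{resetTimeoutCount()}} follow this JIRA; the surrounding class and constructor are hypothetical stand-ins, not the actual {{StorageVolume}} wiring:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the per-volume consecutive-timeout counter.
// Method names follow the JIRA; the class itself is illustrative.
class VolumeTimeoutTracker {
  // Consecutive latch-timeout rounds tolerated before failing the volume
  // (would map to hdds.datanode.disk.check.timeout.tolerated, default 1).
  private final int toleratedTimeouts;
  private final AtomicInteger timeoutCount = new AtomicInteger(0);

  VolumeTimeoutTracker(int toleratedTimeouts) {
    this.toleratedTimeouts = toleratedTimeouts;
  }

  /** Records one timed-out check round; returns true once tolerance is exceeded. */
  boolean recordCheckTimeout() {
    return timeoutCount.incrementAndGet() > toleratedTimeouts;
  }

  /** Called when the volume completes a healthy check in a finished round. */
  void resetTimeoutCount() {
    timeoutCount.set(0);
  }
}
```

With the default tolerance of 1, the first call to {{recordCheckTimeout()}} returns false (tolerated) and the second returns true (fail), matching the semantics above; a tolerance of 0 fails on the first call.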
was:
h2. Summary
When {{StorageVolumeChecker.checkAllVolumes()}} hits the global latch timeout
({{hdds.datanode.disk.check.timeout}}), the implementation currently treats
*every* volume that has not yet reported as FAILED in one shot. Under transient
conditions (e.g. kernel I/O saturation causing {{fsync}} in {{DiskCheckUtil}}
to block for the full timeout on multiple volumes), this produces false volume
failures and a burst of "Volume failure" log lines at the same timestamp. This
JIRA implements *Option C*: per-volume *consecutive* timeout tolerance so the
first timeout round can be tolerated and the volume is only failed after
tolerance is exceeded.
h2. Problem
* {{checkAllVolumes()}} uses a single {{CountDownLatch}}; if
{{latch.await(maxAllowedTimeForCheckMs)}} returns false, the code returns
{{Sets.difference(allVolumes, healthyVolumes)}}, so all pending volumes are
failed immediately.
* The per-volume IO failure sliding window in {{StorageVolume.check()}} does *not*
apply to volumes whose async check never completes before the latch expires.
* A JVM-level distinction of "timeout due to GC" vs. "timeout due to slow
disk/fsync" is unreliable; we do not attempt it and instead apply the same
*tolerance* philosophy as the IO checks.
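The current behavior in the first bullet can be modeled with a minimal, self-contained sketch; the class and method here are illustrative stand-ins (the real logic lives in {{StorageVolumeChecker}} and uses Guava's {{Sets.difference}} rather than {{removeAll}}):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Minimal model of the current latch behavior: one latch guards all
// per-volume checks, and every volume still pending when the latch times
// out lands in the failed set at once.
class LatchTimeoutDemo {
  static Set<String> checkAll(Set<String> allVolumes, Set<String> healthyVolumes,
                              CountDownLatch latch, long timeoutMs)
      throws InterruptedException {
    if (!latch.await(timeoutMs, TimeUnit.MILLISECONDS)) {
      // Current behavior: everything not yet reported healthy fails immediately,
      // equivalent to Sets.difference(allVolumes, healthyVolumes).
      Set<String> failed = new HashSet<>(allVolumes);
      failed.removeAll(healthyVolumes);
      return failed;
    }
    return new HashSet<>();
  }
}
```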
h2. Proposed solution (Option C only)
* Add config: {{hdds.datanode.disk.check.timeout.tolerated}} (default: 1).
*Meaning:* allow up to *N* consecutive timeout rounds per volume before
marking failed; fail when timeouts exceed tolerance (e.g. default 1 → fail on
the second consecutive timeout for that volume).
* Per {{StorageVolume}} (or appropriate type): {{timeoutCount}} +
{{recordCheckTimeout()}} / {{resetTimeoutCount()}} (reset when a volume
completes a healthy check in a finished round).
* {{StorageVolumeChecker.checkAllVolumes()}}: on latch timeout, for pending
volumes, only add to returned failed set if {{recordCheckTimeout()}} indicates
tolerance exceeded; always include volumes already in explicit
{{failedVolumes}} from {{VolumeCheckResult.FAILED}}.
h2. Out of scope (explicitly *not* this JIRA)
* Replacing the count-based IO sliding window with a time-based {{SlidingWindow}}
(HDDS-13108).
* Derived minimum sliding window duration / reviewer formula for
{{W_effective}}.
* Changes to {{hdds.datanode.disk.check.io.test.count}} / IO failure semantics
beyond what is required for compilation/coexistence.
h2. Acceptance criteria
* With defaults, the first {{checkAllVolumes}} timeout for a volume does not mark
that volume failed; a second consecutive timeout without an intervening
successful check marks it failed.
* {{hdds.datanode.disk.check.timeout.tolerated=0}} restores current behavior
(immediate fail on timeout for all pending volumes).
* Volumes that return {{VolumeCheckResult.FAILED}} from {{check()}} are still
failed immediately (no change).
* Unit/integration tests cover: tolerate-first-timeout; fail-on-second; reset
after successful check.
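As a sketch, the criteria above can be expressed as plain assertions against a minimal stand-in counter. The {{Counter}} class here is hypothetical; real tests would exercise {{StorageVolumeChecker}} end to end:

```java
// Acceptance criteria as plain assertions against a hypothetical stand-in
// for the proposed per-volume consecutive-timeout counter.
class TimeoutToleranceAcceptance {
  static final class Counter {
    private final int tolerated;
    private int count;
    Counter(int tolerated) { this.tolerated = tolerated; }
    boolean recordCheckTimeout() { return ++count > tolerated; }
    void resetTimeoutCount() { count = 0; }
  }

  public static void main(String[] args) {
    // Default tolerance 1: first timeout tolerated, second consecutive fails.
    Counter c = new Counter(1);
    assert !c.recordCheckTimeout() : "first timeout must be tolerated";
    assert c.recordCheckTimeout() : "second consecutive timeout must fail";

    // A successful check in a finished round resets the counter.
    c.resetTimeoutCount();
    assert !c.recordCheckTimeout() : "tolerance restored after healthy check";

    // tolerated=0 restores current behavior: immediate fail on timeout.
    Counter legacy = new Counter(0);
    assert legacy.recordCheckTimeout() : "zero tolerance fails immediately";
  }
}
```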
h2. References
* Internal design: Option C in GC-Aware-Volume-Checker-Design.md (latch timeout
only).
* Related incident context: BofA DN volume failure bursts (same-ms timestamps).
> DataNode: tolerate per-volume health-check latch timeouts before marking
> volumes failed
> ---------------------------------------------------------------------------------------
>
> Key: HDDS-14871
> URL: https://issues.apache.org/jira/browse/HDDS-14871
> Project: Apache Ozone
> Issue Type: Task
> Components: Ozone Datanode
> Reporter: Devesh Kumar Singh
> Assignee: Devesh Kumar Singh
> Priority: Major
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)