Re: [PR] HDDS-14871. DataNode: tolerate per-volume health-check latch timeouts before marking volumes failed. [ozone]

via GitHub Tue, 14 Apr 2026 07:37:47 -0700


devmadhuu commented on code in PR #9954:
URL: https://github.com/apache/ozone/pull/9954#discussion_r3080254441



##########
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java:
##########
@@ -753,6 +739,92 @@ public synchronized VolumeCheckResult check(@Nullable 
Boolean unused)
     return VolumeCheckResult.HEALTHY;
   }
 
+  /**
+   * Called by {@link StorageVolumeChecker} when a volume check times out —
+   * either because the global {@code checkAllVolumes()} latch expired before
+   * this volume's async check completed, or because the per-check timeout
+   * inside {@link ThrottledAsyncChecker} fired.
+   *
+   * <p><b>Must not be {@code synchronized}.</b> When a timeout fires,
+   * {@link #check} may still be executing and holding the object lock — that
+   * is precisely why the timeout occurred. Acquiring the same lock here would
+   * deadlock or stall {@link StorageVolumeChecker#checkAllVolumes} until the
+   * hung check finally returns.
+   *
+   * <p>Instead, a dedicated {@link AtomicInteger} ({@code
+   * consecutiveTimeoutCount}) tracks consecutive timeouts without any locking.
+   * The threshold reuses the existing {@code ioFailureTolerance} so no
+   * additional configuration key is required.
+   *
+   * <p>Recovery: call {@link #resetTimeoutCount()} when a check completes
+   * successfully to break the timeout streak.
+   *
+   * @return {@code true} if {@code consecutiveTimeoutCount > 
ioFailureTolerance},
+   *         meaning the volume should now be marked FAILED; {@code false} if
+   *         the timeout is still within tolerance this round.
+   */
+  public boolean recordTimeoutAsIOFailure() {
+    int count = consecutiveTimeoutCount.incrementAndGet();
+    if (count > ioFailureTolerance) {
+      LOG.error("Volume {} check timed out {} consecutive time(s),"
+              + " exceeding tolerance of {}. Marking FAILED.",
+          this, count, ioFailureTolerance);
+      return true;
+    }
+    LOG.warn("Volume {} check timed out ({}/{} consecutive timeouts 
tolerated)."
+            + " Common transient causes: kernel I/O scheduler saturation"
+            + " or JVM GC pressure. Volume will be failed if the next check"
+            + " also times out.",
+        this, count, ioFailureTolerance);
+    return false;
+  }
+
+  /**
+   * Resets the consecutive-timeout counter to 0.
+   *
+   * <p>Called by {@link StorageVolumeChecker} when this volume's check
+   * completes successfully, indicating that the transient stall has resolved
+   * and any accumulated timeout count should not carry over to the next cycle.
+   *
+   * <p>No synchronization needed — operates on an {@link AtomicInteger}.
+   */
+  public void resetTimeoutCount() {
+    int prev = consecutiveTimeoutCount.getAndSet(0);
+    if (prev > 0 && LOG.isDebugEnabled()) {
+      LOG.debug("Volume {} completed a healthy check. Consecutive timeout"
+          + " count reset from {} to 0.", this, prev);
+    }
+  }

Review Comment:
   @ptlrs I have rebased the PR on #8843 and did changes. Can you have a look 
once ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HDDS-14871. DataNode: tolerate per-volume health-check latch timeouts before marking volumes failed. [ozone]

Reply via email to