errose28 commented on code in PR #4867:
URL: https://github.com/apache/ozone/pull/4867#discussion_r1244224564


##########
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java:
##########
@@ -509,21 +545,110 @@ public void failVolume() {
   public void shutdown() {
     setState(VolumeState.NON_EXISTENT);
     volumeInfo.ifPresent(VolumeInfo::shutdownUsageThread);
+    cleanTmpDiskCheckDir();
+  }
+
+  /**
+   * Delete all temporary files in the directory used ot check disk health.
+   */
+  private void cleanTmpDiskCheckDir() {
+    // If the volume was shut down before initialization completed, skip
+    // emptying the directory.
+    if (diskCheckDir == null) {
+      return;
+    }
+
+    if (!diskCheckDir.exists()) {
+      LOG.warn("Unable to clear disk check files from {}. Directory does " +
+          "not exist.", diskCheckDir);
+      return;
+    }
+
+    if (!diskCheckDir.isDirectory()) {
+      LOG.warn("Unable to clear disk check files from {}. Location is not a" +
+          " directory", diskCheckDir);
+      return;
+    }
+
+    try (Stream<Path> files = Files.list(diskCheckDir.toPath())) {
+      files.map(Path::toFile).filter(File::isFile).forEach(file -> {
+        try {
+          Files.delete(file.toPath());
+        } catch (IOException ex) {
+          LOG.warn("Failed to delete temporary volume health check file {}",
+              file);
+        }
+      });
+    } catch (IOException ex) {
+      LOG.warn("Failed to list contents of volume health check directory {} " +
+          "for deleting.", diskCheckDir);
+    }
   }
 
   /**
    * Run a check on the current volume to determine if it is healthy.
    * @param unused context for the check, ignored.
    * @return result of checking the volume.
+   * @throws InterruptedException if there was an error during the volume
+   *    check because the thread was interrupted.
    * @throws Exception if an exception was encountered while running
-   *            the volume check.
+   *    the volume check and the thread was not interrupted.
    */
   @Override
-  public VolumeCheckResult check(@Nullable Boolean unused) throws Exception {
-    if (!storageDir.exists()) {
+  public synchronized VolumeCheckResult check(@Nullable Boolean unused)
+      throws Exception {
+    boolean directoryChecksPassed =
+        DiskCheckUtil.checkExistence(storageDir) &&
+        DiskCheckUtil.checkPermissions(storageDir);
+    // If the directory is not present or has incorrect permissions, fail the
+    // volume immediately. This is not an intermittent error.
+    if (!directoryChecksPassed) {
+      if (Thread.currentThread().isInterrupted()) {
+        throw new InterruptedException("Directory check of volume " + this +
+            " interrupted.");
+      }
+      return VolumeCheckResult.FAILED;
+    }
+
+    // If IO test count is set to 0, IO tests for disk health are disabled.
+    if (ioTestCount == 0) {
+      return VolumeCheckResult.HEALTHY;
+    }
+
+    // Since IO errors may be intermittent, volume remains healthy until the
+    // threshold of failures is crossed.
+    boolean diskChecksPassed = DiskCheckUtil.checkReadWrite(storageDir,
+        diskCheckDir, healthCheckFileSize);
+    currentIOTestCount.incrementAndGet();
+    if (!diskChecksPassed) {
+      if (Thread.currentThread().isInterrupted()) {
+        throw new InterruptedException("IO check of volume " + this +
+            " interrupted.");
+      }
+      currentIOFailureCount.incrementAndGet();
+    }
+
+    if (LOG.isDebugEnabled()) {
+      LOG.debug("Volume {} ran IO test {} of {}. Seen {} of {} tolerated IO" +
+              "failures.", this, currentIOTestCount, ioTestCount,
+          currentIOFailureCount, ioFailureTolerance);
+    }
+
+    // If the failure threshold has been crossed, fail the volume without
+    // further scans.
+    // Once the volume is failed, it will not be checked anymore.
+    // The failure counts can be left as is.
+    if (currentIOFailureCount.get() > ioFailureTolerance) {
       return VolumeCheckResult.FAILED;
     }
-    DiskChecker.checkDir(storageDir);
+
+    // If we have completed the required number of volume checks without too
+    // many IO failures, reset the counters for future iterations.
+    if (currentIOTestCount.get() == ioTestCount) {

Review Comment:
   I considered a sliding window as well, but did it this way since maintaining 
the counters was simpler. Since ioTestCount should be small, a sliding window 
with a queue of max size ioTestCount for each volume should be fine. I will 
switch to this implementation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to