Mladjan Gadzic created HDDS-9022:
------------------------------------

             Summary: DiskChecker incorrectly reporting errors
                 Key: HDDS-9022
                 URL: https://issues.apache.org/jira/browse/HDDS-9022
             Project: Apache Ozone
          Issue Type: Bug
    Affects Versions: 1.3.0
            Reporter: Mladjan Gadzic
         Attachments: dn1.log

During a load test of an AWS-based Ozone cluster, we see datanodes shutting down. We believe this is caused by the new DiskChecker incorrectly reporting errors.
It happens because StorageVolume runs the check method without the diskCheckDir field having been initialized, so checkReadWrite is invoked with a null test-file directory.
Thanks [~xBis] for the research and notes on this!
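To illustrate the failure mode (a minimal sketch, not Ozone code, with a hypothetical file name): {{java.io.File}} treats a null parent directory as "no parent", so a test file created against the uninitialized diskCheckDir resolves relative to the process working directory, where the subsequent open then fails.
{code:java}
import java.io.File;

public class NullDirDemo {
  public static void main(String[] args) {
    // Mirrors the uninitialized diskCheckDir field: the parent File is null.
    File testFileDir = null;
    // File(File parent, String child) with a null parent behaves like
    // File(child): the path keeps only the child name.
    File testFile = new File(testFileDir, "disk-check-example");
    System.out.println(testFile.getPath());   // disk-check-example
    System.out.println(testFile.getParent()); // null
    // Any read/write against this path lands in the JVM's current working
    // directory instead of the volume's disk-check directory, which matches
    // the FileNotFoundException seen in the stack trace below.
  }
}
{code}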

Diff adding the new log messages:
{code:java}
diff --git a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
index b267b1d47..1cb0d0085 100644
--- a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
+++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
@@ -37,6 +37,8 @@
  * where the disk is mounted.
  */
 public final class DiskCheckUtil {
+  private static final Logger LOG =
+      LoggerFactory.getLogger(DiskCheckUtil.class);
   private DiskCheckUtil() { }
 
   // For testing purposes, an alternate check implementation can be provided
@@ -63,6 +65,20 @@ public static boolean checkPermissions(File storageDir) {
 
   public static boolean checkReadWrite(File storageDir, File testFileDir,
       int numBytesToWrite) {
+    if (storageDir == null) {
+      LOG.info("###storageDir is null. Printing stack trace: {}",
+          Arrays.toString(new NullPointerException().getStackTrace()));
+    } else {
+      LOG.info("###storageDir path={}", storageDir.getPath());
+    }
+
+    if (testFileDir == null) {
+      LOG.info("###testFileDir is null. Printing stack trace: {}",
+          Arrays.toString(new NullPointerException().getStackTrace()));
+    } else {
+      LOG.info("###testFileDir path={}", testFileDir.getPath());
+    }
+
     return impl.checkReadWrite(storageDir, testFileDir, numBytesToWrite);
   }
{code}
 

Stack trace (look specifically at the lines marked with "###"):
{code:java}
2023-07-15 10:07:38,006 [Periodic HDDS volume checker] INFO org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker: Scheduling a check for /hadoop/ozone/data/disk1/datanode/data/hdds
2023-07-15 10:07:38,007 [Periodic HDDS volume checker] INFO org.apache.hadoop.ozone.container.common.volume.StorageVolumeChecker: Scheduled health check for volume /hadoop/ozone/data/disk1/datanode/data/hdds
2023-07-15 10:07:38,007 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###storageDir path=/hadoop/ozone/data/disk1/datanode/data/hdds
2023-07-15 10:07:38,007 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###testFileDir path=/hadoop/ozone/data/disk1/datanode/data/hdds/CID-2eb5a782-379b-46c7-8bd1-8b19043c1a6e/tmp/disk-check
2023-07-15 10:07:38,010 [Periodic HDDS volume checker] INFO org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker: Scheduling a check for /hadoop/ozone/data/disk1/datanode
2023-07-15 10:07:38,010 [Periodic HDDS volume checker] INFO org.apache.hadoop.ozone.container.common.volume.StorageVolumeChecker: Scheduled health check for volume /hadoop/ozone/data/disk1/datanode
2023-07-15 10:07:38,010 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###storageDir path=/hadoop/ozone/data/disk1/datanode
2023-07-15 10:07:38,011 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###testFileDir is null. Printing stack trace: [org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil.checkReadWrite(DiskCheckUtil.java:76), org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:629), org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:68), org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker.lambda$schedule$0(ThrottledAsyncChecker.java:143), com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131), com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75), com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82), java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149), java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624), java.lang.Thread.run(Thread.java:750)]
2023-07-15 10:07:38,011 [DataNode DiskChecker thread 0] ERROR org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: Volume /hadoop/ozone/data/disk1/datanode failed health check. Could not find file disk-check-c8f5b40b-0cf6-420e-bbc3-9fb59def11eb for volume check.
java.io.FileNotFoundException: disk-check-c8f5b40b-0cf6-420e-bbc3-9fb59def11eb (No such file or directory)
        at java.io.FileOutputStream.open0(Native Method)
        at java.io.FileOutputStream.open(FileOutputStream.java:270)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
        at org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil$DiskChecksImpl.checkReadWrite(DiskCheckUtil.java:153)
        at org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil.checkReadWrite(DiskCheckUtil.java:82)
        at org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:629)
        at org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:68)
        at org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker.lambda$schedule$0(ThrottledAsyncChecker.java:143)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
{code}
 

The bug could not be reproduced on a Docker cluster, only on a real cluster. Steps to reproduce:
 # Add/modify the following properties:

{code:java}
OZONE-SITE.XML_hdds.datanode.periodic.disk.check.interval.minutes=1
OZONE-SITE.XML_hdds.datanode.disk.check.min.gap=0s {code}

 # Wait for a volume to fail the health check

Check the attached log file for more context.

Suggested workaround diff:
{code:java}
diff --git a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
index 95d1b2c2d..634b15a8e 100644
--- a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
+++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
@@ -624,6 +624,10 @@ public synchronized VolumeCheckResult check(@Nullable Boolean unused)
       return VolumeCheckResult.HEALTHY;
     }
 
+    if (diskCheckDir == null) {
+      diskCheckDir = storageDir;
+    }
+
     // Since IO errors may be intermittent, volume remains healthy until the
     // threshold of failures is crossed.
     boolean diskChecksPassed = DiskCheckUtil.checkReadWrite(storageDir,
{code}
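The same fall-back could alternatively live in the utility itself, so a null test-file directory can never reach the file I/O. This is a standalone sketch of that design choice (class and method names are hypothetical, not Ozone code):
{code:java}
import java.io.File;

public final class DiskCheckGuardSketch {
  private DiskCheckGuardSketch() { }

  // Hypothetical guard: when the caller passes a null test-file directory,
  // fall back to the storage directory, mirroring the workaround applied
  // in StorageVolume.check in the diff above.
  public static File resolveTestFileDir(File storageDir, File testFileDir) {
    return testFileDir != null ? testFileDir : storageDir;
  }

  public static void main(String[] args) {
    File storageDir = new File("/hadoop/ozone/data/disk1/datanode");
    // With a null testFileDir the check would run against storageDir
    // instead of a path relative to the process working directory.
    System.out.println(resolveTestFileDir(storageDir, null).getPath());
  }
}
{code}
A guard inside DiskCheckUtil would protect every caller at once, whereas the workaround above only fixes the StorageVolume.check call site; which placement is preferable is left to the reviewers.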



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
