[ 
https://issues.apache.org/jira/browse/HDDS-9022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Rose reassigned HDDS-9022:
--------------------------------

    Assignee: Ethan Rose

> DiskChecker incorrectly reporting errors
> ----------------------------------------
>
>                 Key: HDDS-9022
>                 URL: https://issues.apache.org/jira/browse/HDDS-9022
>             Project: Apache Ozone
>          Issue Type: Bug
>    Affects Versions: 1.3.0
>            Reporter: Mladjan Gadzic
>            Assignee: Ethan Rose
>            Priority: Major
>         Attachments: dn1.log
>
>
> During a load test of an AWS-based Ozone cluster, we saw datanodes 
> shutting down. We believe this is caused by the new DiskChecker incorrectly 
> reporting errors.
> It happens because StorageVolume's check method runs without the 
> diskCheckDir field having been initialized, so a null test directory is 
> passed to the read/write check.
> Thanks [~xBis] for the research and notes on this!
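> This matches java.io.File semantics: constructing a File with a null parent 
> directory yields a bare relative path, so the check file is resolved against 
> the process working directory and the write fails there. A minimal, 
> self-contained sketch (hypothetical class and method names, not Ozone code):
> {code:java}
> import java.io.File;
> 
> public class NullDirDemo {
>   // With a null parent, File(parent, child) behaves like File(child):
>   // the resulting path has no directory component at all.
>   static File resolveTestFile(File testFileDir, String fileName) {
>     return new File(testFileDir, fileName);
>   }
> 
>   public static void main(String[] args) {
>     File f = resolveTestFile(null, "disk-check-example");
>     System.out.println(f.getPath()); // prints just "disk-check-example"
>   }
> }
> {code}
> This lines up with the FileNotFoundException in the stacktrace, which shows 
> only the bare file name with no directory prefix.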
> Diff showing new log messages:
> {code:java}
> diff --git a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
> index b267b1d47..1cb0d0085 100644
> --- a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
> +++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
> @@ -37,6 +37,8 @@
>   * where the disk is mounted.
>   */
>  public final class DiskCheckUtil {
> +  private static final Logger LOG =
> +      LoggerFactory.getLogger(DiskCheckUtil.class);
>    private DiskCheckUtil() { }
>  
>    // For testing purposes, an alternate check implementation can be provided
> @@ -63,6 +65,20 @@ public static boolean checkPermissions(File storageDir) {
>  
>    public static boolean checkReadWrite(File storageDir, File testFileDir,
>        int numBytesToWrite) {
> +    if (storageDir == null) {
> +      LOG.info("###storageDir is null. Printing stack trace: {}",
> +          Arrays.toString(new NullPointerException().getStackTrace()));
> +    } else {
> +      LOG.info("###storageDir path={}", storageDir.getPath());
> +    }
> +
> +    if (testFileDir == null) {
> +      LOG.info("###testFileDir is null. Printing stack trace: {}",
> +          Arrays.toString(new NullPointerException().getStackTrace()));
> +    } else {
> +      LOG.info("###testFileDir path={}", testFileDir.getPath());
> +    }
> +
>      return impl.checkReadWrite(storageDir, testFileDir, numBytesToWrite);
>    }
>   {code}
>  
> Stacktrace (specifically, look at the lines marked with "###"):
> {code:java}
> 2023-07-15 10:07:38,006 [Periodic HDDS volume checker] INFO 
> org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker: 
> Scheduling a check for /hadoop/ozone/data/disk1/datanode/data/hdds
> 2023-07-15 10:07:38,007 [Periodic HDDS volume checker] INFO 
> org.apache.hadoop.ozone.container.common.volume.StorageVolumeChecker: 
> Scheduled health check for volume /hadoop/ozone/data/disk1/datanode/data/hdds
> 2023-07-15 10:07:38,007 [DataNode DiskChecker thread 0] INFO 
> org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###storageDir 
> path=/hadoop/ozone/data/disk1/datanode/data/hdds
> 2023-07-15 10:07:38,007 [DataNode DiskChecker thread 0] INFO 
> org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###testFileDir 
> path=/hadoop/ozone/data/disk1/datanode/data/hdds/CID-2eb5a782-379b-46c7-8bd1-8b19043c1a6e/tmp/disk-check
> 2023-07-15 10:07:38,010 [Periodic HDDS volume checker] INFO 
> org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker: 
> Scheduling a check for /hadoop/ozone/data/disk1/datanode
> 2023-07-15 10:07:38,010 [Periodic HDDS volume checker] INFO 
> org.apache.hadoop.ozone.container.common.volume.StorageVolumeChecker: 
> Scheduled health check for volume /hadoop/ozone/data/disk1/datanode
> 2023-07-15 10:07:38,010 [DataNode DiskChecker thread 0] INFO 
> org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###storageDir 
> path=/hadoop/ozone/data/disk1/datanode
> 2023-07-15 10:07:38,011 [DataNode DiskChecker thread 0] INFO 
> org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###testFileDir 
> is null. Printing stack trace: 
> [org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil.checkReadWrite(DiskCheckUtil.java:76),
>  
> org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:629),
>  
> org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:68),
>  
> org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker.lambda$schedule$0(ThrottledAsyncChecker.java:143),
>  
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131),
>  
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75),
>  
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82),
>  
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149),
>  
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624),
>  java.lang.Thread.run(Thread.java:750)]
> 2023-07-15 10:07:38,011 [DataNode DiskChecker thread 0] ERROR 
> org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: Volume 
> /hadoop/ozone/data/disk1/datanode failed health check. Could not find file 
> disk-check-c8f5b40b-0cf6-420e-bbc3-9fb59def11eb for volume check.
> java.io.FileNotFoundException: 
> disk-check-c8f5b40b-0cf6-420e-bbc3-9fb59def11eb (No such file or directory)
>         at java.io.FileOutputStream.open0(Native Method)
>         at java.io.FileOutputStream.open(FileOutputStream.java:270)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
>         at 
> org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil$DiskChecksImpl.checkReadWrite(DiskCheckUtil.java:153)
>         at 
> org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil.checkReadWrite(DiskCheckUtil.java:82)
>         at 
> org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:629)
>         at 
> org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:68)
>         at 
> org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker.lambda$schedule$0(ThrottledAsyncChecker.java:143)
>         at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
>         at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75)
>         at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:750)
> {code}
>  
> The bug could not be reproduced on a Docker cluster, only on a real cluster. 
> Steps to reproduce:
>  # Add/modify properties as such: 
> {noformat}
> OZONE-SITE.XML_hdds.datanode.periodic.disk.check.interval.minutes=1 
> OZONE-SITE.XML_hdds.datanode.disk.check.min.gap=0s{noformat}
>  # Wait for the volume to fail its health check.
> Check the attached log file (dn1.log) for more context.
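> For reference, on a non-containerized cluster the env-style keys above 
> correspond to ozone-site.xml entries like the following (a sketch; same 
> property names and values as given above):
> {code:xml}
> <property>
>   <name>hdds.datanode.periodic.disk.check.interval.minutes</name>
>   <value>1</value>
> </property>
> <property>
>   <name>hdds.datanode.disk.check.min.gap</name>
>   <value>0s</value>
> </property>
> {code}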
> Suggested workaround diff:
> {code:java}
> diff --git a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
> index 95d1b2c2d..634b15a8e 100644
> --- a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
> +++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
> @@ -624,6 +624,10 @@ public synchronized VolumeCheckResult check(@Nullable Boolean unused)
>        return VolumeCheckResult.HEALTHY;
>      }
>  
> +    if (diskCheckDir == null) {
> +      diskCheckDir = storageDir;
> +    }
> +
>      // Since IO errors may be intermittent, volume remains healthy until the
>      // threshold of failures is crossed.
>      boolean diskChecksPassed = DiskCheckUtil.checkReadWrite(storageDir,
>  {code}
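> An alternative, more defensive shape for the same workaround (a sketch with 
> hypothetical names, not the actual patch) is to resolve the fallback in one 
> place so the checker never receives a null directory:
> {code:java}
> import java.io.File;
> 
> public final class DiskCheckDirFallback {
>   private DiskCheckDirFallback() { }
> 
>   // Returns the dedicated disk-check directory if it has been
>   // initialized, otherwise falls back to the storage directory.
>   public static File resolveCheckDir(File storageDir, File diskCheckDir) {
>     return (diskCheckDir != null) ? diskCheckDir : storageDir;
>   }
> }
> {code}
> StorageVolume.check could then pass resolveCheckDir(storageDir, diskCheckDir) 
> to DiskCheckUtil.checkReadWrite instead of mutating the field.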



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
