[ https://issues.apache.org/jira/browse/HDDS-9022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ethan Rose reassigned HDDS-9022:
--------------------------------

    Assignee: Ethan Rose

> DiskChecker incorrectly reporting errors
> ----------------------------------------
>
>                 Key: HDDS-9022
>                 URL: https://issues.apache.org/jira/browse/HDDS-9022
>             Project: Apache Ozone
>          Issue Type: Bug
>    Affects Versions: 1.3.0
>            Reporter: Mladjan Gadzic
>            Assignee: Ethan Rose
>            Priority: Major
>         Attachments: dn1.log
>
> During a load test of an AWS-based Ozone cluster, we see that datanodes are
> shutting down. We believe this is caused by the new DiskChecker incorrectly
> reporting errors.
> It happens because a StorageVolume subclass runs the check method
> without initializing the diskCheckDir field.
> Thanks [~xBis] for the research and notes on this!
> Diff showing the new log messages:
> {code:java}
> diff --git a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
> index b267b1d47..1cb0d0085 100644
> --- a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
> +++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
> @@ -37,6 +37,8 @@
>   * where the disk is mounted.
>   */
>  public final class DiskCheckUtil {
> +  private static final Logger LOG =
> +      LoggerFactory.getLogger(DiskCheckUtil.class);
>    private DiskCheckUtil() { }
>
>    // For testing purposes, an alternate check implementation can be provided
> @@ -63,6 +65,20 @@ public static boolean checkPermissions(File storageDir) {
>
>    public static boolean checkReadWrite(File storageDir, File testFileDir,
>        int numBytesToWrite) {
> +    if (storageDir == null) {
> +      LOG.info("###storageDir is null. Printing stack trace: {}",
> +          Arrays.toString(new NullPointerException().getStackTrace()));
> +    } else {
> +      LOG.info("###storageDir path={}", storageDir.getPath());
> +    }
> +
> +    if (testFileDir == null) {
> +      LOG.info("###testFileDir is null. Printing stack trace: {}",
> +          Arrays.toString(new NullPointerException().getStackTrace()));
> +    } else {
> +      LOG.info("###testFileDir path={}", testFileDir.getPath());
> +    }
> +
>      return impl.checkReadWrite(storageDir, testFileDir, numBytesToWrite);
>    }
> {code}
>
> Stack trace (note the lines marked with "###"):
> {code:java}
> 2023-07-15 10:07:38,006 [Periodic HDDS volume checker] INFO org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker: Scheduling a check for /hadoop/ozone/data/disk1/datanode/data/hdds
> 2023-07-15 10:07:38,007 [Periodic HDDS volume checker] INFO org.apache.hadoop.ozone.container.common.volume.StorageVolumeChecker: Scheduled health check for volume /hadoop/ozone/data/disk1/datanode/data/hdds
> 2023-07-15 10:07:38,007 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###storageDir path=/hadoop/ozone/data/disk1/datanode/data/hdds
> 2023-07-15 10:07:38,007 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###testFileDir path=/hadoop/ozone/data/disk1/datanode/data/hdds/CID-2eb5a782-379b-46c7-8bd1-8b19043c1a6e/tmp/disk-check
> 2023-07-15 10:07:38,010 [Periodic HDDS volume checker] INFO org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker: Scheduling a check for /hadoop/ozone/data/disk1/datanode
> 2023-07-15 10:07:38,010 [Periodic HDDS volume checker] INFO org.apache.hadoop.ozone.container.common.volume.StorageVolumeChecker: Scheduled health check for volume /hadoop/ozone/data/disk1/datanode
> 2023-07-15 10:07:38,010 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###storageDir
> path=/hadoop/ozone/data/disk1/datanode
> 2023-07-15 10:07:38,011 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###testFileDir is null. Printing stack trace:
> [org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil.checkReadWrite(DiskCheckUtil.java:76),
>  org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:629),
>  org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:68),
>  org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker.lambda$schedule$0(ThrottledAsyncChecker.java:143),
>  com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131),
>  com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75),
>  com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82),
>  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149),
>  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624),
>  java.lang.Thread.run(Thread.java:750)]
> 2023-07-15 10:07:38,011 [DataNode DiskChecker thread 0] ERROR org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: Volume /hadoop/ozone/data/disk1/datanode failed health check. Could not find file disk-check-c8f5b40b-0cf6-420e-bbc3-9fb59def11eb for volume check.
> java.io.FileNotFoundException: disk-check-c8f5b40b-0cf6-420e-bbc3-9fb59def11eb (No such file or directory)
>     at java.io.FileOutputStream.open0(Native Method)
>     at java.io.FileOutputStream.open(FileOutputStream.java:270)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
>     at org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil$DiskChecksImpl.checkReadWrite(DiskCheckUtil.java:153)
>     at org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil.checkReadWrite(DiskCheckUtil.java:82)
>     at org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:629)
>     at org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:68)
>     at org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker.lambda$schedule$0(ThrottledAsyncChecker.java:143)
>     at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
>     at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75)
>     at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {code}
>
> The bug could not be reproduced on a Docker cluster, only on a real cluster. Steps to reproduce:
> # Add/modify the following properties:
> {noformat}
> OZONE-SITE.XML_hdds.datanode.periodic.disk.check.interval.minutes=1
> OZONE-SITE.XML_hdds.datanode.disk.check.min.gap=0s{noformat}
> # Wait for a volume to fail its health check
> Check the attached log file for more context.
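As a side note on why the failing probe file in the stack trace appears with no directory component: `java.io.File` treats a null parent as "no parent", so with an uninitialized `diskCheckDir` the probe name becomes a bare relative path. A minimal sketch of that core `java.io` behavior (the file name is illustrative, not taken from the Ozone code):

```java
import java.io.File;

// Illustrative only: shows why a null test-file directory makes the
// disk-check probe resolve to a bare relative name like "disk-check-<uuid>".
public class NullParentDemo {
    public static void main(String[] args) {
        File parent = null; // stands in for the uninitialized diskCheckDir
        // With a null parent, the child name becomes a plain relative path,
        // resolved against the process working directory at open time
        // rather than against the volume being checked.
        File probe = new File(parent, "disk-check-example");
        System.out.println(probe.getPath()); // prints "disk-check-example"
    }
}
```

This matches the `FileNotFoundException` above, where the probe file name carries no volume path at all.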
> Suggested workaround diff:
> {code:java}
> diff --git a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
> index 95d1b2c2d..634b15a8e 100644
> --- a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
> +++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
> @@ -624,6 +624,10 @@ public synchronized VolumeCheckResult check(@Nullable Boolean unused)
>        return VolumeCheckResult.HEALTHY;
>      }
>
> +    if (diskCheckDir == null) {
> +      diskCheckDir = storageDir;
> +    }
> +
>      // Since IO errors may be intermittent, volume remains healthy until the
>      // threshold of failures is crossed.
>      boolean diskChecksPassed = DiskCheckUtil.checkReadWrite(storageDir,
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org
For additional commands, e-mail: issues-h...@ozone.apache.org
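The suggested workaround (fall back to storageDir when diskCheckDir was never initialized) can be sketched as a self-contained example. The class below is a simplified stand-in, not the real StorageVolume:

```java
import java.io.File;

// Simplified stand-in for the suggested workaround in StorageVolume.check():
// if diskCheckDir was never initialized by a subclass, fall back to
// storageDir so the read/write probe never receives a null directory.
// Class and method names here are illustrative, not the real Ozone code.
public class DiskCheckDirFallback {
    private final File storageDir;
    private File diskCheckDir; // a subclass may leave this null

    public DiskCheckDirFallback(File storageDir, File diskCheckDir) {
        this.storageDir = storageDir;
        this.diskCheckDir = diskCheckDir;
    }

    // Mirrors the null guard added before DiskCheckUtil.checkReadWrite(...).
    public File resolveCheckDir() {
        if (diskCheckDir == null) {
            diskCheckDir = storageDir;
        }
        return diskCheckDir;
    }

    public static void main(String[] args) {
        DiskCheckDirFallback v = new DiskCheckDirFallback(
            new File("/hadoop/ozone/data/disk1/datanode"), null);
        // The probe directory defaults to the volume root instead of null.
        System.out.println(v.resolveCheckDir().getPath());
    }
}
```

Note this only masks the symptom; the underlying fix would be to have the subclass initialize diskCheckDir before the check runs.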