[
https://issues.apache.org/jira/browse/HDDS-9022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mladjan Gadzic updated HDDS-9022:
---------------------------------
Description:
During a load test of an AWS-based Ozone cluster, we observed datanodes shutting down. We believe this is caused by the new DiskChecker incorrectly reporting errors.
It happens because StorageVolume runs its check method without initializing the diskCheckDir field.
Thanks [~xBis] for the research and notes on this!
Diff showing new log messages:
{code:java}
diff --git a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
index b267b1d47..1cb0d0085 100644
--- a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
+++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
@@ -37,6 +37,8 @@
* where the disk is mounted.
*/
public final class DiskCheckUtil {
+ private static final Logger LOG =
+ LoggerFactory.getLogger(DiskCheckUtil.class);
private DiskCheckUtil() { }
// For testing purposes, an alternate check implementation can be provided
@@ -63,6 +65,20 @@ public static boolean checkPermissions(File storageDir) {
public static boolean checkReadWrite(File storageDir, File testFileDir,
int numBytesToWrite) {
+ if (storageDir == null) {
+ LOG.info("###storageDir is null. Printing stack trace: {}",
+ Arrays.toString(new NullPointerException().getStackTrace()));
+ } else {
+ LOG.info("###storageDir path={}", storageDir.getPath());
+ }
+
+ if (testFileDir == null) {
+ LOG.info("###testFileDir is null. Printing stack trace: {}",
+ Arrays.toString(new NullPointerException().getStackTrace()));
+ } else {
+ LOG.info("###testFileDir path={}", testFileDir.getPath());
+ }
+
return impl.checkReadWrite(storageDir, testFileDir, numBytesToWrite);
}
{code}
Stack trace (note the lines marked with "###"):
{code:java}
2023-07-15 10:07:38,006 [Periodic HDDS volume checker] INFO
org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker:
Scheduling a check for /hadoop/ozone/data/disk1/datanode/data/hdds
2023-07-15 10:07:38,007 [Periodic HDDS volume checker] INFO
org.apache.hadoop.ozone.container.common.volume.StorageVolumeChecker: Scheduled
health check for volume /hadoop/ozone/data/disk1/datanode/data/hdds
2023-07-15 10:07:38,007 [DataNode DiskChecker thread 0] INFO
org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###storageDir
path=/hadoop/ozone/data/disk1/datanode/data/hdds
2023-07-15 10:07:38,007 [DataNode DiskChecker thread 0] INFO
org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###testFileDir
path=/hadoop/ozone/data/disk1/datanode/data/hdds/CID-2eb5a782-379b-46c7-8bd1-8b19043c1a6e/tmp/disk-check
2023-07-15 10:07:38,010 [Periodic HDDS volume checker] INFO
org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker:
Scheduling a check for /hadoop/ozone/data/disk1/datanode
2023-07-15 10:07:38,010 [Periodic HDDS volume checker] INFO
org.apache.hadoop.ozone.container.common.volume.StorageVolumeChecker: Scheduled
health check for volume /hadoop/ozone/data/disk1/datanode
2023-07-15 10:07:38,010 [DataNode DiskChecker thread 0] INFO
org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###storageDir
path=/hadoop/ozone/data/disk1/datanode
2023-07-15 10:07:38,011 [DataNode DiskChecker thread 0] INFO
org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###testFileDir is
null. Printing stack trace:
[org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil.checkReadWrite(DiskCheckUtil.java:76),
org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:629),
org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:68),
org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker.lambda$schedule$0(ThrottledAsyncChecker.java:143),
com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131),
com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75),
com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82),
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149),
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624),
java.lang.Thread.run(Thread.java:750)]
2023-07-15 10:07:38,011 [DataNode DiskChecker thread 0] ERROR
org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: Volume
/hadoop/ozone/data/disk1/datanode failed health check. Could not find file
disk-check-c8f5b40b-0cf6-420e-bbc3-9fb59def11eb for volume check.
java.io.FileNotFoundException: disk-check-c8f5b40b-0cf6-420e-bbc3-9fb59def11eb
(No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
    at org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil$DiskChecksImpl.checkReadWrite(DiskCheckUtil.java:153)
    at org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil.checkReadWrite(DiskCheckUtil.java:82)
    at org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:629)
    at org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:68)
    at org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker.lambda$schedule$0(ThrottledAsyncChecker.java:143)
    at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
    at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75)
    at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
{code}
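The bare filename in the FileNotFoundException is consistent with a null testFileDir: java.io.File treats a null parent the same as the single-argument constructor, so the check-file path resolves relative to the process working directory instead of under the volume. A minimal sketch of this suspected mechanism (the file name below is illustrative, not the actual generated one):
{code:java}
import java.io.File;

public class NullParentSketch {
  public static void main(String[] args) {
    // With a null parent File, java.io.File behaves like new File(child):
    // the resulting path is relative, not anchored under the volume root.
    File testFileDir = null; // mirrors the uninitialized diskCheckDir
    File checkFile = new File(testFileDir, "disk-check-example");

    System.out.println(checkFile.getPath());     // "disk-check-example"
    System.out.println(checkFile.isAbsolute());  // false
  }
}
{code}
If the datanode's working directory is not writable (or is later cleaned up), writing and re-reading that relative file fails, which would explain the spurious health-check failure.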
The bug could not be reproduced on a Docker cluster, only on a real cluster. Steps to reproduce:
# Add/modify the following properties:
{code:java}
OZONE-SITE.XML_hdds.datanode.periodic.disk.check.interval.minutes=1
OZONE-SITE.XML_hdds.datanode.disk.check.min.gap=0s {code}
# Wait for a volume to fail the health check.
See the attached log file for more context.
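The OZONE-SITE.XML_ prefix above is the docker-config convention for generating ozone-site.xml; on a real cluster the equivalent entries would presumably be added to ozone-site.xml directly, along these lines:
{code:xml}
<property>
  <name>hdds.datanode.periodic.disk.check.interval.minutes</name>
  <value>1</value>
</property>
<property>
  <name>hdds.datanode.disk.check.min.gap</name>
  <value>0s</value>
</property>
{code}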
Suggested workaround diff:
{code:java}
diff --git a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
index 95d1b2c2d..634b15a8e 100644
--- a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
+++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
@@ -624,6 +624,10 @@ public synchronized VolumeCheckResult check(@Nullable Boolean unused)
return VolumeCheckResult.HEALTHY;
}
+ if (diskCheckDir == null) {
+ diskCheckDir = storageDir;
+ }
+
// Since IO errors may be intermittent, volume remains healthy until the
// threshold of failures is crossed.
boolean diskChecksPassed = DiskCheckUtil.checkReadWrite(storageDir,
{code}
> DiskChecker incorrectly reporting errors
> ----------------------------------------
>
> Key: HDDS-9022
> URL: https://issues.apache.org/jira/browse/HDDS-9022
> Project: Apache Ozone
> Issue Type: Bug
> Affects Versions: 1.3.0
> Reporter: Mladjan Gadzic
> Priority: Major
> Attachments: dn1.log
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)