xBis7 commented on PR #5086:
URL: https://github.com/apache/ozone/pull/5086#issuecomment-1641921773
Hey @errose28, this patch addresses another issue, but not the one we are
experiencing. Let me provide some more info. I have tested it both on Docker
and on our cluster, applying the following diff on top of your patch:
```
diff --git a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java
index e5f0046c0..fb12d15e4 100644
--- a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java
+++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java
@@ -236,7 +236,7 @@ public void setBlockDeletionLimit(int limit) {
   }

   @Config(key = "periodic.disk.check.interval.minutes",
-      defaultValue = "60",
+      defaultValue = "2",
       type = ConfigType.LONG,
       tags = { DATANODE },
       description = "Periodic disk check run interval in minutes."
@@ -314,7 +314,7 @@ public void setBlockDeletionLimit(int limit) {
       DISK_CHECK_FILE_SIZE_DEFAULT;

   @Config(key = "disk.check.min.gap",
-      defaultValue = "10m",
+      defaultValue = "1m",
       type = ConfigType.TIME,
       tags = { DATANODE },
       description = "The minimum gap between two successive checks of the same"
diff --git a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
index b567841c0..81f4a877d 100644
--- a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
+++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
@@ -132,6 +132,13 @@ public boolean checkPermissions(File storageDir) {
   public boolean checkReadWrite(File storageDir,
       File testFileDir, int numBytesToWrite) {
     File testFile = new File(testFileDir, "disk-check-" + UUID.randomUUID());
+
+    String storageDirPath = storageDir == null ? "null" : storageDir.getAbsolutePath();
+    String testFileDirPath = testFileDir == null ? "null" : testFileDir.getAbsolutePath();
+    LOG.info("\nxbis: storageDir: " + storageDirPath +
+        "\nxbis: testFileDir: " + testFileDirPath +
+        "\nxbis: testFile: " + testFile.getAbsolutePath());
+
     byte[] writtenBytes = new byte[numBytesToWrite];
     RANDOM.nextBytes(writtenBytes);
     try (FileOutputStream fos = new FileOutputStream(testFile)) {
```
* Docker logs
```
2023-07-19 10:54:02,032 [Periodic HDDS volume checker] INFO volume.ThrottledAsyncChecker: Scheduling a check for /data/hdds/hdds
2023-07-19 10:54:02,035 [Periodic HDDS volume checker] INFO volume.StorageVolumeChecker: Scheduled health check for volume /data/hdds/hdds
2023-07-19 10:54:02,035 [DataNode DiskChecker thread 1] INFO utils.DiskCheckUtil:
xbis: storageDir: /data/hdds/hdds
xbis: testFileDir: /data/hdds/hdds/CID-3e323276-e1b0-4069-9fa1-1d93ccab4ff5/tmp/disk-check
xbis: testFile: /data/hdds/hdds/CID-3e323276-e1b0-4069-9fa1-1d93ccab4ff5/tmp/disk-check/disk-check-8e3d58b6-5b62-49de-825f-36e2f85d0d3a
2023-07-19 10:54:02,042 [Periodic HDDS volume checker] INFO volume.ThrottledAsyncChecker: Scheduling a check for /data/metadata/ratis
2023-07-19 10:54:02,042 [Periodic HDDS volume checker] INFO volume.StorageVolumeChecker: Scheduled health check for volume /data/metadata/ratis
2023-07-19 10:54:02,043 [DataNode DiskChecker thread 1] INFO utils.DiskCheckUtil:
xbis: storageDir: /data/metadata/ratis
xbis: testFileDir: null
xbis: testFile: /opt/hadoop/disk-check-ec8c7aba-69ec-4f19-bf59-c5b31f843ef1
```
* Our cluster's logs
```
2023-07-19 11:32:33,529 [EndpointStateMachine task thread for ip-10-0-88-113.eu-west-1.compute.internal/10.0.88.113:9861 - 0 ] INFO org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker: Scheduling a check for /hadoop/ozone/data/disk1/datanode/data/hdds
2023-07-19 11:32:33,536 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil:
xbis: storageDir: /hadoop/ozone/data/disk1/datanode/data/hdds
xbis: testFileDir: /hadoop/ozone/data/disk1/datanode/data/hdds/CID-e79cbfc1-02dd-4721-a796-e6790132940f/tmp/disk-check
xbis: testFile: /hadoop/ozone/data/disk1/datanode/data/hdds/CID-e79cbfc1-02dd-4721-a796-e6790132940f/tmp/disk-check/disk-check-fc315b6d-0108-4d02-b443-571083808ad4
2023-07-19 11:32:33,539 [EndpointStateMachine task thread for ip-10-0-88-113.eu-west-1.compute.internal/10.0.88.113:9861 - 0 ] INFO org.apache.hadoop.ozone.container.common.volume.StorageVolumeChecker: Scheduled health check for volume /hadoop/ozone/data/disk1/datanode/data/hdds
2023-07-19 11:32:33,543 [EndpointStateMachine task thread for ip-10-0-88-113.eu-west-1.compute.internal/10.0.88.113:9861 - 0 ] INFO org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker: Scheduling a check for /hadoop/ozone/data/disk1/datanode
2023-07-19 11:32:33,543 [EndpointStateMachine task thread for ip-10-0-88-113.eu-west-1.compute.internal/10.0.88.113:9861 - 0 ] INFO org.apache.hadoop.ozone.container.common.volume.StorageVolumeChecker: Scheduled health check for volume /hadoop/ozone/data/disk1/datanode
2023-07-19 11:32:33,543 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil:
xbis: storageDir: /hadoop/ozone/data/disk1/datanode
xbis: testFileDir: null
xbis: testFile: /home/ozone/.ansible/tmp/ansible-moduletmp-1689766348.51-vg_D0x/disk-check-a767a491-e84f-4d5b-b04e-fe539dd05c83
2023-07-19 11:32:33,544 [DataNode DiskChecker thread 0] ERROR org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: Volume /hadoop/ozone/data/disk1/datanode failed health check. Could not find file /home/ozone/.ansible/tmp/ansible-moduletmp-1689766348.51-vg_D0x/disk-check-a767a491-e84f-4d5b-b04e-fe539dd05c83 for volume check.
java.io.FileNotFoundException: disk-check-a767a491-e84f-4d5b-b04e-fe539dd05c83 (No such file or directory)
```
`testFileDir` gets initialized for an HddsVolume and a DbVolume, but not for a
StorageVolume. For a StorageVolume its value is null, which results in the
`testFile` being created in the current working directory of the Java
application. For Docker that's `/opt/hadoop/`, which happens to be writable,
but on our cluster the current dir is not writable. That might be a common
case for others as well. IMO, for a StorageVolume we should either provide a
specific path on disk, create a separate tmp dir, or set `testFileDir` to
point to `storageDir`.
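To illustrate the last option, here is a minimal sketch of a null-safe fallback (the class and method names are mine for illustration, not part of the actual patch or the Ozone codebase):

```java
import java.io.File;

// Hypothetical sketch: if testFileDir was never initialized (null, as happens
// for a plain StorageVolume), fall back to the volume's own storageDir instead
// of letting `new File(null, name)` resolve against the process's working dir.
public class DiskCheckFallbackSketch {

  static File resolveTestFileDir(File storageDir, File testFileDir) {
    return testFileDir != null ? testFileDir : storageDir;
  }

  public static void main(String[] args) {
    File storage = new File("/data/metadata/ratis");
    // Null testFileDir now resolves inside the volume, not the working dir.
    System.out.println(resolveTestFileDir(storage, null));             // /data/metadata/ratis
    System.out.println(resolveTestFileDir(storage, new File("/tmp"))); // /tmp
  }
}
```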
`VersionEndpointTask` calls `checkVolumeSet`
[here](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/states/endpoint/VersionEndpointTask.java#L84-L88)
only for DbVolumes and HddsVolumes, not for StorageVolumes. That call ends up
in `checkVolume`, which initializes the tmp dirs
[here](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/StorageVolumeUtil.java#L273).
Since a StorageVolume never goes through that path, its `testFileDir` never
gets a value.
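And as a sketch of the separate-tmp-dir option: the idea would be to lazily create a per-volume `tmp/disk-check` dir for StorageVolumes too, mirroring the layout `checkVolume` sets up for HddsVolumes/DbVolumes (names below are illustrative only):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical sketch: ensure <storageDir>/tmp/disk-check exists and return
// it, so the disk check never falls back to the JVM's working directory.
// Neither the class nor the helper exists in Ozone; this only shows the shape.
public class StorageVolumeTmpDirSketch {

  static File ensureDiskCheckDir(File storageDir) {
    File diskCheckDir = new File(storageDir, "tmp" + File.separator + "disk-check");
    if (!diskCheckDir.isDirectory() && !diskCheckDir.mkdirs()) {
      throw new IllegalStateException("Cannot create " + diskCheckDir);
    }
    return diskCheckDir;
  }

  public static void main(String[] args) throws IOException {
    File storage = Files.createTempDirectory("storage-volume").toFile();
    File dir = ensureDiskCheckDir(storage);
    System.out.println(dir.isDirectory() + " " + dir.getName()); // true disk-check
  }
}
```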
BTW, this is an AWS cluster that has been set up with Ansible.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]