xBis7 commented on PR #5086:
URL: https://github.com/apache/ozone/pull/5086#issuecomment-1641921773

   Hey @errose28, this patch addresses another issue, but not the one we are 
experiencing. Let me provide some more info. I have tested it both in docker 
and on our cluster, applying the following diff on top of your patch:
   
   ```
   diff --git a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java
   index e5f0046c0..fb12d15e4 100644
   --- a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java
   +++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java
   @@ -236,7 +236,7 @@ public void setBlockDeletionLimit(int limit) {
      }
    
      @Config(key = "periodic.disk.check.interval.minutes",
   -      defaultValue = "60",
   +      defaultValue = "2",
          type = ConfigType.LONG,
          tags = { DATANODE },
          description = "Periodic disk check run interval in minutes."
   @@ -314,7 +314,7 @@ public void setBlockDeletionLimit(int limit) {
          DISK_CHECK_FILE_SIZE_DEFAULT;
    
      @Config(key = "disk.check.min.gap",
   -      defaultValue = "10m",
   +      defaultValue = "1m",
          type = ConfigType.TIME,
          tags = { DATANODE },
          description = "The minimum gap between two successive checks of the same"
   diff --git a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
   index b567841c0..81f4a877d 100644
   --- a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
   +++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
   @@ -132,6 +132,13 @@ public boolean checkPermissions(File storageDir) {
        public boolean checkReadWrite(File storageDir,
            File testFileDir, int numBytesToWrite) {
          File testFile = new File(testFileDir, "disk-check-" + UUID.randomUUID());
   +
   +      String storageDirPath = storageDir == null ? "null" : storageDir.getAbsolutePath();
   +      String testFileDirPath = testFileDir == null ? "null" : testFileDir.getAbsolutePath();
   +      LOG.info("\nxbis: storageDir: " + storageDirPath +
   +          "\nxbis: testFileDir: " + testFileDirPath +
   +          "\nxbis: testFile: " + testFile.getAbsolutePath());
   +
          byte[] writtenBytes = new byte[numBytesToWrite];
          RANDOM.nextBytes(writtenBytes);
          try (FileOutputStream fos = new FileOutputStream(testFile)) {
   ```
   
   * Docker logs
     ```
     2023-07-19 10:54:02,032 [Periodic HDDS volume checker] INFO volume.ThrottledAsyncChecker: Scheduling a check for /data/hdds/hdds
     2023-07-19 10:54:02,035 [Periodic HDDS volume checker] INFO volume.StorageVolumeChecker: Scheduled health check for volume /data/hdds/hdds
     2023-07-19 10:54:02,035 [DataNode DiskChecker thread 1] INFO utils.DiskCheckUtil: 
     xbis: storageDir: /data/hdds/hdds
     xbis: testFileDir: /data/hdds/hdds/CID-3e323276-e1b0-4069-9fa1-1d93ccab4ff5/tmp/disk-check
     xbis: testFile: /data/hdds/hdds/CID-3e323276-e1b0-4069-9fa1-1d93ccab4ff5/tmp/disk-check/disk-check-8e3d58b6-5b62-49de-825f-36e2f85d0d3a
     2023-07-19 10:54:02,042 [Periodic HDDS volume checker] INFO volume.ThrottledAsyncChecker: Scheduling a check for /data/metadata/ratis
     2023-07-19 10:54:02,042 [Periodic HDDS volume checker] INFO volume.StorageVolumeChecker: Scheduled health check for volume /data/metadata/ratis
     2023-07-19 10:54:02,043 [DataNode DiskChecker thread 1] INFO utils.DiskCheckUtil: 
     xbis: storageDir: /data/metadata/ratis
     xbis: testFileDir: null
     xbis: testFile: /opt/hadoop/disk-check-ec8c7aba-69ec-4f19-bf59-c5b31f843ef1
     ```
   
   * Our cluster's logs
   ```
   2023-07-19 11:32:33,529 [EndpointStateMachine task thread for ip-10-0-88-113.eu-west-1.compute.internal/10.0.88.113:9861 - 0 ] INFO org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker: Scheduling a check for /hadoop/ozone/data/disk1/datanode/data/hdds
   2023-07-19 11:32:33,536 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: 
   xbis: storageDir: /hadoop/ozone/data/disk1/datanode/data/hdds
   xbis: testFileDir: /hadoop/ozone/data/disk1/datanode/data/hdds/CID-e79cbfc1-02dd-4721-a796-e6790132940f/tmp/disk-check
   xbis: testFile: /hadoop/ozone/data/disk1/datanode/data/hdds/CID-e79cbfc1-02dd-4721-a796-e6790132940f/tmp/disk-check/disk-check-fc315b6d-0108-4d02-b443-571083808ad4
   2023-07-19 11:32:33,539 [EndpointStateMachine task thread for ip-10-0-88-113.eu-west-1.compute.internal/10.0.88.113:9861 - 0 ] INFO org.apache.hadoop.ozone.container.common.volume.StorageVolumeChecker: Scheduled health check for volume /hadoop/ozone/data/disk1/datanode/data/hdds
   2023-07-19 11:32:33,543 [EndpointStateMachine task thread for ip-10-0-88-113.eu-west-1.compute.internal/10.0.88.113:9861 - 0 ] INFO org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker: Scheduling a check for /hadoop/ozone/data/disk1/datanode
   2023-07-19 11:32:33,543 [EndpointStateMachine task thread for ip-10-0-88-113.eu-west-1.compute.internal/10.0.88.113:9861 - 0 ] INFO org.apache.hadoop.ozone.container.common.volume.StorageVolumeChecker: Scheduled health check for volume /hadoop/ozone/data/disk1/datanode
   2023-07-19 11:32:33,543 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: 
   xbis: storageDir: /hadoop/ozone/data/disk1/datanode
   xbis: testFileDir: null
   xbis: testFile: /home/ozone/.ansible/tmp/ansible-moduletmp-1689766348.51-vg_D0x/disk-check-a767a491-e84f-4d5b-b04e-fe539dd05c83
   2023-07-19 11:32:33,544 [DataNode DiskChecker thread 0] ERROR org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: Volume /hadoop/ozone/data/disk1/datanode failed health check. Could not find file /home/ozone/.ansible/tmp/ansible-moduletmp-1689766348.51-vg_D0x/disk-check-a767a491-e84f-4d5b-b04e-fe539dd05c83 for volume check.
   java.io.FileNotFoundException: disk-check-a767a491-e84f-4d5b-b04e-fe539dd05c83 (No such file or directory)
   ```
   
   `testFileDir` gets initialized for an HddsVolume and a DbVolume, but not for a 
StorageVolume. For a StorageVolume its value is null, and that results in the 
`testFile` being created in the current working directory of the Java 
application. For docker that's `/opt/hadoop/`, which happens to be writable, but 
on our cluster the current dir is not writable. That might be a common case for 
others as well. IMO, for a StorageVolume we should either provide a specific 
path on disk, create a separate tmp dir, or set `testFileDir` to point to 
`storageDir`. 
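   The null-parent behavior can be reproduced in isolation: `File(File parent, String child)` with a null parent behaves like `new File(child)`, so the path is relative and resolves against the JVM's working directory. A minimal standalone sketch (class name and file name are mine, not from the patch):

   ```java
   import java.io.File;

   public class NullParentDemo {
       public static void main(String[] args) {
           // Same shape as checkReadWrite: parent dir is null, so the child
           // becomes a relative path resolved against the current working
           // directory ("user.dir"), e.g. /opt/hadoop in the docker logs.
           File testFile = new File((File) null, "disk-check-example");
           System.out.println("absolute? " + testFile.isAbsolute());
           System.out.println("resolves to: " + testFile.getAbsolutePath());
       }
   }
   ```

   This is why the docker run "works" (the working directory happens to be writable) while our cluster fails with `FileNotFoundException`.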
   
   `VersionEndpointTask` calls `checkVolumeSet` 
[here](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/states/endpoint/VersionEndpointTask.java#L84-L88) 
only for DbVolumes and HddsVolumes, not for StorageVolumes. That call ends up in 
`checkVolume`, which initializes the tmp dirs 
[here](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/StorageVolumeUtil.java#L273). 
Since it's never invoked for a StorageVolume, `testFileDir` has no value there.
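   The last of the options above (falling back to `storageDir` when the tmp dir was never set up) could look roughly like this; `testFileFor` is a hypothetical helper for illustration, not code from the patch:

   ```java
   import java.io.File;
   import java.util.UUID;

   public class DiskCheckFallback {
       // Sketch: pick a parent dir that is always on the volume being checked,
       // instead of silently resolving against the working directory.
       static File testFileFor(File storageDir, File testFileDir) {
           File dir = (testFileDir != null) ? testFileDir : storageDir;
           return new File(dir, "disk-check-" + UUID.randomUUID());
       }

       public static void main(String[] args) {
           File storage = new File("/data/hdds/hdds");
           // With a null testFileDir the check file now lands on the volume.
           System.out.println(testFileFor(storage, null).getParent());
       }
   }
   ```

   That keeps the read/write probe on the disk under test even when `checkVolumeSet` was never called for the volume.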
   
   BTW, this is an AWS cluster that has been set up with Ansible.
   
   
-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]