[GitHub] [ozone] errose28 commented on a diff in pull request #4867: HDDS-8782. Improve Volume Scanner Health checks.

via GitHub Tue, 27 Jun 2023 11:58:23 -0700


errose28 commented on code in PR #4867:
URL: https://github.com/apache/ozone/pull/4867#discussion_r1244214064



##########
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/HddsVolume.java:
##########
@@ -310,20 +310,32 @@ public void cleanDeletedContainerDir() {
   }
 
   @Override
-  public VolumeCheckResult check(@Nullable Boolean unused) throws Exception {
+  public synchronized VolumeCheckResult check(@Nullable Boolean unused)
+      throws Exception {
     VolumeCheckResult result = super.check(unused);
-    if (!isDbLoaded()) {
-      return result;
-    }
+
     DatanodeConfiguration df = 
getConf().getObject(DatanodeConfiguration.class);
     if (result != VolumeCheckResult.HEALTHY ||
-        !df.getContainerSchemaV3Enabled() || !df.autoCompactionSmallSstFile()) 
{
+        !df.getContainerSchemaV3Enabled() || !isDbLoaded()) {
       return result;
     }
-    // Calculate number of files per level and size per level
-    RawDB rawDB = DatanodeStoreCache.getInstance().getDB(
-        new File(dbParentDir, CONTAINER_DB_NAME).getAbsolutePath(), getConf());
-    rawDB.getStore().compactionIfNeeded();
+
+    // Check that per-volume RocksDB is present.
+    File dbFile = new File(dbParentDir, CONTAINER_DB_NAME);
+    if (!dbFile.exists() || !dbFile.canRead()) {
+      LOG.warn("Volume {} failed health check. Could not access RocksDB at " +
+          "{}", getStorageDir(), dbFile);
+      return VolumeCheckResult.FAILED;
+    }
+
+    // TODO HDDS-8784 trigger compaction outside of volume check. Then the
+    //  exception can be removed.
+    if (df.autoCompactionSmallSstFile()) {
+      // Calculate number of files per level and size per level
+      RawDB rawDB = DatanodeStoreCache.getInstance().getDB(

Review Comment:
   Yes, unfortunately the disk check and compaction are coupled together right 
now.  have filed HDDS-8784 to separate them, but for now this is a concern. The 
compaction is only triggered when sst files below the size threshold are found, 
so this shouldn't trigger unnecessary compaction, but it will mean we are 
iterating all sst file metadata more frequently.
   
   On the flip side, 15 min disk check gap with 3 IO tests means it could take 
up to 45 minutes to detect a volume failure if the directory checks keep 
passing due to cached info. I think 5 minutes would be a good gap if the 
compaction check were not here, but for what we have currently maybe we split 
the difference with a 10 minute disk check gap?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [ozone] errose28 commented on a diff in pull request #4867: HDDS-8782. Improve Volume Scanner Health checks.

Reply via email to