errose28 commented on code in PR #4867:
URL: https://github.com/apache/ozone/pull/4867#discussion_r1244214064
##########
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/HddsVolume.java:
##########
@@ -310,20 +310,32 @@ public void cleanDeletedContainerDir() {
}
@Override
- public VolumeCheckResult check(@Nullable Boolean unused) throws Exception {
+ public synchronized VolumeCheckResult check(@Nullable Boolean unused)
+ throws Exception {
VolumeCheckResult result = super.check(unused);
- if (!isDbLoaded()) {
- return result;
- }
+
DatanodeConfiguration df =
getConf().getObject(DatanodeConfiguration.class);
if (result != VolumeCheckResult.HEALTHY ||
- !df.getContainerSchemaV3Enabled() || !df.autoCompactionSmallSstFile())
{
+ !df.getContainerSchemaV3Enabled() || !isDbLoaded()) {
return result;
}
- // Calculate number of files per level and size per level
- RawDB rawDB = DatanodeStoreCache.getInstance().getDB(
- new File(dbParentDir, CONTAINER_DB_NAME).getAbsolutePath(), getConf());
- rawDB.getStore().compactionIfNeeded();
+
+ // Check that per-volume RocksDB is present.
+ File dbFile = new File(dbParentDir, CONTAINER_DB_NAME);
+ if (!dbFile.exists() || !dbFile.canRead()) {
+ LOG.warn("Volume {} failed health check. Could not access RocksDB at " +
+ "{}", getStorageDir(), dbFile);
+ return VolumeCheckResult.FAILED;
+ }
+
+ // TODO HDDS-8784 trigger compaction outside of volume check. Then the
+ // exception can be removed.
+ if (df.autoCompactionSmallSstFile()) {
+ // Calculate number of files per level and size per level
+ RawDB rawDB = DatanodeStoreCache.getInstance().getDB(
Review Comment:
Yes, unfortunately the disk check and compaction are coupled together right
now. have filed HDDS-8784 to separate them, but for now this is a concern. The
compaction is only triggered when sst files below the size threshold are found,
so this shouldn't trigger unnecessary compaction, but it will mean we are
iterating all sst file metadata more frequently.
On the flip side, 15 min disk check gap with 3 IO tests means it could take
up to 45 minutes to detect a volume failure if the directory checks keep
passing due to cached info. I think 5 minutes would be a good gap if the
compaction check were not here, but for what we have currently maybe we split
the difference with a 10 minute disk check gap?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]