Ashwani Raina has posted comments on this change. ( http://gerrit.cloudera.org:8080/24092 )
Change subject: [fs] add metrics for untracked orphaned blocks ...................................................................... Patch Set 5: (16 comments) http://gerrit.cloudera.org:8080/#/c/24092/5//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/24092/5//COMMIT_MSG@35 PS5, Line 35: severly > severely Done http://gerrit.cloudera.org:8080/#/c/24092/5//COMMIT_MSG@53 PS5, Line 53: backgrounnd > background Done http://gerrit.cloudera.org:8080/#/c/24092/5//COMMIT_MSG@57 PS5, Line 57: warning > For some reason, those are ERROR logs, not WARNING. Is it intentional? Changed back to WARNING as it was before. http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/fs/log_block_manager.cc File src/kudu/fs/log_block_manager.cc: http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/fs/log_block_manager.cc@a3343 PS5, Line 3343: > This comment is removed in PS5, but the corresponding metrics aren't presen There is no GC'ing logic present in PS5. So, any block for which deletion failed, it will be accounted for in OrphanBlockCleanupStats.blocks_failed which gets updated in orphaned_block_cleanup_failures metric. In other words, if deletion failed for a set of blocks due to any I/O error, metrics orphaned_block_cleanup_failures represents the number of such orphaned blocks that couldn't be cleaned up. An orphaned block is the one which doesn't have any entry in metadata and this is an short-lived intermediate stage of blocks (to be deleted) before these get deleted in normal workflow (without any GC in picture). Generally, you won't see any orphaned blocks in a kudu cluster which is running fine and doesn't face any issues. It is only when this intermediate stage gets converted into a permanent state of the orphaned blocks because regular clean-up workflow failed for them. This is represented by orphaned_block_cleanup_failures metric. GC'ing is an additional logic on top of this ever existing clean-up code (tracked under https://gerrit.cloudera.org/#/c/24125) orphaned_block_cleanup_failures represents the metric to track the number of long-living orphaned blocks which is what TODO is talking about. http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/fs/log_block_manager.cc@3350 PS5, Line 3350: // Blocks whose deletion metadata could not be written are permanently : // orphaned i.e., their data occupies space on disk. They will not be cleaned : // up automatically and may cause unclaimed disk space accumulation over time. > Shouldn't we track those using metrics as well? This is not different from what is being tracked under orphaned_block_cleanup_failures metrics. If you see the callers in the cleanup path context, Tablet::FlushMetadata makes use of deleted_block_ids (in the form of OrphanBlockCleanupStats::blocks_cleaned and OrphanBlockCleanupStats::blocks_failed). http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/fs/log_block_manager.cc@3357 PS5, Line 3357: int > Can we ever expect an overflow here with 32-bit integers? Changed to size_t, aligns with size() return type. http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/fs/log_block_manager.cc@3360 PS5, Line 3360: "Block $0 in container $1 could not have its deletion record " : "committed and is now an orphaned block that occupies " : "untracked disk space: $2. Run 'kudu fs check --repair' to " : "reclaim this disk space. > Could you make this message more concise, but still actionable? Done http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/fs/log_block_manager.cc@3366 PS5, Line 3366: // Always emit a summary so the total orphaned count is never suppressed, : // even when the per-block messages above are rate-limited. > What if there are thousands of containers there since that was a failed dat Made some modification to time-limit these messages as well and have a aggregate message outside the loop. http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/fs/log_block_manager.cc@3369 PS5, Line 3369: "$0 block(s) in container $1 could not have their deletion records " : "committed and are now orphaned blocks occupying unreclaimed disk space: " : "$2. Run 'kudu fs check --repair' to reclaim this disk space.", > nit: make this more concise, but still actionable. Done http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/fs/log_block_manager.cc@3417 PS5, Line 3417: KLOG_EVERY_N_SECS(INFO, 1) << Substitute("Block $0 is in a failed directory; not deleting", > Is this update also related to this patch? If not, avoid introducing such In any case, this needs to change because src/kudu/util/logging.h (included in this file) undefs LOG_EVERY_N and throws following compilation error: ++ GOOGLE_GLOG_COMPILE_ASSERT(false, "LOG_EVERY_N is deprecated. Please use KLOG_EVERY_N.") ++ Alternatively, I can change this to: ++ KLOG_EVERY_N(INFO, 10) << Substitute("Block $0 is in a failed directory; not deleting", block_id.ToString()); ++ However, clang throws 'readability-isolate-declaration' warning for this because KLOG_EVERY_N expand to KUDU_SOME_KIND_OF_LOG_EVERY_N macros which has two declaration in a statement i.e. ++ static int LOG_OCCURRENCES = 0, LOG_OCCURRENCES_MOD_N = 0; \ ++ I have changed to KLOG_EVERY_N, along with changes in src/kudu/util/logging.h to ensure single declaration in a statement and some other changes as well. Frankly, it is too many adjustments for just a single line change. All of this circus goes away if time-based logging is chosen. Not to mention the additional benefit of predictable log bounding. http://gerrit.cloudera.org:8080/#/c/24092/2/src/kudu/tablet/tablet_metadata-test.cc File src/kudu/tablet/tablet_metadata-test.cc: http://gerrit.cloudera.org:8080/#/c/24092/2/src/kudu/tablet/tablet_metadata-test.cc@454 PS2, Line 454: // therefore unaffected. Also, Tablet::Flush() still returns OK overall. > nit: if possible, please avoid updating non-related code since it introduce Not sure what non-related code you are talking about. This is a newly added unit test to verify the behavior by injecting EIO on every operation under this directory, a requirement that is related to this patch. http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/tablet/tablet_metadata.cc@562 PS5, Line 562: Scheduling $0 block(s) for orphaned cleanup > nit: ... scheduling $0 orphaned block(s) for cleanup ... ? Done http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/tablet/tablet_metadata.cc@575 PS5, Line 575: LOG_WITH_PREFIX(WARNING) << "Not deleting " << blocks.size() : << " orphaned block(s) from disk. Block deletion disabled via " : << "--enable_tablet_orphaned_block_deletion=false"; > nit: if updating this message, please switch to using Substitute() here as Done http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/tablet/tablet_metadata.cc@581 PS5, Line 581: OrphanBlockCleanupStats stats; > nit: move this closer to the place of the usage Done http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/tablet/tablet_metadata.cc@595 PS5, Line 595: ERROR > There used to be a log message of WARNING priority. Is the change to the E Changed to WARNING http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/tablet/tablet_metadata.cc@596 PS5, Line 596: "Failed to clean up $0/$1 orphaned block(s): $2. The failed block(s) are removed from " : "the superblock but their disk space may be unreclaimed. Run 'kudu fs check --repair' " : "to manually reclaim the disk space. > nit: please make it more concise while keeping it actionable, e.g. Done -- To view, visit http://gerrit.cloudera.org:8080/24092 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Id386d9fc8d0900839e229e66772f35299b3ef2e9 Gerrit-Change-Number: 24092 Gerrit-PatchSet: 5 Gerrit-Owner: Ashwani Raina <[email protected]> Gerrit-Reviewer: Abhishek Chennaka <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Ashwani Raina <[email protected]> Gerrit-Reviewer: Attila Bukor <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Marton Greber <[email protected]> Gerrit-Reviewer: Zoltan Martonka <[email protected]> Gerrit-Comment-Date: Thu, 09 Apr 2026 06:28:48 +0000 Gerrit-HasComments: Yes
