Ashwani Raina has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/24092 )

Change subject: [fs] add metrics for untracked orphaned blocks
......................................................................


Patch Set 5:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/24092/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/24092/5//COMMIT_MSG@35
PS5, Line 35: severly
> severely
Done


http://gerrit.cloudera.org:8080/#/c/24092/5//COMMIT_MSG@53
PS5, Line 53: backgrounnd
> background
Done


http://gerrit.cloudera.org:8080/#/c/24092/5//COMMIT_MSG@57
PS5, Line 57: warning
> For some reason, those are ERROR logs, not WARNING.  Is it intentional?
Changed back to WARNING as it was before.


http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/fs/log_block_manager.cc@a3343
PS5, Line 3343:
> This comment is removed in PS5, but the corresponding metrics aren't presen
There is no GC'ing logic present in PS5. So, any block for which deletion 
failed, it will be accounted for in OrphanBlockCleanupStats.blocks_failed which 
gets updated in orphaned_block_cleanup_failures metric.

In other words, if deletion failed for a set of blocks due to any I/O error, 
metrics orphaned_block_cleanup_failures represents the number of such orphaned 
blocks that couldn't be cleaned up.

An orphaned block is the one which doesn't have any entry in metadata and this 
is an short-lived intermediate stage of blocks (to be deleted) before these get 
deleted in normal workflow (without any GC in picture). Generally, you won't 
see any orphaned blocks in a kudu cluster which is running fine and doesn't 
face any issues. It is only when this intermediate stage gets converted into a 
permanent state of the orphaned blocks because regular clean-up workflow failed 
for them. This is represented by orphaned_block_cleanup_failures metric.

GC'ing is an additional logic on top of this ever existing clean-up code 
(tracked under https://gerrit.cloudera.org/#/c/24125)

orphaned_block_cleanup_failures represents the metric to track the number of 
long-living orphaned blocks which is what TODO is talking about.


http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/fs/log_block_manager.cc@3350
PS5, Line 3350:       // Blocks whose deletion metadata could not be written 
are permanently
              :       // orphaned i.e., their data occupies space on disk. They 
will not be cleaned
              :       // up automatically and may cause unclaimed disk space 
accumulation over time.
> Shouldn't we track those using metrics as well?
This is not different from what is being tracked under 
orphaned_block_cleanup_failures metrics. If you see the callers in the cleanup 
path context, Tablet::FlushMetadata makes use of deleted_block_ids (in the form 
of OrphanBlockCleanupStats::blocks_cleaned and 
OrphanBlockCleanupStats::blocks_failed).


http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/fs/log_block_manager.cc@3357
PS5, Line 3357: int
> Can we ever expect an overflow here with 32-bit integers?
Changed to size_t, aligns with size() return type.


http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/fs/log_block_manager.cc@3360
PS5, Line 3360:             "Block $0 in container $1 could not have its 
deletion record "
              :             "committed and is now an orphaned block that 
occupies "
              :             "untracked disk space: $2. Run 'kudu fs check 
--repair' to "
              :             "reclaim this disk space.
> Could you make this message more concise, but still actionable?
Done


http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/fs/log_block_manager.cc@3366
PS5, Line 3366:       // Always emit a summary so the total orphaned count is 
never suppressed,
              :       // even when the per-block messages above are 
rate-limited.
> What if there are thousands of containers there since that was a failed dat
Made some modification to time-limit these messages as well and have a 
aggregate message outside the loop.


http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/fs/log_block_manager.cc@3369
PS5, Line 3369:           "$0 block(s) in container $1 could not have their 
deletion records "
              :           "committed and are now orphaned blocks occupying 
unreclaimed disk space: "
              :           "$2. Run 'kudu fs check --repair' to reclaim this 
disk space.",
> nit: make this more concise, but still actionable.
Done


http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/fs/log_block_manager.cc@3417
PS5, Line 3417:       KLOG_EVERY_N_SECS(INFO, 1) << Substitute("Block $0 is in 
a failed directory; not deleting",
> Is this update also related to this patch?  If not, avoid introducing such
In any case, this needs to change because src/kudu/util/logging.h (included in 
this file) undefs LOG_EVERY_N and throws following  compilation error:
++
GOOGLE_GLOG_COMPILE_ASSERT(false, "LOG_EVERY_N is deprecated. Please use 
KLOG_EVERY_N.")
++

Alternatively, I can change this to:
++
KLOG_EVERY_N(INFO, 10) << Substitute("Block $0 is in a failed directory; not 
deleting", block_id.ToString());
++
However, clang throws 'readability-isolate-declaration' warning for this 
because KLOG_EVERY_N expand to KUDU_SOME_KIND_OF_LOG_EVERY_N macros which has 
two declaration in a statement i.e.
++
static int LOG_OCCURRENCES = 0, LOG_OCCURRENCES_MOD_N = 0; \
++

I have changed to KLOG_EVERY_N, along with changes in src/kudu/util/logging.h 
to ensure single declaration in a statement and some other changes as well. 
Frankly, it is too many adjustments for just a single line change.

All of this circus goes away if time-based logging is chosen. Not to mention 
the additional benefit of predictable log bounding.


http://gerrit.cloudera.org:8080/#/c/24092/2/src/kudu/tablet/tablet_metadata-test.cc
File src/kudu/tablet/tablet_metadata-test.cc:

http://gerrit.cloudera.org:8080/#/c/24092/2/src/kudu/tablet/tablet_metadata-test.cc@454
PS2, Line 454:     // therefore unaffected. Also, Tablet::Flush() still returns 
OK overall.
> nit: if possible, please avoid updating non-related code since it introduce
Not sure what non-related code you are talking about. This is a newly added 
unit test to verify the behavior by injecting EIO on every operation under this 
directory, a requirement that is related to this patch.


http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/tablet/tablet_metadata.cc
File src/kudu/tablet/tablet_metadata.cc:

http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/tablet/tablet_metadata.cc@562
PS5, Line 562: Scheduling $0 block(s) for orphaned cleanup
> nit: ... scheduling $0 orphaned block(s) for cleanup ... ?
Done


http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/tablet/tablet_metadata.cc@575
PS5, Line 575:     LOG_WITH_PREFIX(WARNING) << "Not deleting " << blocks.size()
             :         << " orphaned block(s) from disk. Block deletion 
disabled via "
             :         << "--enable_tablet_orphaned_block_deletion=false";
> nit: if updating this message, please switch to using Substitute() here as
Done


http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/tablet/tablet_metadata.cc@581
PS5, Line 581:   OrphanBlockCleanupStats stats;
> nit: move this closer to the place of the usage
Done


http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/tablet/tablet_metadata.cc@595
PS5, Line 595: ERROR
> There used to be a log message of WARNING priority.  Is the change to the E
Changed to WARNING


http://gerrit.cloudera.org:8080/#/c/24092/5/src/kudu/tablet/tablet_metadata.cc@596
PS5, Line 596:         "Failed to clean up $0/$1 orphaned block(s): $2. The 
failed block(s) are removed from "
             :         "the superblock but their disk space may be unreclaimed. 
Run 'kudu fs check --repair' "
             :         "to manually reclaim the disk space.
> nit: please make it more concise while keeping it actionable, e.g.
Done



--
To view, visit http://gerrit.cloudera.org:8080/24092
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id386d9fc8d0900839e229e66772f35299b3ef2e9
Gerrit-Change-Number: 24092
Gerrit-PatchSet: 5
Gerrit-Owner: Ashwani Raina <[email protected]>
Gerrit-Reviewer: Abhishek Chennaka <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Ashwani Raina <[email protected]>
Gerrit-Reviewer: Attila Bukor <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Marton Greber <[email protected]>
Gerrit-Reviewer: Zoltan Martonka <[email protected]>
Gerrit-Comment-Date: Thu, 09 Apr 2026 06:28:48 +0000
Gerrit-HasComments: Yes

Reply via email to