Hello Kudu Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/24092

to look at the new patch set (#2).

Change subject: [fs] add metrics for untracked orphaned blocks
......................................................................

[fs] add metrics for untracked orphaned blocks

During a flush, if the deletion records of some orphaned blocks cannot
be committed (e.g. because of a transient underlying I/O error), those
blocks are still erased from orphaned_blocks_ and their deletion is
never retried. The un-reclaimed space on persistent storage can
accumulate to a significant size over time, and it can only be reclaimed
through user intervention by running 'kudu fs check --repair'.

Initially, this patch addressed the issue by keeping in orphaned_blocks_
those blocks whose deletion failed due to a transient error. That way,
at the next flush the metadata still lists those orphaned blocks, and
their deletion can be retried to reclaim the space.

However, with that change, TestEIODuringDelete started failing because
DeleteTabletData expects an empty orphaned_blocks_ set after calling
Flush() a couple of times. A second flush is required to clear the
orphaned blocks from the superblock even if their deletion failed
earlier. Since the blocks whose deletion failed were added back to the
set on every flush, the set never became empty, which caused the failure.
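
As an illustration of why the set never drained, here is a minimal C++
sketch of the retry-retaining approach. It is not Kudu code: BlockId,
FlushRetainingFailures and the try_delete callback are hypothetical
stand-ins, and the real flush path is assumed to be considerably more
involved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <set>

// Hypothetical stand-in for a block identifier; not Kudu's BlockId type.
using BlockId = uint64_t;

// Sketch of the initially proposed behavior: erase only the blocks whose
// deletion succeeded and keep the failed ones for the next flush.
// Returns the number of blocks still retained in the set.
inline size_t FlushRetainingFailures(
    std::set<BlockId>* orphaned,
    const std::function<bool(BlockId)>& try_delete) {
  assert(orphaned != nullptr);
  for (auto it = orphaned->begin(); it != orphaned->end();) {
    if (try_delete(*it)) {
      it = orphaned->erase(it);  // deletion succeeded: forget the block
    } else {
      ++it;                      // deletion failed: retry on next flush
    }
  }
  return orphaned->size();
}
```

With a callback that always fails for one block, the returned count
stays at 1 however many times the function is called, which mirrors the
non-empty set that DeleteTabletData tripped over.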

Clearing the orphaned_blocks_ set after the second flush, to maintain
the original behavior and expectation, may fix the TestEIODuringDelete
failure, but it opens the possibility of hitting KUDU-1060, where a
number of tombstoned tablets keep the records of all their orphaned
blocks in the persistent superblock, i.e. the block deletions are rolled
forward until the next restart. This can severely impact the startup
time of a tablet server that has a lot of tombstoned tablets.

In a nutshell, keeping these orphaned blocks in the set may not be of
much use if the disk error is persistent, not to mention the additional
handling required in cases like the ones mentioned above. It makes more
sense to rely on a 'kudu fs check --repair'-like workflow to remove
these stale orphaned blocks as a maintenance operation.

Given all of the above, this patch is re-purposed to focus only on
adding logs, metrics and stats that help identify scenarios with stale
orphaned block ID lists, and on logging the appropriate action for the
user, i.e. running 'kudu fs check --repair'. The original behavior of
the orphaned blocks set is preserved: all blocks are erased from the set
irrespective of the CommitDeletedBlocks() outcome.
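
The retained behavior combined with the new accounting can be sketched
roughly as follows. This is a self-contained illustration, not the code
in this patch: the struct, the function and the try_delete callback are
hypothetical, and only the three counter names are taken from the
metrics listed in this change. Their exact semantics (block counts
versus bytes) are assumed from the names.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <iostream>
#include <map>

// Hypothetical stand-in for a block identifier; not Kudu's BlockId type.
using BlockId = uint64_t;

// Plain counters named after the metrics added by this change. Real
// Kudu metrics are registered through its metrics framework instead.
struct OrphanCleanupMetrics {
  uint64_t orphaned_blocks_cleaned = 0;
  uint64_t orphaned_block_cleanup_failures = 0;
  uint64_t orphaned_block_cleanup_failures_bytes = 0;
};

// Tries to delete every orphaned block (id -> size in bytes), records
// the outcome in 'metrics', then clears the set regardless of failures.
inline void CleanupOrphanedBlocks(
    std::map<BlockId, uint64_t>* orphaned,
    const std::function<bool(BlockId)>& try_delete,
    OrphanCleanupMetrics* metrics) {
  assert(orphaned != nullptr && metrics != nullptr);
  for (const auto& entry : *orphaned) {
    if (try_delete(entry.first)) {
      ++metrics->orphaned_blocks_cleaned;
    } else {
      ++metrics->orphaned_block_cleanup_failures;
      metrics->orphaned_block_cleanup_failures_bytes += entry.second;
      // Per-block error log pointing the user at the repair workflow.
      std::cerr << "could not delete orphaned block " << entry.first
                << "; consider running 'kudu fs check --repair'\n";
    }
  }
  orphaned->clear();  // original behavior: erase all, success or not
}
```

Because the set is always emptied, a later call made after the error is
gone has nothing left to process, so the failure counters stay where
they were. That matches the two-stage scenario covered by the new unit
tests.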

Follow-up:
KUDU-829 - Create a separate patch that adds a maintenance op running in
the background to reclaim the space left by those orphaned blocks, using
the same logic as 'kudu fs check --repair'.

Highlights of the change:
- Add warning logs at CommitDeletedBlocks callers when the commit fails.
- Add per-block error logs for blocks whose deletion record could not be
  committed.
- Add the following metrics to track the outcome of orphaned block
  deletion:
  * orphaned_blocks_cleaned
  * orphaned_block_cleanup_failures
  * orphaned_block_cleanup_failures_bytes
- Add unit tests covering these scenarios, with metrics verification:
  * The usual path where orphaned blocks are deleted with no error.
  * No metrics are updated when orphaned block deletion is disabled.
  * An induced I/O error leaves orphaned blocks undeleted; they are
    eventually removed from the set and an action for the user is
    logged.
  * Two stages: first, induce an I/O error and verify that the metrics
    show an increased cleanup failure count; second, remove the induced
    error and verify that the metrics remain unchanged.

Change-Id: Id386d9fc8d0900839e229e66772f35299b3ef2e9
---
M src/kudu/fs/log_block_manager.cc
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_metadata-test.cc
M src/kudu/tablet/tablet_metadata.cc
M src/kudu/tablet/tablet_metadata.h
M src/kudu/tablet/tablet_metrics.cc
M src/kudu/tablet/tablet_metrics.h
8 files changed, 396 insertions(+), 28 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/92/24092/2
--
To view, visit http://gerrit.cloudera.org:8080/24092
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id386d9fc8d0900839e229e66772f35299b3ef2e9
Gerrit-Change-Number: 24092
Gerrit-PatchSet: 2
Gerrit-Owner: Ashwani Raina <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
