[
https://issues.apache.org/jira/browse/KUDU-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15387075#comment-15387075
]
Todd Lipcon commented on KUDU-1538:
-----------------------------------
A couple thoughts here:
- the above stuff is trying hard to avoid block leaks in the case of crashing
just after a metadata flush, but we already have the opposite leak in the case
of a crash just before a metadata flush (the in-progress blocks being written
as the compaction output are "committed" in the block manager but not
referenced anywhere). So, despite our best efforts, we _still_ need a more
thorough (e.g. mark-and-sweep-style) "garbage collector" for blocks (KUDU-829).
Maybe we should just throw away this best effort, accept that our current
offering is 'data leaky', and come up with a better holistic solution?
- the fact that we use randomized block IDs instead of sequential block IDs
makes reuse much more plausible. With sequentially-allocated IDs, we'd have to
"wrap around" our extremely large space to make this an issue, which is _way_
less likely. (I actually had a patch back in 2014 to do this, with some other
benefits, but it was only for the FBM)
- maybe we need to "reserve" those block IDs in the block manager until they're
actually fully removed from the metadata? I'm worried that this could be quite
complex, though.
- maybe a more 'WAL-like' way of doing the roll-forward, tied to specific
revisions of the TabletMetadata, is the way to go?
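
To make the "reserve" idea a bit more concrete, here is a minimal sketch in
Python (not Kudu code). The class and method names (ReservingBlockManager,
delete_orphan, release) are hypothetical, and the block manager is modelled as
two in-memory sets. The only point is that a deleted ID stays unavailable for
allocation until the caller says the metadata no longer lists it as orphaned.

```python
import random


class ReservingBlockManager:
    """Toy block manager: a deleted ID stays reserved until explicitly released."""

    def __init__(self, id_space_size, seed=None):
        self._rng = random.Random(seed)
        self._id_space_size = id_space_size
        self.live = set()      # IDs of blocks that currently hold data
        self.reserved = set()  # deleted IDs still named in some orphan list

    def create_block(self):
        if len(self.live | self.reserved) >= self._id_space_size:
            raise RuntimeError("block ID space exhausted")
        while True:
            candidate = self._rng.randrange(self._id_space_size)
            # Unlike a plain "not already in use" check, reserved IDs are skipped too.
            if candidate not in self.live and candidate not in self.reserved:
                self.live.add(candidate)
                return candidate

    def delete_orphan(self, block_id):
        # Drop the data but keep the ID reserved. Calling this again later
        # (for example during a roll-forward) changes nothing.
        self.live.discard(block_id)
        self.reserved.add(block_id)

    def release(self, block_id):
        # Assumed to be called only once no metadata lists block_id as orphaned.
        self.reserved.discard(block_id)
```

With a two-ID space, deleting one block as an orphan leaves no ID available
until release() is called, so a second tablet cannot pick up the orphaned ID
in between. The sketch ignores persistence of the reserved set itself, which
is probably where most of the complexity mentioned above would come from.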
> "Orphaned" block deletion can delete live blocks in use by other tablets
> ------------------------------------------------------------------------
>
> Key: KUDU-1538
> URL: https://issues.apache.org/jira/browse/KUDU-1538
> Project: Kudu
> Issue Type: Bug
> Components: fs, tablet
> Affects Versions: 0.9.1
> Reporter: Todd Lipcon
> Priority: Blocker
>
> Currently, we allocate block IDs using a random number generator, ensuring
> that the blocks we allocate are not already in use. Of course that doesn't
> preclude a block which was previously used and then deleted from having its
> ID reused.
> This interacts quite poorly with the "orphaned block" processing we have in
> tablet metadata. As a refresher, the "orphaned block" thing is used as
> follows:
> - during a compaction, we have the output blocks (newly written data) and the
> input blocks (data which has been compacted and is no longer relevant)
> - when the compaction finishes, we write a new TabletMetadata which swaps in
> the new blocks and removes the old blocks
> -- following that, we delete the old (input) blocks. Of course we can't
> delete the old blocks until after we've flushed the metadata, or else if we
> crashed before flushing the metadata we'd have lost track of the new block
> IDs.
> -- so, we defer the deletion of the input blocks until after the metadata has
> been flushed
> - this leaves open the opposite hole: if we defer the deletion of the old
> blocks, and we crash just _after_ flushing metadata, we would leak those old
> blocks and their disk space, which is no good either.
> -- so, when we flush metadata, we include the 'old blocks' in a
> 'orphan_blocks' array. On loading of metadata, we try to 'roll forward' the
> deletion to prevent the above-mentioned leak from being permanent.
> The "roll forward" behavior mentioned above is what seems to be eating
> blocks. We can now have the following bad interleaving:
> - a compaction in tablet A succeeds and lists block ID "X" as orphaned
> - a different tablet B re-uses block ID "X"
> - we restart the TS, or trigger a remote bootstrap (which also "cleans up"
> orphan blocks)
> -- it deletes block "X" from underneath tablet "B"
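
The interleaving above can be replayed with a small Python model. This is a
sketch under stated assumptions, not Kudu code: BlockManager, Tablet,
load_metadata and reproduce are hypothetical names, the candidate IDs are fed
in as a fixed list instead of coming from a random number generator so the
collision is deterministic, and the persisted orphan list is assumed to stay
in the flushed metadata after the deferred deletion has already run.

```python
class BlockManager:
    """Toy block manager: hands out candidate IDs that are not currently live."""

    def __init__(self, candidate_ids):
        self._candidates = iter(candidate_ids)  # stand-in for the random generator
        self.live = {}                          # block ID -> owning tablet name

    def create_block(self, owner):
        for block_id in self._candidates:
            # Only live IDs are rejected; a deleted ID can be handed out again.
            if block_id not in self.live:
                self.live[block_id] = owner
                return block_id
        raise RuntimeError("ran out of candidate IDs")

    def delete_block(self, block_id):
        self.live.pop(block_id, None)


class Tablet:
    def __init__(self, name, block_manager):
        self.name = name
        self.bm = block_manager
        self.blocks = set()           # "persisted": blocks the metadata references
        self.orphaned_blocks = set()  # "persisted": blocks awaiting deletion

    def write_block(self):
        block_id = self.bm.create_block(self.name)
        self.blocks.add(block_id)
        return block_id

    def compact(self):
        inputs = set(self.blocks)
        output = self.bm.create_block(self.name)
        # Metadata flush: swap in the output and record the inputs as orphaned.
        self.blocks = {output}
        self.orphaned_blocks = inputs
        # Deferred deletion after the flush. The orphan list is not rewritten here.
        for block_id in inputs:
            self.bm.delete_block(block_id)

    def load_metadata(self):
        # "Roll forward": delete whatever the persisted orphan list names.
        for block_id in self.orphaned_blocks:
            self.bm.delete_block(block_id)
        self.orphaned_blocks = set()


def reproduce():
    bm = BlockManager(candidate_ids=[7, 8, 7])
    a, b = Tablet("A", bm), Tablet("B", bm)
    x = a.write_block()       # tablet A owns block X
    a.compact()               # X is deleted and listed as orphaned
    reused = b.write_block()  # tablet B is handed the same ID
    assert reused == x
    a.load_metadata()         # restart: A rolls forward the deletion of X
    return reused, reused in bm.live
```

Calling reproduce() returns the reused ID together with a flag that is False:
tablet B's metadata still references the block, but the block manager no
longer has it.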