[
https://issues.apache.org/jira/browse/CASSANDRA-16634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332805#comment-17332805
]
Brandon Williams commented on CASSANDRA-16634:
----------------------------------------------
[~marcuse] can you take a look?
> Garbagecollect should not output all tables to L0 with
> LeveledCompactionStrategy
> --------------------------------------------------------------------------------
>
> Key: CASSANDRA-16634
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16634
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Compaction
> Reporter: Scott Carey
> Assignee: Scott Carey
> Priority: Normal
> Fix For: 3.11.x, 4.0.x
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> nodetool garbagecollect always outputs to L0 with LeveledCompactionStrategy.
> This is awful. On a large LCS table, this means that at the end of the
> garbagecollect process, all data is in L0.
>
> This results in a wasteful sequence of temporary space usage and write
> amplification (rough numbers in the sketch below):
> # L0 is repeatedly size-tiered compacted until it no longer has too many
> SSTables. If the original LCS table had 2000 SSTables... this takes a long time
> # L0 is compacted into L1 in one or two very large compactions
> # L1 is compacted into L2, L2 into L3, and so on up the levels. Write
> amplification galore
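>
> To put rough numbers on the list above (a back-of-the-envelope sketch, not a
> measurement; the 1 TiB table size and the level count are assumptions):
> {code:java}
> class WriteAmplification {
>     public static void main(String[] args) {
>         long dataGiB = 1024;   // assumed 1 TiB of live data in the table
>         int maxLevel = 5;      // assumed deepest populated level
>
>         // Rewriting in place: garbagecollect writes each byte once.
>         long inPlaceGiB = dataGiB;
>
>         // Via L0: the garbagecollect write plus one promotion write per
>         // level (L0->L1 ... L4->L5), ignoring the extra size-tiered passes
>         // needed to shrink a 2000-sstable L0 in the first place.
>         long viaL0GiB = dataGiB * (1 + maxLevel);
>
>         System.out.printf("in place: %d GiB written, via L0: >= %d GiB%n",
>                           inPlaceGiB, viaL0GiB);
>     }
> }
> {code}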
> Due to the above, 'nodetool garbagecollect' is close to worthless for large
> LCS tables. A full compaction always causes less write amplification and
> requires similar temporary disk space. The only exception is if you run
> 'nodetool garbagecollect' part-way, and then use 'nodetool stop' to cancel it
> before L0 grows too large. In that case, if you are lucky and the order in
> which it chose to process SSTables happens to match the tables with the most
> reclaimable disk space, you might free up enough space to achieve your
> original goal.
>
> However, from what I can tell, there is no good reason to move the output to
> L0. Leaving the output table in the same SSTableLevel as the source table
> does not violate any of the LeveledCompactionStrategy placement rules, as the
> output by definition has a token range equal to or smaller than the source.
> The only drawback arises if the output files are significantly smaller than
> the source files, in which case the source level would become under-sized.
> But that seems like a problem for LCS to handle, not garbagecollect.
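>
> To make that claim concrete, here is a self-contained toy model (plain Java,
> not Cassandra code; the Range type and the numbers are invented for
> illustration). It checks the per-level rule that sstables above L0 must not
> overlap in token range, and shows that swapping a table for an output whose
> range is a subset of the source keeps the level valid:
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> class LevelInvariant {
>     record Range(long min, long max) {          // inclusive token bounds
>         boolean overlaps(Range o) { return min <= o.max && o.min <= max; }
>         boolean within(Range o)   { return o.min <= min && max <= o.max; }
>     }
>
>     // The LCS placement rule: no two sstables in a level above L0 overlap.
>     static boolean levelOk(List<Range> level) {
>         for (int i = 0; i < level.size(); i++)
>             for (int j = i + 1; j < level.size(); j++)
>                 if (level.get(i).overlaps(level.get(j))) return false;
>         return true;
>     }
>
>     public static void main(String[] args) {
>         List<Range> l3 = new ArrayList<>(List.of(
>                 new Range(0, 99), new Range(100, 199), new Range(200, 299)));
>         Range source = l3.remove(1);
>         Range output = new Range(120, 180);     // gc dropped some partitions
>         l3.add(1, output);
>         System.out.println("output within source: " + output.within(source));
>         System.out.println("L3 invariant holds:   " + levelOk(l3));
>     }
> }
> {code}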
>
> LCS could have a "pull up" operation that does something like the following
> (sketched below). Assume a table has L4 as its max level, and L3 and L4 are
> both under-sized. L3 can attempt to "pull up" any tables from L4 that do not
> overlap the token ranges of the L3 tables. After that, it can run some
> compactions that mix L3 and L4 to pull data up into L3 if it is still
> significantly under-sized.
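>
> A sketch of what that hypothetical "pull up" pass could look like (the names
> and types here are invented; nothing like this exists in Cassandra today):
> {code:java}
> import java.util.ArrayList;
> import java.util.Iterator;
> import java.util.List;
>
> class PullUp {
>     record Range(long min, long max) {
>         boolean overlaps(Range o) { return min <= o.max && o.min <= max; }
>     }
>
>     // Move every L4 range that overlaps nothing in L3 up into L3.
>     // Safe by construction: it cannot create an overlap within L3.
>     static void pullUp(List<Range> l3, List<Range> l4) {
>         for (Iterator<Range> it = l4.iterator(); it.hasNext(); ) {
>             Range candidate = it.next();
>             if (l3.stream().noneMatch(candidate::overlaps)) {
>                 it.remove();
>                 l3.add(candidate);
>             }
>         }
>         // A real implementation could follow with mixed L3/L4 compactions
>         // if L3 is still significantly under-sized.
>     }
>
>     public static void main(String[] args) {
>         List<Range> l3 = new ArrayList<>(List.of(new Range(0, 99)));
>         List<Range> l4 = new ArrayList<>(List.of(
>                 new Range(50, 149), new Range(200, 299)));
>         pullUp(l3, l4);
>         System.out.println("L3: " + l3 + "  L4: " + l4);
>         // (50,149) overlaps L3 and stays in L4; (200,299) is pulled up.
>     }
> }
> {code}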
>
> From what I can tell, garbagecollect should simply re-write tables in place
> and leave the compaction strategy to deal with any consequences.
>
> Moving to L0 is a bad idea. In addition to the extra write amplification and
> the extreme increase in temporary disk space required, I observed the
> following: a 'nodetool garbagecollect' was putting a lot of pressure on the
> L0 of a node. We stopped it about 20% of the way through the process, and the
> node managed to compact the top couple of levels back down. So we tried to
> run 'garbagecollect' again, but the first tables it chose to operate on were
> in L1, not the 'leaves' in L5! This is because the order in which SSTables
> are chosen currently does not consider the level; it looks purely at the max
> timestamp in each file. But because the prior garbagecollect had moved _very
> old_ data from L5 into L0, many tables in L1 and L2 now had very wide ranges
> between their min and max timestamps: essentially some of the oldest and
> newest data all in one table. This breaks the usual structure of an LCS
> table, where the oldest data sits at the highest levels.
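>
> One possible mitigation for the ordering problem, sketched as plain Java (the
> comparator and the SSTable record are illustrative, not the current Cassandra
> code): sort by level first, then by max timestamp, so the deep "leaf" levels
> are processed before stragglers in L1:
> {code:java}
> import java.util.Comparator;
> import java.util.List;
>
> class GcOrder {
>     record SSTable(String name, int level, long maxTimestamp) {}
>
>     // Deepest level first, then oldest data first within a level.
>     static final Comparator<SSTable> LEVEL_AWARE =
>             Comparator.comparingInt(SSTable::level).reversed()
>                       .thenComparingLong(SSTable::maxTimestamp);
>
>     public static void main(String[] args) {
>         List<SSTable> tables = List.of(
>                 new SSTable("a", 1, 10),   // old data stranded in L1
>                 new SSTable("b", 5, 20),
>                 new SSTable("c", 5, 5));
>         tables.stream().sorted(LEVEL_AWARE)
>               .forEach(t -> System.out.println(t.name() + " L" + t.level()));
>         // -> c L5, b L5, a L1: the L1 straggler no longer jumps the queue
>     }
> }
> {code}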
>
> I hope that others agree that this is a bug and that it deserves a fix.
> I have a very simple patch that I will be opening a PR for soon: 3 lines for
> the code change, 70 lines for a new unit test.
>