[ 
https://issues.apache.org/jira/browse/CASSANDRA-16634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Eriksson updated CASSANDRA-16634:
----------------------------------------
    Reviewers: Marcus Eriksson  (was: Marcus Eriksson)
       Status: Review In Progress  (was: Patch Available)

> Garbagecollect should not output all tables to L0 with 
> LeveledCompactionStrategy
> --------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16634
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16634
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Compaction
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>            Priority: Normal
>             Fix For: 3.11.x, 4.0.x
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> nodetool garbagecollect always outputs to L0 with LeveledCompactionStrategy.
> This is awful. On a large LCS table, it means that at the end of the 
> garbagecollect process, all of the data sits in L0.
>  
> This results in an awful sequence of useless temporary space usage and write 
> amplification:
>  # L0 is repeatedly size-tiered compacted until it no longer has too many 
> SSTables. If the original LCS table had 2000 tables... this takes a long time.
>  # L0 is compacted to L1 in one or a few very large compactions.
>  # L1 is compacted to L2, L2 to L3, L3 to L4, etc. Write amplification galore 
> (a rough cost sketch follows this list).
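>  
> To make the cost concrete, here is a back-of-the-envelope sketch. Every 
> number in it (sstable count, sstable size, size-tiered passes, max level) 
> is an illustrative assumption, not a measurement or a Cassandra internal:
> {code:java}
> // Rough estimate of extra bytes written when a full LCS table is dumped
> // into L0 and then re-leveled. Every constant here is an assumption.
> public class L0RelevelCost {
>     public static void main(String[] args) {
>         double dataGb = 2000 * 0.16; // ~2000 sstables at the 160 MB default
>         double written = 0;
> 
>         int stcsPasses = 4;             // assumed size-tiered rounds in L0,
>         written += stcsPasses * dataGb; // each rewriting most of the data
> 
>         written += dataGb;              // L0 -> L1 rewrites everything again
> 
>         int maxLevel = 5;               // L1 -> L2 -> ... -> L5: each
>         for (int level = 2; level <= maxLevel; level++)
>             written += dataGb;          // promotion rewrites nearly all data
> 
>         System.out.printf("~%.0f GB data, ~%.0f GB rewritten (%.1fx)%n",
>                           dataGb, written, written / dataGb);
>     }
> }
> {code}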
> Due to the above, 'nodetool garbagecollect' is close to worthless for large 
> LCS tables. A full compaction always has less write amplification and needs 
> a similar amount of temporary disk space. The only exception is if you run 
> 'nodetool garbagecollect' part-way and then use 'nodetool stop' to cancel it 
> before L0 grows too large. In that case, if you are lucky and the order in 
> which it chose to process SSTables coincides with the tables that have the 
> most disk space to clear, you might free up enough disk space to succeed in 
> your original goal.
>  
> However, from what I can tell, there is no good reason to move the output to 
> L0. Leaving the output table at the same SSTable level as the source does 
> not violate any of the LeveledCompactionStrategy placement rules, as the 
> output by definition covers a token range equal to or smaller than the 
> source's (see the sketch below).
> The only drawback is that if the output files are significantly smaller than 
> the source, the source level ends up under-sized. But that seems like a 
> problem for LCS to handle, not garbagecollect.
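>  
> A toy model of the proposed behavior (the types and names here are 
> hypothetical, not Cassandra's classes): garbagecollect rewrites each 
> sstable and keeps its level instead of emitting L0.
> {code:java}
> import java.util.List;
> 
> public class KeepLevelSketch {
>     record SSTable(String name, int level) {}
> 
>     // Hypothetical rewrite step: drop garbage, keep the source's level.
>     static SSTable garbageCollect(SSTable source) {
>         return new SSTable(source.name() + "-gc", source.level()); // was: 0
>     }
> 
>     public static void main(String[] args) {
>         List<SSTable> l5 = List.of(new SSTable("big-0001", 5),
>                                    new SSTable("big-0002", 5));
>         l5.stream().map(KeepLevelSketch::garbageCollect).forEach(t ->
>             System.out.println(t.name() + " stays in L" + t.level()));
>     }
> }
> {code}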
> LCS could have a "pull up" operation along the following lines. Assume a 
> table has L4 as its max level, and L3 and L4 are both under-sized. L3 can 
> attempt to 'pull up' any tables from L4 that do not overlap the token ranges 
> of the L3 tables. After that, it can choose to run some compactions that mix 
> L3 and L4 to pull data up into L3 if it is still significantly under-sized 
> (the non-overlap check is sketched below).
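>  
> A sketch of that non-overlap check, with token ranges modeled as 
> [first, last] longs. Everything here is illustrative, not Cassandra code:
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> 
> public class PullUpSketch {
>     record Range(long first, long last) {
>         boolean overlaps(Range o) { return first <= o.last && o.first <= last; }
>     }
> 
>     // An L4 sstable may be pulled up into an under-sized L3 only if its
>     // token range overlaps no existing L3 sstable.
>     static List<Range> pullUpCandidates(List<Range> l3, List<Range> l4) {
>         List<Range> candidates = new ArrayList<>();
>         for (Range r : l4)
>             if (l3.stream().noneMatch(r::overlaps))
>                 candidates.add(r);
>         return candidates;
>     }
> 
>     public static void main(String[] args) {
>         List<Range> l3 = List.of(new Range(0, 100), new Range(200, 300));
>         List<Range> l4 = List.of(new Range(110, 190), new Range(250, 400));
>         System.out.println(pullUpCandidates(l3, l4)); // only [110, 190]
>     }
> }
> {code}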
> From what I can tell, garbagecollect should just rewrite tables in place and 
> leave the compaction strategy to deal with any consequences.
> Moving to L0 is a bad idea. In addition to the extra write amplification and 
> the extreme increase in temporary disk space required, I observed the 
> following:
> A 'nodetool garbagecollect' run was placing a lot of pressure on the L0 of a 
> node. We stopped it about 20% of the way through, and the node managed to 
> compact down the top couple of levels. So we ran 'garbagecollect' again, but 
> the first tables it chose to operate on were in L1, not the 'leaves' in L5! 
> This was because the order in which SSTables are chosen currently does not 
> consider the level, and instead looks purely at the max timestamp in each 
> file. But because the prior garbagecollect had moved _very old_ data from L5 
> into L0, many tables in L1 and L2 now had very wide ranges between their min 
> and max timestamps: essentially some of the oldest and newest data all in 
> one table. This breaks the usual structure of an LCS table, where the oldest 
> data is at the highest levels (a level-aware ordering is sketched below).
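>  
> For illustration, a level-aware ordering for garbagecollect candidates 
> might look like this (my own sketch, not Cassandra's selection code): work 
> on deeper levels first, and only then prefer the oldest max timestamp.
> {code:java}
> import java.util.Comparator;
> import java.util.List;
> 
> public class GcOrderSketch {
>     record SSTable(String name, int level, long maxTimestamp) {}
> 
>     static final Comparator<SSTable> GC_ORDER =
>         Comparator.comparingInt(SSTable::level).reversed()   // L5 before L1
>                   .thenComparingLong(SSTable::maxTimestamp); // oldest first
> 
>     public static void main(String[] args) {
>         List<SSTable> tables = List.of(
>             new SSTable("a", 1, 50L),  // wide-timestamp L1 from a prior run
>             new SSTable("b", 5, 100L), // genuinely old leaf in L5
>             new SSTable("c", 5, 900L));
>         tables.stream().sorted(GC_ORDER).forEach(t ->
>             System.out.println("L" + t.level() + " " + t.name()));
>     }
> }
> {code}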
>  
> I hope that others agree that this is a bug and that it deserves a fix.
> I have a very simple patch for this that I will be creating a PR for soon: 3 
> lines for the code change, 70 lines for a new unit test.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
