[ https://issues.apache.org/jira/browse/CASSANDRA-16769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams updated CASSANDRA-16769:
-----------------------------------------
    Change Category: Performance
         Complexity: Normal
        Component/s: Tool/sstable
      Fix Version/s: 3.11.x
             Status: Open  (was: Triage Needed)

> Add an option to nodetool garbagecollect that collects only a fraction of the 
> data
> ----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16769
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16769
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tool/sstable
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>            Priority: Normal
>             Fix For: 3.11.x
>
>
> nodetool garbagecollect can currently only run across an entire table.
> For a very large table, in many use cases, the SSTables most likely to be 
> full of 'garbage' are the oldest ones. With both LCS and STCS, the SSTables 
> with the lowest generation numbers are, under normal operation, the ones 
> holding the majority of the data that is masked by a tombstone or has been 
> overwritten.
> In order to make 'nodetool garbagecollect' more useful for such large tables, 
> I propose adding an option `--oldest-fraction` that takes a floating point 
> value between 0.00 and 1.00 and runs 'garbagecollect' over only the oldest 
> SSTables that cover at least that fraction of the data (a rough selection 
> sketch follows below).
> This would mean, for instance, that if you ran this with `--oldest-fraction 
> 0.1` every week, no SSTable would be more than 10 weeks old, and no data 
> that has been overwritten, TTL'd, or deleted would remain on disk more than 
> 10 weeks after it was originally written.
> In my use case, the oldest LCS SSTable is about 20 months old when the table 
> operates in steady state on Cassandra 3.11.x, yet only 5% of the data in 
> SSTables of that age has not been overwritten. This breaks some of the 
> performance promise of LCS: if your last level is 50% filled with overwritten 
> data, then your chance of finding data only in that level is significantly 
> lower than advertised.
> 'nodetool compact' is extremely expensive and currently not conducive to any 
> sort of incremental operation; nodetool garbagecollect run on a fraction of 
> the oldest data would be.
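
A rough sketch (not from the ticket) of how the proposed `--oldest-fraction`
selection could work. The class and method names below (SSTableInfo,
OldestFractionSelector.select) are hypothetical illustrations, not real
Cassandra APIs; the assumption is that SSTables are ordered by their maximum
data timestamp and picked, oldest first, until their cumulative on-disk size
reaches the requested fraction of the table's total size:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for an SSTable's metadata (requires Java 16+ records);
// not a real Cassandra class.
record SSTableInfo(String name, long maxDataTimestamp, long onDiskBytes) {}

public class OldestFractionSelector
{
    /**
     * Picks the oldest SSTables (by maximum data timestamp) whose cumulative
     * on-disk size covers at least `fraction` of the table's total size.
     */
    public static List<SSTableInfo> select(List<SSTableInfo> sstables, double fraction)
    {
        if (fraction <= 0.0 || fraction > 1.0)
            throw new IllegalArgumentException("fraction must be in (0.0, 1.0]");

        long totalBytes = sstables.stream().mapToLong(SSTableInfo::onDiskBytes).sum();
        long targetBytes = (long) Math.ceil(totalBytes * fraction);

        // Oldest data first.
        List<SSTableInfo> sorted = new ArrayList<>(sstables);
        sorted.sort(Comparator.comparingLong(SSTableInfo::maxDataTimestamp));

        List<SSTableInfo> selected = new ArrayList<>();
        long accumulated = 0;
        for (SSTableInfo s : sorted)
        {
            if (accumulated >= targetBytes)
                break;
            selected.add(s);
            accumulated += s.onDiskBytes();
        }
        return selected;
    }
}
```

Under this sketch, a weekly run such as `nodetool garbagecollect
--oldest-fraction 0.1 <keyspace> <table>` (flag name as proposed above) would
rewrite roughly the oldest tenth of the table's data each time.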



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
