[ 
https://issues.apache.org/jira/browse/CASSANDRA-8737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

C. Scott Andreas updated CASSANDRA-8737:
----------------------------------------
    Component/s: Compaction

> AdjacentDataCompactionStrategy
> ------------------------------
>
>                 Key: CASSANDRA-8737
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8737
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Compaction
>            Reporter: Benedict
>            Priority: Major
>             Fix For: 4.x
>
>
> In the original ticket for dealing with timeseries data that introduced DTCS, 
> the first suggestion was for an approach that compacted adjacent data (by 
> clustering columns) together until a single page (or some fixed multiple of 
> pages) on average contained only one partition's worth of data. The idea 
> would be to compact any sstables that overlap in their clustering components, 
> so that only one (or a fixed number) of sstables needs to be queried for any 
> clustering range. The upshot would be a tunable compaction burden that buys 
> optimal read behaviour, more explicitly defined than the decay in DTCS. 
> The basic idea would be to select boundary clustering prefixes based on the 
> current data occupancy within those ranges, falling roughly along the 
> boundaries of the existing sstables, but so that any overlapping tail falls 
> on one side or the other. We then compact all overlapping sstables, and split 
> the results onto one side or the other of the boundary (or across multiple 
> boundaries). If there are no historical updates, this gives near-optimal 
> behaviour: we only compact files until we reach our packing threshold (so 
> that reads are known to be at the configured efficiency), and then stop. If 
> updates to older records appear, they would be compacted into their boundary 
> buckets and left there until a boundary held enough files (probably 
> following normal STCS rules) to warrant compaction.
> The benefit is that such historical updates are still accounted for and 
> bounded, by comparison with DTCS, and the configuration parameters give more 
> tunable characteristics with explicit expectations (i.e. one seek per X 
> bytes read in a partition; a higher X may imply more compaction, a lower X 
> more merges and seeks on read). It may also permit some easy optimisations 
> further up the stack, since we can guarantee the boundaries of overlap.
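
The boundary-and-split scheme described above can be illustrated with a small
sketch. This is not code from the ticket or from Cassandra. It assumes a toy
model in which an sstable is just a sorted list of integer clustering keys,
and every name in it (choose_boundaries, overlapping_groups,
compact_and_split, compaction_round, sstables_touched, target_rows) is
hypothetical. Boundaries are picked purely by row occupancy, which is a
simplification of "falling roughly along the boundaries of the existing
sstables".

```python
from bisect import bisect_right
from typing import Dict, List

# Hypothetical toy model: one sstable is a sorted list of clustering keys.
SSTable = List[int]


def choose_boundaries(sstables: List[SSTable], target_rows: int) -> List[int]:
    """Pick split keys so each bucket holds about target_rows distinct rows."""
    keys = sorted({k for table in sstables for k in table})
    # Every target_rows-th distinct key opens a new bucket.
    return [keys[i] for i in range(target_rows, len(keys), target_rows)]


def overlapping_groups(sstables: List[SSTable]) -> List[List[SSTable]]:
    """Group sstables whose [min, max] clustering ranges overlap transitively."""
    groups: List[List[SSTable]] = []
    group_max = None
    for table in sorted((t for t in sstables if t), key=lambda t: t[0]):
        if groups and table[0] <= group_max:
            groups[-1].append(table)
            group_max = max(group_max, table[-1])
        else:
            groups.append([table])
            group_max = table[-1]
    return groups


def compact_and_split(group: List[SSTable], boundaries: List[int]) -> List[SSTable]:
    """Merge one overlapping group, then split the output at the boundaries."""
    merged = sorted({k for table in group for k in table})
    buckets: Dict[int, SSTable] = {}
    for key in merged:
        # bisect_right maps a key to the bucket whose range contains it;
        # a key equal to a boundary belongs to the bucket that boundary opens.
        buckets.setdefault(bisect_right(boundaries, key), []).append(key)
    return [buckets[b] for b in sorted(buckets)]


def compaction_round(sstables: List[SSTable], target_rows: int) -> List[SSTable]:
    """One pass: compact every overlapping group and split it at the boundaries."""
    boundaries = choose_boundaries(sstables, target_rows)
    result: List[SSTable] = []
    for group in overlapping_groups(sstables):
        if len(group) == 1:
            result.extend(group)  # nothing overlaps it, so leave it alone
        else:
            result.extend(compact_and_split(group, boundaries))
    return result


def sstables_touched(sstables: List[SSTable], lo: int, hi: int) -> int:
    """Count sstables whose clustering range intersects the query [lo, hi]."""
    return sum(1 for t in sstables if t and t[0] <= hi and t[-1] >= lo)


tables = [[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8], [20, 21]]
compacted = compaction_round(tables, target_rows=4)
print(sstables_touched(tables, 5, 6), sstables_touched(compacted, 5, 6))
```

In this toy run the three overlapping sstables are merged and split into two
non-overlapping ones, so a read of clustering keys 5..6 touches one sstable
instead of two. The sketch leaves out the STCS-style trigger for historical
updates that land in an already-compacted bucket, as well as any real sizing
by pages or bytes.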



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
