[
https://issues.apache.org/jira/browse/CASSANDRA-8737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Philip Thompson updated CASSANDRA-8737:
---------------------------------------
Issue Type: New Feature (was: Bug)
> AdjacentDataCompactionStrategy
> ------------------------------
>
> Key: CASSANDRA-8737
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8737
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Benedict
> Fix For: 3.0
>
>
> In the original ticket on timeseries data, which introduced DTCS, the first
> suggestion was for an approach that compacted adjacent data (by clustering
> columns) together until a single page (or some fixed multiple of pages) on
> average contained only one partition's worth of data. The idea would be to
> compact any sstables that overlap in their clustering components, so that
> only one sstable (or a fixed number of sstables) needs to be queried for any
> clustering range. The upshot of this would be a tunable compaction burden
> for optimal read behaviour, more explicitly defined than the decay in DTCS.
>
> The basic idea would be to select boundary clustering prefixes based on the
> current data occupancy within those ranges, falling roughly along the
> boundaries of the existing sstables, but so that any overlapping tail falls
> on one side or the other. We then compact all overlapping sstables, and split
> the results to one side or the other of the boundary (or across multiple
> boundaries). If there are no historical updates, this gives near-optimal
> behaviour: we only compact files until we reach our packing threshold (so
> that reads are known to be at the configured efficiency), and then stop. If
> updates to older records appear, they would be compacted into their boundary
> buckets and left there until a boundary held enough files (probably
> following normal STCS rules) to warrant compaction.
>
> The benefit is that such historical updates are still accounted for and
> bounded, by comparison with DTCS, and the configuration parameters give more
> tunable characteristics with explicit expectations (i.e. one seek per X
> bytes read in a partition; a higher X may imply more compaction, a lower X
> more merges and seeks on read). It may also permit some easy optimisations
> further up the stack, since we can guarantee the boundaries of overlap.
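
As a rough illustration of the two steps described above (finding sstables that overlap in clustering range, then splitting merged output at chosen boundaries), here is a minimal Python sketch. It is not Cassandra code: the SSTable class, its min_ck/max_ck fields, and both function names are hypothetical, and plain integers stand in for clustering prefixes. How boundaries would actually be chosen from data occupancy is left out, since the ticket does not specify it.

```python
import bisect
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SSTable:
    """Hypothetical stand-in for an sstable's clustering coverage."""
    name: str
    min_ck: int  # smallest clustering value present (inclusive)
    max_ck: int  # largest clustering value present (inclusive)


def overlapping_groups(sstables: List[SSTable]) -> List[List[SSTable]]:
    """Group sstables whose clustering ranges overlap, transitively.

    Groups with more than one member are compaction candidates; a group
    of one already satisfies "one sstable per clustering range".
    """
    groups: List[List[SSTable]] = []
    current: List[SSTable] = []
    current_max = None
    for table in sorted(sstables, key=lambda t: (t.min_ck, t.max_ck)):
        if current and table.min_ck <= current_max:
            # Overlaps the running group: extend it.
            current.append(table)
            current_max = max(current_max, table.max_ck)
        else:
            # No overlap: close the running group and start a new one.
            if current:
                groups.append(current)
            current = [table]
            current_max = table.max_ck
    if current:
        groups.append(current)
    return groups


def split_at_boundaries(merged: List[int], boundaries: List[int]) -> List[List[int]]:
    """Split merged clustering values into buckets at sorted boundaries.

    Assumption: each boundary is the inclusive lower bound of the bucket
    to its right, so every value lands on exactly one side.
    """
    buckets: List[List[int]] = [[] for _ in range(len(boundaries) + 1)]
    for ck in merged:
        buckets[bisect.bisect_right(boundaries, ck)].append(ck)
    return buckets
```

In this sketch, a late update to an old record would show up as a small sstable that overlaps an existing group; whether that group is recompacted immediately or only once it holds enough files (the STCS-like rule mentioned above) is a policy decision layered on top of the grouping.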