[ https://issues.apache.org/jira/browse/CASSANDRA-8737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
C. Scott Andreas updated CASSANDRA-8737:
----------------------------------------
    Component/s: Compaction

> AdjacentDataCompactionStrategy
> ------------------------------
>
>                 Key: CASSANDRA-8737
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8737
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Compaction
>            Reporter: Benedict
>            Priority: Major
>             Fix For: 4.x
>
>
> In the original ticket for dealing with timeseries data that introduced DTCS,
> the first suggestion was for an approach that compacted adjacent data (by
> clustering columns) together until a single page (or some fixed multiple of
> pages) on average contained only one partition's worth of data. The idea
> would be to compact any sstables that overlap their clustering components, so
> that only one (or a fixed number) of sstables need to be queried for any
> clustering range. The upshot of this would be tunable compaction burden to
> get optimal read behaviour, more explicitly defined than the decay in DTCS.
> The basic idea would be to select boundary clustering prefixes based on the
> current data occupancy within those ranges, falling roughly along the
> boundaries of the existing sstables, but so that any overlapping tail falls
> one side or the other. We then compact all overlapping sstables, and split
> the results into one side or another of the boundary (or across multiple
> boundaries). If there are no historical updates, this gives pretty optimal
> behaviour; we only compact files until we get to our packing threshold (so
> that reads are known to be at the configured efficiency), and then stop. If
> updates to older records appear, they would be compacted into their boundary
> buckets, and left there until we had enough files in a boundary (probably
> following normal STCS rules) that it warranted compaction.
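The compact-and-split step described above can be sketched roughly as follows. This is only an illustration, not Cassandra code: integers stand in for clustering prefixes, an sstable is modelled as a sorted list of rows, and the names `overlapping_groups` and `compact_and_split` are hypothetical. How the boundaries themselves are chosen (from data occupancy) is left out; they are simply passed in.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# Assumption: ints stand in for clustering prefixes; each sstable is a
# non-empty list of (clustering key, value) rows sorted by key.
Row = Tuple[int, str]
SSTable = List[Row]


def overlapping_groups(sstables: List[SSTable]) -> List[List[SSTable]]:
    """Group sstables whose [first, last] clustering ranges overlap, transitively."""
    groups: List[List[SSTable]] = []
    group_end = None
    for table in sorted(sstables, key=lambda t: t[0][0]):
        first, last = table[0][0], table[-1][0]
        if group_end is not None and first <= group_end:
            groups[-1].append(table)          # overlaps the current group
            group_end = max(group_end, last)
        else:
            groups.append([table])            # starts a new, disjoint group
            group_end = last
    return groups


def compact_and_split(group: List[SSTable], boundaries: List[int]) -> List[SSTable]:
    """Merge one overlapping group, then split the result at the boundaries.

    Output bucket i holds keys k with boundaries[i-1] <= k < boundaries[i].
    Assumption: on a key collision the later table in `group` wins.
    """
    merged: Dict[int, str] = {}
    for table in group:
        for key, value in table:
            merged[key] = value
    buckets: List[SSTable] = [[] for _ in range(len(boundaries) + 1)]
    for key in sorted(merged):
        buckets[bisect_right(boundaries, key)].append((key, merged[key]))
    return [bucket for bucket in buckets if bucket]
```

For example, two sstables covering keys 1..5 and 4..9 form one group; compacting that group with a single boundary at 5 yields two non-overlapping outputs, one on each side of the boundary.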
> The benefit is that such historical updates are still accounted for and
> bounded by comparison to DTCS, and the configuration parameters give more
> tunable characteristics, with explicit expectations (i.e. one seek per X
> bytes read in a partition; higher X may imply more compaction, lower more
> merges and seeks on read). It also may permit us some easy optimisations
> further up the stack, since we can guarantee the boundaries of overlap.
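To make the read-side expectation concrete, the sketch below counts how many sstables a clustering-range read would have to consult, before and after overlapping files are split at a boundary. It is a toy model under stated assumptions: each sstable is reduced to an inclusive `(first, last)` range of integer clustering keys, and `sstables_for_read` and `max_overlap` are hypothetical helper names, not part of Cassandra.

```python
from typing import List, Tuple

# Assumption: an sstable is summarised by its inclusive (first, last)
# clustering bounds, with ints standing in for clustering prefixes.
Range = Tuple[int, int]


def sstables_for_read(ranges: List[Range], lo: int, hi: int) -> int:
    """Number of sstables whose range intersects the read range [lo, hi]."""
    return sum(1 for first, last in ranges if first <= hi and lo <= last)


def max_overlap(ranges: List[Range]) -> int:
    """Worst-case number of sstables covering any single clustering key."""
    events = []
    for first, last in ranges:
        events.append((first, 0))  # open; sorts before a close at the same key
        events.append((last, 1))   # close
    depth = worst = 0
    for _, kind in sorted(events):
        if kind == 0:
            depth += 1
            worst = max(worst, depth)
        else:
            depth -= 1
    return worst


before = [(1, 5), (4, 9), (20, 30)]   # two files overlap on keys 4..5
after = [(1, 4), (5, 9), (20, 30)]    # same data split at boundary 5
```

In this model `max_overlap(before)` is 2 and `max_overlap(after)` is 1, so after the split any single key is covered by one file. A range read that straddles a boundary still touches one file per bucket it crosses, which is the cost the boundary spacing would tune.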