[ 
https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Björn Hegerfors updated CASSANDRA-6602:
---------------------------------------

    Attachment: cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt

> Compaction improvements to optimize time series data
> ----------------------------------------------------
>
>                 Key: CASSANDRA-6602
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Tupshin Harper
>            Assignee: Marcus Eriksson
>              Labels: compaction, performance
>             Fix For: 3.0
>
>         Attachments: 
> cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, 
> cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt
>
>
> There are some unique characteristics of many/most time series use cases that 
> both provide challenges, as well as provide unique opportunities for 
> optimizations.
> One of the major challenges is in compaction. The existing compaction 
> strategies will tend to re-compact data on disk at least a few times over the 
> lifespan of each data point, greatly increasing the cpu and IO costs of that 
> write.
> Compaction exists to
> 1) ensure that there aren't too many files on disk
> 2) ensure that data that should be contiguous (part of the same partition) is 
> laid out contiguously
> 3) deleting data due to ttls or tombstones
> The special characteristics of time series data allow us to optimize away all 
> three.
> Time series data
> 1) tends to be delivered in time order, with relatively constrained exceptions
> 2) often has a pre-determined and fixed expiration date
> 3) Never gets deleted prior to TTL
> 4) Has relatively predictable ingestion rates
> Note that I filed CASSANDRA-5561 and this ticket potentially replaces or 
> lowers the need for it. In that ticket, jbellis reasonably asks, how that 
> compaction strategy is better than disabling compaction.
> Taking that to heart, here is a compaction-strategy-less approach that could 
> be extremely efficient for time-series use cases that follow the above 
> pattern.
> (For context, I'm thinking of an example use case involving lots of streams 
> of time-series data with a 5GB per day ingestion rate, and a 1000 day 
> retention with TTL, resulting in an eventual steady state of 5TB per node)
> 1) You have an extremely large memtable (preferably off heap, if/when doable) 
> for the table, and that memtable is sized to be able to hold a lengthy window 
> of time. A typical period might be one day. At the end of that period, you 
> flush the contents of the memtable to an sstable and move to the next one. 
> This is basically identical to current behaviour, but with thresholds 
> adjusted so that you can ensure flushing at predictable intervals. (Open 
> question is whether predictable intervals is actually necessary, or whether 
> just waiting until the huge memtable is nearly full is sufficient)
> 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be 
> efficiently dropped once all of the columns have. (Another side note, it 
> might be valuable to have a modified version of CASSANDRA-3974 that doesn't 
> bother storing per-column TTL since it is required that all columns have the 
> same TTL)
> 3) Be able to mark column families as read/write only (no explicit deletes), 
> so no tombstones.
> 4) Optionally add back an additional type of delete that would delete all 
> data earlier than a particular timestamp, resulting in immediate dropping of 
> obsoleted sstables.
> The result is that for in-order delivered data, Every cell will be laid out 
> optimally on disk on the first pass, and over the course of 1000 days and 5TB 
> of data, there will "only" be 1000 5GB sstables, so the number of filehandles 
> will be reasonable.
> For exceptions (out-of-order delivery), most cases will be caught by the 
> extended (24 hour+) memtable flush times and merged correctly automatically. 
> For those that were slightly askew at flush time, or were delivered so far 
> out of order that they go in the wrong sstable, there is relatively low 
> overhead to reading from two sstables for a time slice, instead of one, and 
> that overhead would be incurred relatively rarely unless out-of-order 
> delivery was the common case, in which case, this strategy should not be used.
> Another possible optimization to address out-of-order would be to maintain 
> more than one time-centric memtables in memory at a time (e.g. two 12 hour 
> ones), and then you always insert into whichever one of the two "owns" the 
> appropriate range of time. By delaying flushing the ahead one until we are 
> ready to roll writes over to a third one, we are able to avoid any 
> fragmentation as long as all deliveries come in no more than 12 hours late 
> (in this example, presumably tunable).
> Anything that triggers compactions will have to be looked at, since there 
> won't be any. The one concern I have is the ramificaiton of repair. 
> Initially, at least, I think it would be acceptable to just write one sstable 
> per repair and not bother trying to merge it with other sstables.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to