[ 
https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999481#comment-13999481
 ] 

Björn Hegerfors edited comment on CASSANDRA-6602 at 5/16/14 9:35 AM:
---------------------------------------------------------------------

As my master's thesis, I've implemented a strategy called 
"DateTieredCompactionStrategy" (DTCS for short). Marcus got me started on this 
project. The way it works is that it uses a target time span which starts a 
little while back in time and then moves further back, increasing in size as it 
goes. If an sstable has getMinTimestamp() within the time span, it "hits" the 
target, and all sstables hitting the same target end up in the same bucket for 
compaction.
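
To make the bucketing concrete, here's a minimal sketch of the idea (the names 
and the growth factor are just for illustration, this is not the actual patch 
code):

{code:java}
import java.util.ArrayList;
import java.util.List;

class Target {
    final long start; // inclusive lower bound of the time span
    final long end;   // exclusive upper bound

    Target(long start, long end) { this.start = start; this.end = end; }

    // An sstable "hits" this target if its minimum timestamp falls inside the span.
    boolean hits(long sstableMinTimestamp) {
        return sstableMinTimestamp >= start && sstableMinTimestamp < end;
    }
}

class Bucketing {
    // Group sstables (represented here just by their min timestamps) by which
    // target they hit. Targets start at "now" and grow as they move back in time.
    static List<List<Long>> getBuckets(List<Long> minTimestamps, long now, long timeUnit) {
        List<List<Long>> buckets = new ArrayList<>();
        long end = now + 1;      // make "now" itself fall inside the newest span
        long size = timeUnit;
        while (end > 0) {
            Target target = new Target(Math.max(0, end - size), end);
            List<Long> bucket = new ArrayList<>();
            for (long ts : minTimestamps) {
                if (target.hits(ts)) bucket.add(ts);
            }
            if (!bucket.isEmpty()) buckets.add(bucket);
            end = target.start;
            size *= 4;           // each older target covers a longer span (factor is illustrative)
        }
        return buckets;
    }
}
{code}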

One thing that I'm not sure about is that I've set the current time (a variable 
"now") to the maximum timestamp across all sstables. Should system time or the 
latest memtable timestamp (if possible) be used instead? I'm also considering 
subtracting a "start timestamp" from all timestamps, because that would ensure 
that the sstable with the oldest timestamps ends up the biggest (it has to do 
with the integer division used for the targets).
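
In other words, something along these lines (again only a sketch with 
illustrative names):

{code:java}
import java.util.Collections;
import java.util.List;

class TimeOrigin {
    // "now" taken as the largest timestamp found in any sstable, rather than system time.
    static long now(List<Long> maxTimestamps) {
        return Collections.max(maxTimestamps);
    }

    // Normalizing against a "start timestamp" (the smallest timestamp in any sstable)
    // before the integer division means the oldest data always maps to target 0,
    // no matter where the raw epoch values happen to fall relative to the window size.
    static long targetIndex(long timestamp, long start, long timeUnit) {
        return (timestamp - start) / timeUnit;
    }
}
{code}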

This can of course be tweaked further, but my tests have shown that more than 
99.9% of all timestamps can be found in only one sstable (and there would be 
ways to make that 100%). Under the assumption that writes come in uniformly, 
the time spent compacting and the number of sstables on disk should be similar 
to STCS. In my tests, the number of sstables has tended to be a small constant 
factor higher with DTCS. These things can be tuned to some extent by modifying 
minCompactionThreshold and a DTCS-specific option called timeUnit. I also added 
a maxSSTableAge option to stop compaction of old sstables.
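
Roughly, those two knobs act like this (a sketch only, not the actual patch):

{code:java}
import java.util.ArrayList;
import java.util.List;

class Options {
    // sstables older than maxSSTableAge are never considered for compaction again.
    static List<Long> withinAge(List<Long> minTimestamps, long now, long maxSSTableAge) {
        List<Long> eligible = new ArrayList<>();
        for (long ts : minTimestamps) {
            if (now - ts <= maxSSTableAge) eligible.add(ts);
        }
        return eligible;
    }

    // A bucket is only submitted for compaction once it has grown large enough.
    static boolean shouldCompact(List<Long> bucket, int minCompactionThreshold) {
        return bucket.size() >= minCompactionThreshold;
    }
}
{code}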

DTCS is just a small step, and this ticket discusses a number of great features 
that would make things even better. Hopefully this gives you guys some new 
ideas, and DTCS might also be of use a lot sooner than version 3.0. Tell me 
what you think!

EDIT: I made a few changes. One was a silly thing where I had set the default 
options thinking that timestamps were in microseconds rather than milliseconds. 
The other change was removing the "- 1" from getInitialTarget. This changes the 
meaning of the time_unit option to something that's probably more useful and 
makes more sense. Before this change, time_unit roughly meant "the time that 
needs to pass before an sstable is eligible for compaction". I had done that to 
prevent the newest sstables from being compacted repeatedly and unnecessarily 
before all sstables of the current time unit had come in.

But now I realize that that's probably the opposite of what it should be doing. 
With this change, it acts very much like an analogue of the min_sstable_size 
option in STCS. All sstables with a minTimestamp in the current timeUnit 
(sstable.getMinTimestamp() / timeUnit == now / timeUnit) keep getting compacted 
whenever there are at least minCompactionThreshold of them. The newest, and 
therefore smallest, sstables are after all the cheapest to compact. This should 
keep the sstable count down and prevent fragmentation among young sstables. I 
assume this is the better way to go?
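
In code, that behaviour is roughly the following (again a sketch, not the patch 
itself):

{code:java}
import java.util.ArrayList;
import java.util.List;

class CurrentWindow {
    // All sstables whose min timestamp falls in the same timeUnit window as "now"
    // are compacted together, but only once at least minCompactionThreshold of
    // them have accumulated (much like min_sstable_size buckets in STCS).
    static List<Long> candidates(List<Long> minTimestamps, long now, long timeUnit,
                                 int minCompactionThreshold) {
        List<Long> current = new ArrayList<>();
        for (long ts : minTimestamps) {
            if (ts / timeUnit == now / timeUnit) current.add(ts);
        }
        return current.size() >= minCompactionThreshold ? current : new ArrayList<Long>();
    }
}
{code}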


> Compaction improvements to optimize time series data
> ----------------------------------------------------
>
>                 Key: CASSANDRA-6602
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Tupshin Harper
>            Assignee: Marcus Eriksson
>              Labels: compaction, performance
>             Fix For: 3.0
>
>         Attachments: 
> cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, 
> cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt
>
>
> There are some unique characteristics of many/most time series use cases that 
> both provide challenges, as well as provide unique opportunities for 
> optimizations.
> One of the major challenges is in compaction. The existing compaction 
> strategies will tend to re-compact data on disk at least a few times over the 
> lifespan of each data point, greatly increasing the cpu and IO costs of that 
> write.
> Compaction exists to
> 1) ensure that there aren't too many files on disk
> 2) ensure that data that should be contiguous (part of the same partition) is 
> laid out contiguously
> 3) delete data due to TTLs or tombstones
> The special characteristics of time series data allow us to optimize away all 
> three.
> Time series data
> 1) tends to be delivered in time order, with relatively constrained exceptions
> 2) often has a pre-determined and fixed expiration date
> 3) never gets deleted prior to its TTL
> 4) has relatively predictable ingestion rates
> Note that I filed CASSANDRA-5561 and this ticket potentially replaces or 
> lowers the need for it. In that ticket, jbellis reasonably asks how that 
> compaction strategy is better than disabling compaction.
> Taking that to heart, here is a compaction-strategy-less approach that could 
> be extremely efficient for time-series use cases that follow the above 
> pattern.
> (For context, I'm thinking of an example use case involving lots of streams 
> of time-series data with a 5GB per day ingestion rate, and a 1000 day 
> retention with TTL, resulting in an eventual steady state of 5TB per node)
> 1) You have an extremely large memtable (preferably off heap, if/when doable) 
> for the table, and that memtable is sized to be able to hold a lengthy window 
> of time. A typical period might be one day. At the end of that period, you 
> flush the contents of the memtable to an sstable and move to the next one. 
> This is basically identical to current behaviour, but with thresholds 
> adjusted so that you can ensure flushing at predictable intervals. (An open 
> question is whether predictable intervals are actually necessary, or whether 
> just waiting until the huge memtable is nearly full is sufficient.)
> 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be 
> efficiently dropped once all of the columns have expired. (Another side note, it 
> might be valuable to have a modified version of CASSANDRA-3974 that doesn't 
> bother storing per-column TTL since it is required that all columns have the 
> same TTL)
> 3) Be able to mark column families as read/write only (no explicit deletes), 
> so no tombstones.
> 4) Optionally add back an additional type of delete that would delete all 
> data earlier than a particular timestamp, resulting in immediate dropping of 
> obsoleted sstables.
> The result is that for in-order delivered data, every cell will be laid out 
> optimally on disk on the first pass, and over the course of 1000 days and 5TB 
> of data, there will "only" be 1000 5GB sstables, so the number of filehandles 
> will be reasonable.
> For exceptions (out-of-order delivery), most cases will be caught by the 
> extended (24 hour+) memtable flush times and merged correctly automatically. 
> For those that were slightly askew at flush time, or were delivered so far 
> out of order that they go in the wrong sstable, there is relatively low 
> overhead to reading from two sstables for a time slice, instead of one, and 
> that overhead would be incurred relatively rarely unless out-of-order 
> delivery was the common case, in which case, this strategy should not be used.
> Another possible optimization to address out-of-order delivery would be to 
> maintain more than one time-centric memtable in memory at a time (e.g. two 
> 12-hour ones), and then always insert into whichever of the two "owns" the 
> appropriate range of time. By delaying the flush of the newer one until we are 
> ready to roll writes over to a third one, we are able to avoid any 
> fragmentation as long as all deliveries come in no more than 12 hours late 
> (in this example, presumably tunable).
> Anything that triggers compactions will have to be looked at, since there 
> won't be any. The one concern I have is the ramification of repair. 
> Initially, at least, I think it would be acceptable to just write one sstable 
> per repair and not bother trying to merge it with other sstables.
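
The two-memtable idea in the quoted description above could look roughly like 
this (a hypothetical sketch; all names are illustrative and not from the 
Cassandra codebase):

{code:java}
import java.util.ArrayList;
import java.util.List;

class TimeWindowedBuffers {
    static final long WINDOW_MILLIS = 12L * 60 * 60 * 1000; // two 12-hour windows, as in the example

    static class Buffer {
        final long windowStart;                      // inclusive start of the window this buffer owns
        final List<Long> cells = new ArrayList<>();  // stand-in for real memtable contents
        Buffer(long windowStart) { this.windowStart = windowStart; }
        boolean owns(long ts) { return ts >= windowStart && ts < windowStart + WINDOW_MILLIS; }
    }

    private Buffer older;
    private Buffer newer;

    TimeWindowedBuffers(long firstWindowStart) {
        older = new Buffer(firstWindowStart);
        newer = new Buffer(firstWindowStart + WINDOW_MILLIS);
    }

    // Route each write to whichever buffer owns its timestamp. Only when a write
    // lands past the newer window is the older buffer flushed and the pair rotated,
    // so anything up to 12 hours late still lands in memory and avoids fragmentation.
    void write(long timestamp) {
        if (older.owns(timestamp)) {
            older.cells.add(timestamp);
        } else if (newer.owns(timestamp)) {
            newer.cells.add(timestamp);
        } else if (timestamp >= newer.windowStart + WINDOW_MILLIS) {
            flush(older);
            older = newer;
            newer = new Buffer(newer.windowStart + WINDOW_MILLIS);
            write(timestamp); // retry against the rotated buffers
        }
        // Writes older than both windows are the rare out-of-order case that ends
        // up in the "wrong" sstable; they are ignored in this sketch.
    }

    private void flush(Buffer b) {
        // Placeholder: write the buffer's contents out as a single sstable.
    }
}
{code}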



--
This message was sent by Atlassian JIRA
(v6.2#6252)
