[
https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881663#comment-13881663
]
Tupshin Harper commented on CASSANDRA-6602:
-------------------------------------------
After some side-channel discussion, I think that as long as your approach is
easy to implement, you should go for it. It actually sounds quite promising.
> Enhancements to optimize for the storing of time series data
> ------------------------------------------------------------
>
> Key: CASSANDRA-6602
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6602
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Tupshin Harper
> Fix For: 3.0
>
>
> There are some unique characteristics of many/most time series use cases that
> both pose challenges and provide unique opportunities for optimization.
> One of the major challenges is compaction. The existing compaction strategies
> will tend to re-compact data on disk at least a few times over the lifespan of
> each data point, greatly increasing the CPU and I/O cost of each write.
> Compaction exists to
> 1) ensure that there aren't too many files on disk
> 2) ensure that data that should be contiguous (part of the same partition) is
> laid out contiguously
> 3) delete data due to TTLs or tombstones
> The special characteristics of time series data allow us to optimize away all
> three.
> Time series data
> 1) tends to be delivered in time order, with relatively constrained exceptions
> 2) often has a pre-determined and fixed expiration date
> 3) never gets deleted prior to its TTL
> 4) has relatively predictable ingestion rates
> Note that I filed CASSANDRA-5561, and this ticket potentially replaces or
> lowers the need for it. In that ticket, jbellis reasonably asks how that
> compaction strategy is better than simply disabling compaction.
> Taking that to heart, here is a compaction-strategy-less approach that could
> be extremely efficient for time-series use cases that follow the above
> pattern.
> (For context, I'm thinking of an example use case involving lots of streams
> of time-series data with a 5GB per day ingestion rate, and a 1000 day
> retention with TTL, resulting in an eventual steady state of 5TB per node)
> 1) You have an extremely large memtable (preferably off heap, if/when doable)
> for the table, and that memtable is sized to be able to hold a lengthy window
> of time. A typical period might be one day. At the end of that period, you
> flush the contents of the memtable to an sstable and move on to the next one.
> This is basically identical to current behaviour, but with thresholds adjusted
> so that you can ensure flushing at predictable intervals. (An open question is
> whether predictable intervals are actually necessary, or whether just waiting
> until the huge memtable is nearly full is sufficient.) A toy sketch of this
> windowed-flush idea follows step 4 below.
> 2) Combine this behaviour with CASSANDRA-5228 so that sstables will be
> efficiently dropped once all of their columns have expired. (As another side
> note, it might be valuable to have a modified version of CASSANDRA-3974 that
> doesn't bother storing a per-column TTL, since it is required that all columns
> have the same TTL.)
> 3) Be able to mark column families as read/write only (no explicit deletes),
> so that no tombstones are ever created.
> 4) Optionally add back an additional type of delete that would delete all
> data earlier than a particular timestamp, resulting in immediate dropping of
> obsoleted sstables.
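> To make steps 1, 2 and 4 concrete, here is a toy, in-memory sketch of the
> windowed-flush idea. With the example numbers above (5GB/day ingestion and
> one-day windows), the big memtable would need to absorb roughly 5GB before
> each flush. None of the class or method names below are existing Cassandra
> code; they are made up purely for illustration.
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> import java.util.NavigableMap;
> import java.util.TreeMap;
> import java.util.concurrent.TimeUnit;
>
> /** Toy model of steps 1, 2 and 4: one sstable per time window, no compaction. */
> public class WindowedFlushSketch {
>
>     /** A flushed, immutable chunk covering one time window. */
>     static final class WindowSSTable {
>         final long windowStartMillis;
>         // Upper bound on expiration: every cell was written before the flush,
>         // and all cells share the same TTL, so all expire by windowEnd + ttl.
>         final long maxExpirationMillis;
>         final NavigableMap<Long, byte[]> cellsByTimestamp;
>
>         WindowSSTable(long windowStartMillis, long maxExpirationMillis,
>                       NavigableMap<Long, byte[]> cellsByTimestamp) {
>             this.windowStartMillis = windowStartMillis;
>             this.maxExpirationMillis = maxExpirationMillis;
>             this.cellsByTimestamp = cellsByTimestamp;
>         }
>     }
>
>     private final long windowMillis;   // e.g. one day
>     private final long ttlMillis;      // fixed, table-wide TTL (e.g. 1000 days)
>     private final List<WindowSSTable> sstables = new ArrayList<>();
>     private long currentWindowStart;
>     private NavigableMap<Long, byte[]> memtable = new TreeMap<>();
>
>     WindowedFlushSketch(long windowMillis, long ttlMillis, long nowMillis) {
>         this.windowMillis = windowMillis;
>         this.ttlMillis = ttlMillis;
>         this.currentWindowStart = nowMillis - (nowMillis % windowMillis);
>     }
>
>     /** Step 1: writes accumulate in one big memtable; late cells simply land
>      *  in whatever memtable is active (the two-window variant later fixes that). */
>     void write(long timestampMillis, byte[] value, long nowMillis) {
>         maybeRoll(nowMillis);
>         memtable.put(timestampMillis, value);
>     }
>
>     void maybeRoll(long nowMillis) {
>         if (nowMillis >= currentWindowStart + windowMillis) {
>             flushCurrentWindow();
>             currentWindowStart = nowMillis - (nowMillis % windowMillis);
>         }
>         // Steps 2/4: whole sstables are dropped once every cell in them has expired.
>         sstables.removeIf(s -> s.maxExpirationMillis <= nowMillis);
>     }
>
>     /** The flush writes exactly one sstable per window; it is never re-compacted. */
>     private void flushCurrentWindow() {
>         if (memtable.isEmpty())
>             return;
>         sstables.add(new WindowSSTable(currentWindowStart,
>                                        currentWindowStart + windowMillis + ttlMillis,
>                                        memtable));
>         memtable = new TreeMap<>();
>     }
>
>     public static void main(String[] args) {
>         long day = TimeUnit.DAYS.toMillis(1);
>         WindowedFlushSketch table = new WindowedFlushSketch(day, 1000 * day, 0);
>         // Three days of in-order hourly writes: each day lands in its own sstable.
>         for (long t = 0; t < 3 * day; t += TimeUnit.HOURS.toMillis(1))
>             table.write(t, new byte[]{1}, t);
>         table.maybeRoll(3 * day); // roll past day 3 so the last window flushes too
>         System.out.println("sstables: " + table.sstables.size()); // prints 3
>     }
> }
> {code}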
> The result is that for in-order delivered data, every cell will be laid out
> optimally on disk on the first pass, and over the course of 1000 days and 5TB
> of data, there will "only" be 1000 5GB sstables, so the number of file handles
> will be reasonable.
> For exceptions (out-of-order delivery), most cases will be caught by the
> extended (24 hour+) memtable flush times and merged correctly automatically.
> For data that was slightly askew at flush time, or delivered so far out of
> order that it ends up in the wrong sstable, the overhead of reading from two
> sstables for a time slice instead of one is relatively low, and it would be
> incurred relatively rarely. If out-of-order delivery were the common case,
> this strategy should not be used.
> Another possible optimization to address out-of-order delivery would be to
> maintain more than one time-centric memtable in memory at a time (e.g. two 12
> hour ones), and always insert into whichever of the two "owns" the appropriate
> range of time. By delaying the flush of the older one until we are ready to
> roll writes over to a third one, we are able to avoid any fragmentation as
> long as all deliveries come in no more than one window late (12 hours in this
> example; presumably tunable).
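> To make that routing concrete, here is a toy sketch in the same spirit as the
> one above (again, made-up names, not existing Cassandra code):
> {code:java}
> import java.util.NavigableMap;
> import java.util.TreeMap;
> import java.util.concurrent.TimeUnit;
>
> /** Toy illustration of two time-window memtables with routing by cell timestamp. */
> public class DualWindowRoutingSketch {
>
>     private final long windowMillis;                  // e.g. 12 hours
>     private long olderWindowStart;                    // start of the trailing window
>     private NavigableMap<Long, byte[]> olderMemtable = new TreeMap<>();
>     private NavigableMap<Long, byte[]> newerMemtable = new TreeMap<>();
>
>     DualWindowRoutingSketch(long windowMillis, long nowMillis) {
>         this.windowMillis = windowMillis;
>         this.olderWindowStart = nowMillis - (nowMillis % windowMillis);
>     }
>
>     void write(long cellTimestamp, byte[] value, long nowMillis) {
>         rollIfNeeded(nowMillis);
>         // Route by the cell's own timestamp: anything up to one window late
>         // still lands in the (not yet flushed) older memtable.
>         if (cellTimestamp < olderWindowStart + windowMillis)
>             olderMemtable.put(cellTimestamp, value);
>         else
>             newerMemtable.put(cellTimestamp, value);
>     }
>
>     private void rollIfNeeded(long nowMillis) {
>         // Only once "now" enters a third window does the older memtable get
>         // flushed; until then it keeps absorbing late deliveries.
>         while (nowMillis >= olderWindowStart + 2 * windowMillis) {
>             flush(olderMemtable, olderWindowStart);
>             olderMemtable = newerMemtable;
>             newerMemtable = new TreeMap<>();
>             olderWindowStart += windowMillis;
>         }
>     }
>
>     private void flush(NavigableMap<Long, byte[]> memtable, long windowStart) {
>         System.out.printf("flushing window starting at %d (%d cells)%n",
>                           windowStart, memtable.size());
>     }
>
>     public static void main(String[] args) {
>         long twelveHours = TimeUnit.HOURS.toMillis(12);
>         DualWindowRoutingSketch table = new DualWindowRoutingSketch(twelveHours, 0);
>         // One on-time write and one that arrives 10 hours late: each lands in
>         // the memtable that owns its timestamp, so neither fragments a window.
>         table.write(TimeUnit.HOURS.toMillis(13), new byte[]{1}, TimeUnit.HOURS.toMillis(13));
>         table.write(TimeUnit.HOURS.toMillis(5),  new byte[]{2}, TimeUnit.HOURS.toMillis(15));
>         // Moving into the third 12-hour window finally flushes the first one.
>         table.write(TimeUnit.HOURS.toMillis(25), new byte[]{3}, TimeUnit.HOURS.toMillis(25));
>     }
> }
> {code}
> In this toy run, the write that shows up 10 hours late still ends up in the
> memtable that owns its window, so it is flushed into the right sstable with no
> extra fragmentation.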
> Anything that triggers compactions will have to be looked at, since there
> won't be any. The one concern I have is the ramifications for repair.
> Initially, at least, I think it would be acceptable to just write one sstable
> per repair and not bother trying to merge it with other sstables.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)