[
https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158394#comment-14158394
]
Björn Hegerfors commented on CASSANDRA-6602:
--------------------------------------------
Yes, forcing all buckets but the current one to always contain a single SSTable
(unless compactions fall behind) sounds good. It does raise the question of how
multiple SSTables can end up in the same non-current target in the first place;
repairs and hinted handoff come to mind. Should we still always compact in that
case? We wouldn't want to compact a huge SSTable with some small new chunk too
often. I suppose that's unlikely to happen (maybe never?), as the biggest
SSTables are the oldest and not much will be going on with them.
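To make the rule concrete, here is a minimal sketch (my own illustration, not
the current patch) of what the bucket choice would look like: older buckets get
compacted whenever they hold more than one SSTable, while the current bucket
still waits for min_compaction_threshold.

    import java.util.List;

    public class BucketChooser
    {
        /** Buckets are ordered newest first; returns the bucket to compact, or null. */
        public static <T> List<T> chooseBucket(List<List<T>> bucketsNewestFirst, int minThreshold)
        {
            for (int i = 0; i < bucketsNewestFirst.size(); i++)
            {
                List<T> bucket = bucketsNewestFirst.get(i);
                boolean isCurrent = (i == 0);
                // current bucket: behave as today; older buckets: any pair is worth merging
                if (isCurrent ? bucket.size() >= minThreshold : bucket.size() > 1)
                    return bucket;
            }
            return null; // nothing worth compacting
        }
    }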
Then you raise an idea that I haven't thought about before. You mention doing
STCS in the current bucket. That's not what DTCS does now. Rather, it just
compacts as soon as min_compaction_threshold SSTables are in there, regardless
of any other criteria. In a sense, the time_unit option mirrors the behavior of
the min_sstable_size option in STCS, where anything smaller than that goes into
the same bucket regardless of actual size similarity. Of course, this means
that time_unit has to be chosen properly. I'd say you can't really go wrong by
setting it too small, but setting it too large might be bad. I believe that you
should pick
time_unit by the same criteria that you would pick min_sstable_size. If chosen
properly, you won't really have a reason to do STCS in the current bucket. It's
hard to set a default for time_unit, since it depends very much on how fast
data comes in. Essentially, you can view time_unit as the age below which
SSTables get compacted eagerly. Maybe there's a better name for it, like the
obvious min_sstable_age?
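For illustration, a tiny sketch (assumed names, not the DTCS internals) of how
time_unit would act like min_sstable_size: anything younger than time_unit
falls into the current bucket and gets compacted eagerly, while older SSTables
are bucketed purely by age (here with doubling windows, just as an example).

    public class AgeBucketing
    {
        /** Returns 0 for the current bucket, otherwise an index that grows with age. */
        public static int bucketFor(long minTimestampMicros, long nowMicros, long timeUnitMicros)
        {
            long age = nowMicros - minTimestampMicros;
            if (age < timeUnitMicros)
                return 0; // younger than time_unit: current bucket, compact eagerly
            // older data: each bucket spans twice as much time as the previous one
            return 64 - Long.numberOfLeadingZeros(age / timeUnitMicros);
        }
    }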
> Compaction improvements to optimize time series data
> ----------------------------------------------------
>
> Key: CASSANDRA-6602
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6602
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Tupshin Harper
> Assignee: Björn Hegerfors
> Labels: compaction, performance
> Fix For: 3.0
>
> Attachments: 1 week.txt, 8 weeks.txt, STCS 16 hours.txt,
> TimestampViewer.java,
> cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt,
> cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt,
> cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v3.txt
>
>
> There are some unique characteristics of many/most time series use cases that
> both pose challenges and provide unique opportunities for optimization.
> One of the major challenges is in compaction. The existing compaction
> strategies will tend to re-compact data on disk at least a few times over the
> lifespan of each data point, greatly increasing the CPU and I/O costs of that
> write.
> Compaction exists to
> 1) ensure that there aren't too many files on disk
> 2) ensure that data that should be contiguous (part of the same partition) is
> laid out contiguously
> 3) delete data due to TTLs or tombstones
> The special characteristics of time series data allow us to optimize away all
> three.
> Time series data
> 1) tends to be delivered in time order, with relatively constrained exceptions
> 2) often has a pre-determined and fixed expiration date
> 3) never gets deleted prior to TTL
> 4) has relatively predictable ingestion rates
> Note that I filed CASSANDRA-5561, and this ticket potentially replaces or
> lowers the need for it. In that ticket, jbellis reasonably asks how that
> compaction strategy is better than disabling compaction.
> Taking that to heart, here is a compaction-strategy-less approach that could
> be extremely efficient for time-series use cases that follow the above
> pattern.
> (For context, I'm thinking of an example use case involving lots of streams
> of time-series data with a 5GB per day ingestion rate, and a 1000 day
> retention with TTL, resulting in an eventual steady state of 5TB per node)
> 1) You have an extremely large memtable (preferably off heap, if/when doable)
> for the table, and that memtable is sized to be able to hold a lengthy window
> of time. A typical period might be one day. At the end of that period, you
> flush the contents of the memtable to an sstable and move to the next one.
> This is basically identical to current behaviour, but with thresholds
> adjusted so that you can ensure flushing at predictable intervals. (An open
> question is whether predictable intervals are actually necessary, or whether
> just waiting until the huge memtable is nearly full is sufficient.)
> 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be
> efficiently dropped once all of their columns have expired; a rough sketch of
> such a whole-sstable drop check follows after this list. (Another side note:
> it might be valuable to have a modified version of CASSANDRA-3974 that doesn't
> bother storing per-column TTL, since it is required that all columns have the
> same TTL.)
> 3) Be able to mark column families as read/write only (no explicit deletes),
> so no tombstones.
> 4) Optionally add back an additional type of delete that would delete all
> data earlier than a particular timestamp, resulting in immediate dropping of
> obsoleted sstables.
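> As a rough sketch of the whole-sstable drop check behind (2) and (4) (names
> and structure are assumptions for illustration, not actual Cassandra code): an
> sstable can be removed outright once its newest cell timestamp is older than
> the TTL horizon or an explicit cutoff.
>
>     import java.util.ArrayList;
>     import java.util.List;
>     import java.util.concurrent.TimeUnit;
>
>     public class ExpiredSSTableCheck
>     {
>         /** maxTimestamps holds each sstable's newest cell timestamp, in microseconds. */
>         public static List<Integer> fullyExpired(List<Long> maxTimestamps, long ttlSeconds, long nowMicros)
>         {
>             long cutoff = nowMicros - TimeUnit.SECONDS.toMicros(ttlSeconds);
>             List<Integer> droppable = new ArrayList<>();
>             for (int i = 0; i < maxTimestamps.size(); i++)
>                 if (maxTimestamps.get(i) < cutoff)
>                     droppable.add(i); // every cell in this sstable is expired; drop the whole file
>             return droppable;
>         }
>     }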
> The result is that for in-order delivered data, every cell will be laid out
> optimally on disk on the first pass, and over the course of 1000 days and 5TB
> of data, there will "only" be 1000 5GB sstables, so the number of filehandles
> will be reasonable.
> For exceptions (out-of-order delivery), most cases will be caught by the
> extended (24 hour+) memtable flush times and merged correctly automatically.
> For those that were slightly askew at flush time, or were delivered so far
> out of order that they go in the wrong sstable, there is relatively low
> overhead to reading from two sstables for a time slice instead of one. That
> overhead would be incurred relatively rarely, unless out-of-order delivery was
> the common case, in which case this strategy should not be used.
> Another possible optimization to address out-of-order delivery would be to
> maintain more than one time-centric memtable in memory at a time (e.g. two
> 12-hour ones), and then you always insert into whichever one of the two "owns"
> the appropriate range of time. By delaying the flush of the earlier one until
> we are ready to roll writes over to a third one, we are able to avoid any
> fragmentation as long as all deliveries come in no more than 12 hours late
> (in this example; presumably tunable).
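> A toy sketch of that two-window routing (assumed structure, not Cassandra's
> memtable code): writes go to whichever active window owns the cell's
> timestamp, and a window is only flushed when a third window has to open.
>
>     import java.util.HashMap;
>     import java.util.Map;
>     import java.util.TreeMap;
>
>     public class TimeWindowedBuffers
>     {
>         private final long windowMicros;
>         // window start -> buffered writes for that window (at most two windows stay in memory)
>         private final TreeMap<Long, Map<String, String>> windows = new TreeMap<>();
>
>         public TimeWindowedBuffers(long windowMicros) { this.windowMicros = windowMicros; }
>
>         public void write(long timestampMicros, String key, String value)
>         {
>             long windowStart = timestampMicros - (timestampMicros % windowMicros);
>             Map<String, String> buffer = windows.get(windowStart);
>             if (buffer == null)
>             {
>                 buffer = new HashMap<>();
>                 windows.put(windowStart, buffer);
>             }
>             buffer.put(key, value);
>             // keep only the two newest windows in memory; roll the oldest out once a third opens
>             while (windows.size() > 2)
>                 flush(windows.firstKey());
>         }
>
>         private void flush(long windowStart)
>         {
>             Map<String, String> contents = windows.remove(windowStart);
>             System.out.println("flushing window at " + windowStart + " with " + contents.size() + " cells");
>         }
>     }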
> Anything that triggers compactions will have to be looked at, since there
> won't be any. The one concern I have is the ramifications of repair.
> Initially, at least, I think it would be acceptable to just write one sstable
> per repair and not bother trying to merge it with other sstables.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)