[
https://issues.apache.org/jira/browse/CASSANDRA-12201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15854024#comment-15854024
]
Pedro Gordo commented on CASSANDRA-12201:
-----------------------------------------
I was unable to start this last year: just as I was about to, I suffered a
wrist injury that prevented me from working for more than six months. I'm now
resuming work on this, although I'll still spend a few days getting up to speed
with C*.
I studied the data structures of Cassandra 2.0, but from what I know there
were significant changes in 3.0, so I now need to decide which version to
work on. Please let me know your opinion on this.
> Burst Hour Compaction Strategy
> ------------------------------
>
> Key: CASSANDRA-12201
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12201
> Project: Cassandra
> Issue Type: New Feature
> Components: Compaction
> Reporter: Pedro Gordo
> Original Estimate: 1,008h
> Remaining Estimate: 1,008h
>
> Although it may be subject to changes, for the moment I plan to create a
> strategy that will revolve around taking advantage of periods of the day
> where there's less I/O on the cluster. This time of the day will be called
> “Burst Hour” (BH), and hence the strategy will be named “Burst Hour
> Compaction Strategy” (BHCS).
> The following process would be fired during BH:
> 1. Read all the SSTables and detect which partition keys are present in more
> SSTables than a configurable value, which I'll call
> referenced_sstable_limit. This value will be three by default.
> 2. Group all the repeated keys with a reference to the SSTables containing
> them.
> 3. Calculate the total size of the SSTables which will be merged for the
> first partition key on the list created in step 2. If the calculated size is
> bigger than a property which I'll call max_sstable_size (also configurable),
> more than one table will be created in step 4.
> 4. During the merge, the data will be streamed from SSTables up to a point
> when we have a size close to max_sstable_size. After we reach this point, the
> stream is paused, and the new SSTable will be closed, becoming immutable.
> Repeat the streaming process until we've merged all tables for the partition
> key that we're iterating.
> 5. Cycle through the rest of the collection created in step 2 and remove any
> SSTables which don't exist anymore because they were merged in step 4. An
> alternative course of action here would be, instead of removing the SSTable
> from the collection, to change its reference to the SSTable(s) which were
> created in step 4.
> 6. Repeat steps 3 to 5 until we have traversed the entirety of the
> collection created in step 2.
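Steps 1 and 2 could be sketched roughly as below. This is only an
illustration in Python, not Cassandra code: the function name
group_repeated_keys and the dictionary used to stand in for SSTables are
assumptions made for the example, and a real implementation would read the
partition keys from the SSTable index rather than from in-memory sets.

```python
from collections import defaultdict


def group_repeated_keys(sstables, referenced_sstable_limit=3):
    """Return {partition_key: [sstable names]} for every key that appears in
    more than `referenced_sstable_limit` SSTables (steps 1 and 2).

    `sstables` is a hypothetical stand-in: a dict mapping an SSTable name to
    the set of partition keys it contains.
    """
    references = defaultdict(set)
    for name, keys in sstables.items():
        for key in keys:
            references[key].add(name)
    # Keep only the keys spread over more SSTables than the limit allows.
    return {
        key: sorted(names)
        for key, names in references.items()
        if len(names) > referenced_sstable_limit
    }


# Toy data: key "a" lives in four SSTables, key "b" in two, key "c" in one.
sstables = {
    "sst-1": {"a", "b"},
    "sst-2": {"a", "c"},
    "sst-3": {"a", "b"},
    "sst-4": {"a"},
}
repeated = group_repeated_keys(sstables)
print(repeated)
```

With the default limit of three, only key "a" qualifies for compaction in
this toy data set.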
> This strategy addresses three issues of the existing compaction strategies:
> - Due to max_sstable_size, there's no need to reserve disc space for a
> huge compaction, as can happen with STCS.
> - The number of SSTables that we need to read from to reply to a read query
> will be consistently maintained at a low level and controllable through the
> referenced_sstable_limit property. This addresses the scenario of STCS when
> we might have to read from a lot of SSTables.
> - It removes LCS's dependency on continuously high I/O.
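The size-capped merge of steps 3 and 4 could look something like the sketch
below. It is a simplification under stated assumptions: merge_with_size_cap
and the (row_id, size) tuples are hypothetical, rows are simply concatenated
instead of being reconciled by timestamp as a real merge would do, and an
output table is closed just before it would exceed max_sstable_size.

```python
def merge_with_size_cap(inputs, max_sstable_size):
    """Stream rows from the input tables into new output tables, closing the
    current output whenever the next row would push it past the cap.

    `inputs` is a hypothetical stand-in: a list of tables, each a list of
    (row_id, size_in_bytes) tuples. Returns a list of lists of row ids.
    """
    outputs = []
    current, current_size = [], 0
    for table in inputs:
        for row_id, size in table:
            # Close the current output before it would exceed the cap.
            if current and current_size + size > max_sstable_size:
                outputs.append(current)
                current, current_size = [], 0
            current.append(row_id)
            current_size += size
    if current:
        outputs.append(current)
    return outputs


# Three input tables holding 180 bytes in total, merged with a 100-byte cap.
inputs = [
    [("r1", 40), ("r2", 40)],
    [("r3", 40)],
    [("r4", 30), ("r5", 30)],
]
merged = merge_with_size_cap(inputs, max_sstable_size=100)
print(merged)
```

Because each output is closed at roughly max_sstable_size, the extra disc
space needed while writing is bounded by that property rather than by the
combined size of the inputs. A single row larger than the cap would still get
a table of its own in this sketch.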
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)