[jira] [Commented] (CASSANDRA-12201) Burst Hour Compaction Strategy

Pedro Gordo (JIRA) Mon, 06 Feb 2017 05:54:57 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-12201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15854024#comment-15854024
 ]


Pedro Gordo commented on CASSANDRA-12201:
-----------------------------------------

I was unable to start this last year: just as I was about to begin, I suffered 
a wrist injury that prevented me from working for more than six months. I'm now 
resuming work on this, although I'll still need a few days to get up to speed 
with C*.

I studied the data structures of Cassandra 2.0 but, from what I know, there 
were significant changes in 3.0, so I now need to decide which version to 
work on. Please let me know your opinion on this.

> Burst Hour Compaction Strategy
> ------------------------------
>
>                 Key: CASSANDRA-12201
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12201
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Compaction
>            Reporter: Pedro Gordo
>   Original Estimate: 1,008h
>  Remaining Estimate: 1,008h
>
> Although it may be subject to changes, for the moment I plan to create a 
> strategy that will revolve around taking advantage of periods of the day 
> where there's less I/O on the cluster. This time of the day will be called 
> “Burst Hour” (BH), and hence the strategy will be named “Burst Hour 
> Compaction Strategy” (BHCS). 
> The following process would be fired during BH:
> 1. Read all the SSTables and detect which partition keys are present in more 
> SSTables than a configurable value which I'll call referenced_sstable_limit. 
> This value will be three by default.
> 2. Group all the repeated keys with a reference to the SSTables containing 
> them.
> 3. Calculate the total size of the SSTables which will be merged for the 
> first partition key on the list created in step 2. If the calculated size is 
> bigger than a property which I'll call max_sstable_size (also configurable), 
> more than one SSTable will be created in step 4.
> 4. During the merge, the data will be streamed from SSTables up to a point 
> when we have a size close to max_sstable_size. After we reach this point, the 
> stream is paused, and the new SSTable will be closed, becoming immutable. 
> Repeat the streaming process until we've merged all tables for the partition 
> key that we're iterating.
> 5. Cycle through the rest of the collection created in step 2 and remove any 
> SSTables which no longer exist because they were merged in step 4. An 
> alternative course of action here would be, instead of removing the 
> SSTable from the collection, to change its reference to the SSTable(s) which 
> were created in step 4. 
> 6. Repeat steps 3 to 5 until we have traversed the entirety of the 
> collection created in step 2.
> This strategy addresses three issues of the existing compaction strategies:
> - Due to max_sstable_size, there's no need to reserve disk space for a 
> huge compaction, as can happen with STCS.
> - The number of SSTables that we need to read from to reply to a read query 
> will be consistently maintained at a low level and controllable through the 
> referenced_sstable_limit property. This addresses the STCS scenario in 
> which we might have to read from a lot of SSTables.
> - It removes the dependency on continuously high I/O that LCS has.
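
The process quoted above can be sketched as a toy model in Python. This is only an illustration under loud assumptions, not Cassandra code: an SSTable is modelled as a dict mapping partition key to size in bytes, sizes of the same key simply add up when merged (no overwrites or tombstones), a partition is never split across output SSTables, and the helper names (find_hot_keys, merge_sstables, burst_hour_compaction) and the "bhcs-" naming are hypothetical. It follows the first variant of step 5 (merged SSTables are dropped from the collection rather than re-referenced).

```python
from collections import defaultdict


def find_hot_keys(sstables, referenced_sstable_limit=3):
    """Steps 1-2: map each partition key present in more than
    referenced_sstable_limit SSTables to the names of those SSTables."""
    refs = defaultdict(set)
    for name, partitions in sstables.items():
        for key in partitions:
            refs[key].add(name)
    return {k: v for k, v in refs.items() if len(v) > referenced_sstable_limit}


def merge_sstables(sstables, names, max_sstable_size, prefix):
    """Steps 3-4: merge the named SSTables, closing an output SSTable
    whenever adding the next partition would exceed max_sstable_size."""
    merged = defaultdict(int)
    for name in names:
        # Toy assumption: sizes of the same partition key add up.
        for key, size in sstables.pop(name).items():
            merged[key] += size

    outputs, current, current_size = [], {}, 0
    for key in sorted(merged):
        size = merged[key]
        # A partition is never split, so an oversized partition
        # still ends up alone in its own output SSTable.
        if current and current_size + size > max_sstable_size:
            outputs.append(current)
            current, current_size = {}, 0
        current[key] = size
        current_size += size
    if current:
        outputs.append(current)

    new_names = []
    for i, out in enumerate(outputs):
        new_name = f"{prefix}-{i}"  # hypothetical naming scheme
        sstables[new_name] = out
        new_names.append(new_name)
    return new_names


def burst_hour_compaction(sstables, referenced_sstable_limit=3,
                          max_sstable_size=100):
    """Steps 3-6: walk the collection built in step 2 and merge per key.
    Mutates and returns the sstables dict."""
    hot = find_hot_keys(sstables, referenced_sstable_limit)
    generation = 0
    for key in sorted(hot):
        # Step 5 (first variant): forget SSTables merged in an earlier pass.
        live = sorted(n for n in hot[key] if n in sstables)
        if len(live) <= referenced_sstable_limit:
            continue
        merge_sstables(sstables, live, max_sstable_size, f"bhcs-{generation}")
        generation += 1
    return sstables


if __name__ == "__main__":
    tables = {
        "a": {"k1": 10, "k2": 5},
        "b": {"k1": 10},
        "c": {"k1": 10, "k3": 5},
        "d": {"k1": 10},
        "e": {"k2": 5},
    }
    print(burst_hour_compaction(tables, referenced_sstable_limit=3,
                                max_sstable_size=30))
```

In this example only k1 appears in more than three SSTables, so a, b, c and d are merged while e is left untouched. The sketch ignores everything a real strategy would have to handle, such as detecting the Burst Hour itself, streaming from disk, and concurrent writes.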



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
