[ 
https://issues.apache.org/jira/browse/CASSANDRA-12201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pedro Gordo updated CASSANDRA-12201:
------------------------------------
    Description: 
The motivation for this strategy is to take advantage of periods of the day 
when there is less I/O on the cluster. Such a period will be called 
“Burst Hour” (BH), hence the name “Burst Hour Compaction 
Strategy” (BHCS). 
The following process would be triggered during BH:

1. Read all the SSTables and detect which partition keys are present in more 
SSTables than the minimum compaction threshold.

2. Gather all the SSTables that have keys present in other SSTables, where the 
number of SSTables holding a key is at least the minimum compaction threshold. 

3. Repeat step 2 until the bucket for gathered SSTables reaches the maximum 
compaction threshold (32 by default), or until we've searched all the keys.

4. The compaction itself will be done by MaxSSTableSizeWriter. The 
compacted SSTables will have a maximum size equal to the configurable value of 
max_sstable_size. 
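
Steps 1 to 3 above can be sketched roughly as follows. This is only an 
illustration under stated assumptions, not the proposed implementation: the 
function name select_bucket, the dict-of-sets representation of SSTables, and 
the default minimum threshold of 4 are all hypothetical, and the sketch follows 
the "at least the minimum threshold" reading of step 2.

```python
from collections import defaultdict


def select_bucket(sstables, min_threshold=4, max_threshold=32):
    """Pick SSTables to compact together.

    sstables: hypothetical mapping of SSTable name -> set of partition keys.
    Returns a list of SSTable names, at most max_threshold long.
    """
    # Step 1: record, for every partition key, which SSTables contain it.
    key_to_tables = defaultdict(set)
    for name, keys in sstables.items():
        for key in keys:
            key_to_tables[key].add(name)

    # Steps 2 and 3: walk the keys and gather the SSTables holding any key
    # that is spread over at least min_threshold SSTables. Stop when the
    # bucket is full or when every key has been examined.
    bucket = []
    for key in sorted(key_to_tables):  # sorted only to make the result deterministic
        holders = key_to_tables[key]
        if len(holders) < min_threshold:
            continue
        for name in sorted(holders):
            if name not in bucket:
                bucket.append(name)
                if len(bucket) >= max_threshold:
                    return bucket
    return bucket
```

The returned bucket would then be handed to the size-capped writer described 
in step 4.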

The maximum compaction task (nodetool compact command) performs exactly the same 
operation as the background compaction task, except that it can be 
triggered outside of the Burst Hour.

This strategy tries to address three issues of the existing compaction 
strategies:
- Due to max_sstable_size_limit, there's no need to reserve disk space for a 
huge compaction.
- The number of SSTables that must be read to answer a read query 
will be kept consistently low and is controllable through the 
referenced_sstable_limit property.
- It removes the dependency on continuously high I/O.

Possible future improvements:
- Continuously evaluate the number of pending compactions and the I/O status, 
and based on that decide whether to start the compaction.
- If, during the day, the total size of the SSTables in a family set reaches a 
certain maximum, then background compaction can occur anyway. This maximum 
should be set high due to the high CPU usage of BHCS.
- Make it possible to set several compaction time intervals, instead of just 
one.
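
As an illustration of the last item, a check against several Burst Hour 
intervals might look like the sketch below. The function name in_burst_hour 
and the representation of intervals as (start, end) pairs of datetime.time 
values are assumptions made for this example; the ticket does not specify how 
the intervals would be configured.

```python
from datetime import time


def in_burst_hour(now, windows):
    """Return True if `now` (a datetime.time) lies inside any window.

    windows: hypothetical list of (start, end) datetime.time pairs.
    The start is inclusive and the end is exclusive.
    """
    for start, end in windows:
        if start <= end:
            # Ordinary window within a single day, e.g. 01:00 to 03:00.
            if start <= now < end:
                return True
        else:
            # Window that wraps past midnight, e.g. 23:00 to 00:30.
            if now >= start or now < end:
                return True
    return False
```

Background compaction would only be scheduled when this check passes, while a 
manually triggered compaction would bypass it.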


> Burst Hour Compaction Strategy
> ------------------------------
>
>                 Key: CASSANDRA-12201
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12201
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Compaction
>            Reporter: Pedro Gordo
>   Original Estimate: 1,008h
>  Remaining Estimate: 1,008h



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
