[
https://issues.apache.org/jira/browse/CASSANDRA-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15424092#comment-15424092
]
Marcus Eriksson commented on CASSANDRA-10540:
---------------------------------------------
These "benchmarks" have been run using cassandra-stress with
[this|https://paste.fscking.us/display/jKc0X89MLFzHE9jhRqQ5xfvRHeU] yaml (only
modified per run with the different compaction configurations).
cassandra-stress generates 40GB of data and then it compacts those sstables
using 8 threads. All tests were run with 256 tokens on my machine (2 ssds, 32GB
ram):
{code}
./tools/bin/compaction-stress write -d /var/lib/cassandra -d /home/marcuse/cassandra -g 40 -p blogpost-range.yaml -t 4 -v 256
./tools/bin/compaction-stress compact -d /var/lib/cassandra -d /home/marcuse/cassandra -p blogpost-range.yaml -t 8 -v 256
{code}
First a baseline: it takes about 7 minutes to compact 40GB of data with STCS,
and we get a write amplification (compaction bytes written / size before) of
about 1.46 (a quick check of that number follows the table below).
* 40GB + STCS
||size before (bytes)||size after (bytes)||compaction bytes written||time (mm:ss)||number of compactions||
|42986704571|31305948786|62268272752|7:44|26|
|43017694284|31717603488|62800073327|7:04|26|
|42863193047|31244649872|64673778727|6:44|26|
|42962733336|31842455113|62985984309|6:14|26|
|43107421526|32526047125|61657717328|6:04|26|
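A quick standalone check of that baseline figure, averaging compaction bytes written / size before over the five STCS runs in the table (plain Java, nothing from the patch):
{code}
// Write amplification = compaction bytes written / size before,
// averaged over the five STCS baseline runs above.
public class BaselineWriteAmplification {
    public static void main(String[] args) {
        long[] sizeBefore = {42986704571L, 43017694284L, 42863193047L, 42962733336L, 43107421526L};
        long[] bytesWritten = {62268272752L, 62800073327L, 64673778727L, 62985984309L, 61657717328L};
        double sum = 0;
        for (int i = 0; i < sizeBefore.length; i++) {
            sum += (double) bytesWritten[i] / sizeBefore[i];
        }
        // prints ~1.46, matching the baseline figure quoted above
        System.out.printf("average write amplification: %.2f%n", sum / sizeBefore.length);
    }
}
{code}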
With range aware compaction and a small min_range_sstable_size_in_mb we compact
more slowly, taking about 2x the time, but the end result is smaller, with slightly
lower write amplification (1.44). The reason for the longer time is that we have to
do a lot more tiny compactions per vnode (some rough per-vnode arithmetic follows
the table below). The reason for the smaller size after the compactions is that we
are much more likely to compact overlapping sstables together as we compact within
each vnode.
* 40GB + STCS + range_aware + min_range_sstable_size_in_mb: 1
||size before (bytes)||size after (bytes)||compaction bytes written||time (mm:ss)||number of compactions||
|42944940703|25352795435|61734295478|13:18|286|
|42896304174|25830662102|62049066195|15:45|287|
|43091495756|24811367911|61448601743|12:25|287|
|42961529234|26275106863|63118850488|13:17|284|
|42902111497|25749453764|61529524300|13:54|280|
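To put the roughly tenfold jump in compaction count in perspective, some rough per-vnode arithmetic (plain Java, not part of the patch):
{code}
// 40GB spread over 256 vnode ranges is ~160MB per range, so with
// min_range_sstable_size_in_mb: 1 essentially every range ends up with its
// own per-range STCS instance and its own stream of small compactions
// (~285 compactions above versus 26 for the STCS baseline).
public class PerRangeArithmetic {
    public static void main(String[] args) {
        long totalBytes = 40L * 1024 * 1024 * 1024;
        int vnodes = 256;
        long perRangeMb = totalBytes / vnodes / (1024 * 1024);
        System.out.printf("data per vnode range: ~%d MB%n", perRangeMb); // ~160 MB
    }
}
{code}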
As we increase min_range_sstable_size_in_mb, the time spent is reduced, the
size after compaction increases, and the number of compactions drops, since we
don't promote sstables to the per-vnode strategies as quickly. With a large
enough min_range_sstable_size_in_mb the behaviour will be the same as STCS
(plus a small overhead for estimating the size of the next vnode range during
compaction); a sketch of that threshold check follows the tables below.
* 40GB + STCS + range_aware + min_range_sstable_size_in_mb: 5
||size before (bytes)||size after (bytes)||compaction bytes written||time (mm:ss)||number of compactions||
|43071111106|27586259306|62855258024|10:35|172|
* 40GB + STCS + range_aware + min_range_sstable_size_in_mb: 10
||size before (bytes)||size after (bytes)||compaction bytes written||time (mm:ss)||number of compactions||
|42998501805|28281735688|65469323764|9:45|109|
* 40GB + STCS + range_aware + min_range_sstable_size_in_mb: 20
||size before (bytes)||size after (bytes)||compaction bytes written||time (mm:ss)||number of compactions||
|42801860659|28934194973|66554340039|10:05|48|
* 40GB + STCS + range_aware + min_range_sstable_size_in_mb: 50
||size before (bytes)||size after (bytes)||compaction bytes written||time (mm:ss)||number of compactions||
|42881416448|30352758950|61223610818|7:25|27|
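To illustrate the threshold check, a minimal sketch of the decision min_range_sstable_size_in_mb controls; the class and method names here are made up for illustration and are not the actual code in the patch:
{code}
// Illustrative sketch (invented names): while writing compaction output, the
// size of the next vnode range is estimated, and only if that estimate reaches
// min_range_sstable_size_in_mb is a separate per-range sstable cut and handed
// to that range's own strategy; otherwise the data stays in the default
// size-tiered pool. A very large threshold therefore degenerates to plain STCS.
public class RangeSplitDecision {
    static boolean cutSeparateRangeSstable(long estimatedBytesInNextRange, int minRangeSstableSizeMb) {
        return estimatedBytesInNextRange >= (long) minRangeSstableSizeMb * 1024 * 1024;
    }

    public static void main(String[] args) {
        long estimatedBytesInNextRange = 12L * 1024 * 1024; // say ~12MB estimated for the next range
        for (int thresholdMb : new int[]{1, 5, 10, 20, 50}) {
            System.out.printf("min_range_sstable_size_in_mb=%d -> separate per-range sstable: %b%n",
                    thresholdMb, cutSeparateRangeSstable(estimatedBytesInNextRange, thresholdMb));
        }
    }
}
{code}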
With LCS and a small sstable_size_in_mb we see a huge difference with range
aware compaction, due to the number of compactions needed to establish the
leveling without it. With range aware compaction we get fewer levels in each
vnode range, which is much quicker to compact (a back-of-the-envelope level
count follows the tables below). Write amplification is about 2.0 with range
aware and 3.4 without.
* 40GB + LCS + sstable_size_in_mb: 10 + range_aware + min_range_sstable_size_in_mb: 10
||size before (bytes)||size after (bytes)||compaction bytes written||time (mm:ss)||number of compactions||
|43170254812|26511935628|87637370434|19:55|903|
|43015904097|26100197485|83125478305|14:45|854|
|43188886684|25651102691|87520409116|19:55|920|
* 40GB + LCS + sstable_size_in_mb: 10
||size before (bytes)||size after (bytes)||compaction bytes written||time (mm:ss)||number of compactions||
|43099495889|23876144309|139000531662|28:25|3751|
|42811000078|24620085107|147722973544|30:35|3909|
|42879141849|24479485292|146194679395|30:46|3882|
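For intuition on the level counts, some back-of-the-envelope arithmetic (assuming the usual LCS fanout of 10 and level 1 sized at 10 sstables; this is a sketch, not code from the patch):
{code}
// Smallest number of levels whose combined capacity holds the data,
// with level 1 = fanout * sstable_size and each level fanout times bigger.
public class LcsLevels {
    static int levelsNeeded(long dataBytes, long sstableBytes, int fanout) {
        long levelCapacity = fanout * sstableBytes;
        long total = 0;
        int levels = 0;
        while (total < dataBytes) {
            total += levelCapacity;
            levelCapacity *= fanout;
            levels++;
        }
        return levels;
    }

    public static void main(String[] args) {
        long sstable = 10L * 1024 * 1024;            // sstable_size_in_mb: 10
        long wholeTable = 40L * 1024 * 1024 * 1024;  // leveling the whole 40GB
        long perVnodeRange = wholeTable / 256;       // ~160MB per vnode range
        System.out.println("levels for the whole table: " + levelsNeeded(wholeTable, sstable, 10));    // 4
        System.out.println("levels per vnode range:     " + levelsNeeded(perVnodeRange, sstable, 10)); // 2
    }
}
{code}
Under those assumptions the whole table needs around four levels while each vnode range needs only about two, which matches the intuition that the per-range leveling is much cheaper to establish.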
If we bump the LCS sstable_size_in_mb to the default (160) we get more similar
results. Write amplification is smaller with range aware compaction, but the
size after is also bigger. The reason for the bigger size once compaction has
settled is that we run with a bigger min_range_sstable_size_in_mb, which means
more data stays out of the per-range compaction strategies and is therefore
only size tiered. This probably also explains the reduced write amplification:
2.0 with range aware and 2.3 without (a worked check follows the tables below).
* 40GB + LCS + sstable_size_in_mb: 160 + range_aware + min_range_sstable_size_in_mb: 20
||size before (bytes)||size after (bytes)||compaction bytes written||time (mm:ss)||number of compactions||
|42970784099|27044941599|85933586287|12:55|180|
|42953512565|26229232777|82158863291|11:36|155|
|43028281629|26025950993|86704157660|11:25|177|
* 40GB + LCS + sstable_size_in_mb: 160
||size before (bytes)||size after (bytes)||compaction bytes written||time (mm:ss)||number of compactions||
|43120992697|24487560567|100347633105|12:25|151|
|42854926611|24466503628|102492898148|10:55|155|
|42919253642|24831918330|100902215961|12:15|161|
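The same write-amplification check as for the STCS baseline, applied to the two sstable_size_in_mb: 160 tables above:
{code}
// Average write amplification (compaction bytes written / size before)
// for the two LCS + sstable_size_in_mb: 160 configurations above.
public class LcsWriteAmplification {
    static double averageWa(long[] sizeBefore, long[] bytesWritten) {
        double sum = 0;
        for (int i = 0; i < sizeBefore.length; i++) {
            sum += (double) bytesWritten[i] / sizeBefore[i];
        }
        return sum / sizeBefore.length;
    }

    public static void main(String[] args) {
        long[] rangeAwareBefore = {42970784099L, 42953512565L, 43028281629L};
        long[] rangeAwareWritten = {85933586287L, 82158863291L, 86704157660L};
        long[] plainBefore = {43120992697L, 42854926611L, 42919253642L};
        long[] plainWritten = {100347633105L, 102492898148L, 100902215961L};
        System.out.printf("range aware: %.2f%n", averageWa(rangeAwareBefore, rangeAwareWritten)); // ~1.98
        System.out.printf("plain LCS:   %.2f%n", averageWa(plainBefore, plainWritten));           // ~2.36
    }
}
{code}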
> RangeAwareCompaction
> --------------------
>
> Key: CASSANDRA-10540
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10540
> Project: Cassandra
> Issue Type: New Feature
> Reporter: Marcus Eriksson
> Assignee: Marcus Eriksson
> Labels: compaction, lcs, vnodes
> Fix For: 3.x
>
>
> Broken out from CASSANDRA-6696, we should split sstables based on ranges
> during compaction.
> Requirements:
> * don't create tiny sstables - keep them bunched together until a single vnode is big enough (configurable how big that is)
> * make it possible to run existing compaction strategies on the per-range sstables
> We should probably add a global compaction strategy parameter that states
> whether this should be enabled or not.