[
https://issues.apache.org/jira/browse/CASSANDRA-12591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15456278#comment-15456278
]
Wei Deng edited comment on CASSANDRA-12591 at 9/2/16 10:32 PM:
---------------------------------------------------------------
So I've done some quick initial tests using the latest trunk (i.e. C* 3.10) code
to check whether this is a worthwhile effort. The hardware I'm using is still
not a typical or adequate configuration for a production Cassandra deployment
(GCE n1-standard-4, with 4 vCPUs, 15GB RAM and a single 1TB spindle-based
persistent disk), but I'm already seeing a positive sign that a bigger
max_sstable_size can help compaction throughput.
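For context, the max_sstable_size knob being varied is the LCS
{{sstable_size_in_mb}} table option, presumably set by editing the compaction
options in the stress profile's table definition. As a rough CQL sketch with
placeholder keyspace/table names (not the exact schema from the gist), the two
configurations compared below look like:
{code}
-- default 160MB target size (placeholder keyspace/table name, for illustration only)
ALTER TABLE blog.posts WITH compaction =
    {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};

-- the larger 1280MB target size used in the comparison runs
ALTER TABLE blog.posts WITH compaction =
    {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 1280};
{code}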
Based on the initial results: at each max_sstable_size I did three runs from
scratch. For all runs I set compaction threads to 4, and since compaction-stress
enforces no throttling, that is equivalent to setting
compaction_throughput_mb_per_sec to 0. The initial SSTable files generated by
{{compaction-stress write}} use the default 128MB size, which is in line with
the typical flush size I saw on this kind of hardware with the default
cassandra.yaml configuration parameters. Using 10GB of stress data generated
from the blogpost data model
[here|https://gist.githubusercontent.com/tjake/8995058fed11d9921e31/raw/a9334d1090017bf546d003e271747351a40692ea/blogpost.yaml],
the overall compaction times with 1280MB max_sstable_size are: 7m16.456s,
7m7.225s, 7m9.102s; with 160MB max_sstable_size they are: 9m16.715s, 9m28.146s,
9m7.192s.
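For reference, the write and compact phases map roughly onto invocations like
the following (a sketch, not the exact command lines from these runs; the data
directory is a placeholder and the {{time}} wrapper is shown just to illustrate
how the durations above were measured):
{code}
# generate ~10GB of SSTables from the blogpost profile (default 128MB buffer size)
tools/bin/compaction-stress write -d /tmp/compaction-test -g 10 \
    -p https://gist.githubusercontent.com/tjake/8995058fed11d9921e31/raw/a9334d1090017bf546d003e271747351a40692ea/blogpost.yaml

# compact with 4 compactor threads and no throttling, timing the whole run
time tools/bin/compaction-stress compact -d /tmp/compaction-test -t 4 \
    -p https://gist.githubusercontent.com/tjake/8995058fed11d9921e31/raw/a9334d1090017bf546d003e271747351a40692ea/blogpost.yaml
{code}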
Given these numbers, the average time to finish compaction is roughly 431
seconds with 1280MB max_sstable_size versus roughly 557 seconds with 160MB
max_sstable_size, which is already a ~23% improvement.
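Spelling out the arithmetic behind those averages:
{noformat}
1280MB: (436.456 + 427.225 + 429.102) / 3 ≈ 430.9 s
 160MB: (556.715 + 568.146 + 547.192) / 3 ≈ 557.4 s
1 - 430.9 / 557.4 ≈ 0.227  ->  ~23% less time with 1280MB
{noformat}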
The above tests were conducted with the default compaction-stress parameters,
which generate unique partitions for all writes, so they reflect the worst kind
of workload for LCS. Considering this, I also ran another set of tests with
{{"--partition-count=1000"}} to force compaction-stress to generate a lot of
overwrites of the same partitions. Keeping everything else the same and adding
the {{"--partition-count=1000"}} parameter, the overall compaction times with
1280MB max_sstable_size are: 4m59.307s, 4m52.002s, 5m0.967s; with 160MB
max_sstable_size they are: 6m11.533s, 6m21.200s, 6m10.904s. These numbers are
understandably faster than the "all unique partitions" scenario in the last
paragraph, and averaging them out, 1280MB max_sstable_size is 21% faster than
160MB max_sstable_size.
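The only change for this second set is in the write phase, roughly as follows
(same caveats as the sketch above):
{code}
# loop writes over only 1000 distinct partitions so later writes overwrite
# earlier ones, giving LCS overlapping data to merge away
tools/bin/compaction-stress write -d /tmp/compaction-test -g 10 --partition-count=1000 \
    -p https://gist.githubusercontent.com/tjake/8995058fed11d9921e31/raw/a9334d1090017bf546d003e271747351a40692ea/blogpost.yaml
{code}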
I realize 10GB of data is barely enough to exercise a 1280MB sstable size, as
the data will only go from L0 to L1, so for the next run I'm going to use a
100GB data size on this hardware (keeping everything else the same) and see how
the numbers compare.
> Re-evaluate the default 160MB sstable_size_in_mb choice in LCS
> --------------------------------------------------------------
>
> Key: CASSANDRA-12591
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12591
> Project: Cassandra
> Issue Type: Improvement
> Components: Compaction
> Reporter: Wei Deng
> Labels: lcs
>
> There has been some effort in CASSANDRA-5727 to benchmark and evaluate the
> best max_sstable_size for LeveledCompactionStrategy, and the conclusion from
> that effort was to use 160MB as the optimal size for both throughput (i.e.
> the time spent on compaction, the smaller the better) and the amount of bytes
> compacted (to avoid write amplification, the less the better).
> However, when I read more into that test report (the short
> [comment|https://issues.apache.org/jira/browse/CASSANDRA-5727?focusedCommentId=13722571&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13722571]
> describing the tests), I realized it was conducted on hardware with the
> following configuration: "a single rackspace node with 2GB of ram." I'm not
> sure whether that was an acceptable hardware configuration for a production
> Cassandra deployment at the time (mid-2013), but it is definitely far below
> today's hardware standards.
> Given that we now have compaction-stress, which can generate SSTables based
> on a user-defined stress profile with a user-defined table schema and
> compaction parameters (compatible with cassandra-stress), it would be a
> useful effort to revisit this number on a more realistic hardware
> configuration and see whether 160MB is still the optimal choice. It might
> also affect our perceived "practical" node density with LCS nodes if it turns
> out that a bigger max_sstable_size actually works better, as it would allow
> fewer SSTables (and hence fewer levels and less write amplification) per node
> at higher density.