[
https://issues.apache.org/jira/browse/CASSANDRA-12591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15456278#comment-15456278
]
Wei Deng edited comment on CASSANDRA-12591 at 9/2/16 10:32 PM:
---------------------------------------------------------------
So I've done some quick initial tests using the latest trunk (i.e. C* 3.10) code
to check whether this is a worthwhile effort. The hardware I'm using is still
not a typical or adequate configuration for a production Cassandra deployment
(GCE n1-standard-4, with 4 vCPUs, 15GB RAM and a single 1TB spindle-based
persistent disk), but I'm already seeing a positive sign that a bigger
max_sstable_size can help compaction throughput.
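For context, the max_sstable_size knob being varied is the LCS
{{sstable_size_in_mb}} table option, presumably set by editing the compaction
options in the stress profile's table definition. As a rough CQL sketch with
placeholder keyspace/table names (not the exact schema from the gist), the two
configurations compared below look like:
{code}
-- default 160MB target size (placeholder keyspace/table name, for illustration only)
ALTER TABLE blog.posts WITH compaction =
    {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};

-- the larger 1280MB target size used in the comparison runs
ALTER TABLE blog.posts WITH compaction =
    {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 1280};
{code}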
Based on the initial results: at each max_sstable_size I did three runs from
scratch. For all runs I set compaction threads to 4, and since compaction-stress
enforces no throttling, that is equivalent to setting
compaction_throughput_mb_per_sec to 0. The initial SSTable files generated by
{{compaction-stress write}} use the default 128MB size, which is in line with
the typical flush size I saw on this kind of hardware with the default
cassandra.yaml configuration parameters. Using 10GB of stress data generated
from the blogpost data model
[here|https://gist.githubusercontent.com/tjake/8995058fed11d9921e31/raw/a9334d1090017bf546d003e271747351a40692ea/blogpost.yaml],
the overall compaction times with 1280MB max_sstable_size are: 7m16.456s,
7m7.225s, 7m9.102s; with 160MB max_sstable_size they are: 9m16.715s, 9m28.146s,
9m7.192s.
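For reference, the write and compact phases map roughly onto invocations like
the following (a sketch, not the exact command lines from these runs; the data
directory is a placeholder and the {{time}} wrapper is shown just to illustrate
how the durations above were measured):
{code}
# generate ~10GB of SSTables from the blogpost profile (default 128MB buffer size)
tools/bin/compaction-stress write -d /tmp/compaction-test -g 10 \
    -p https://gist.githubusercontent.com/tjake/8995058fed11d9921e31/raw/a9334d1090017bf546d003e271747351a40692ea/blogpost.yaml

# compact with 4 compactor threads and no throttling, timing the whole run
time tools/bin/compaction-stress compact -d /tmp/compaction-test -t 4 \
    -p https://gist.githubusercontent.com/tjake/8995058fed11d9921e31/raw/a9334d1090017bf546d003e271747351a40692ea/blogpost.yaml
{code}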
Given these numbers, the average time to finish compaction is roughly 431
seconds with 1280MB max_sstable_size versus roughly 557 seconds with 160MB
max_sstable_size, which is already a ~23% improvement.
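Spelling out the arithmetic behind those averages:
{noformat}
1280MB: (436.456 + 427.225 + 429.102) / 3 ≈ 430.9 s
 160MB: (556.715 + 568.146 + 547.192) / 3 ≈ 557.4 s
1 - 430.9 / 557.4 ≈ 0.227  ->  ~23% less time with 1280MB
{noformat}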
The above tests were conducted with the default compaction-stress parameters,
which generate unique partitions for all writes, so they reflect the worst kind
of workload for LCS. Considering this, I also ran another set of tests with
{{"--partition-count=1000"}} to force compaction-stress to generate a lot of
overwrites of the same partitions. Keeping everything else the same and adding
the {{"--partition-count=1000"}} parameter, the overall compaction times with
1280MB max_sstable_size are: 4m59.307s, 4m52.002s, 5m0.967s; with 160MB
max_sstable_size they are: 6m11.533s, 6m21.200s, 6m10.904s. These numbers are
understandably faster than the "all unique partitions" scenario in the last
paragraph, and averaging them out, 1280MB max_sstable_size is 21% faster than
160MB max_sstable_size.
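The only change for this second set is in the write phase, roughly as follows
(same caveats as the sketch above):
{code}
# loop writes over only 1000 distinct partitions so later writes overwrite
# earlier ones, giving LCS overlapping data to merge away
tools/bin/compaction-stress write -d /tmp/compaction-test -g 10 --partition-count=1000 \
    -p https://gist.githubusercontent.com/tjake/8995058fed11d9921e31/raw/a9334d1090017bf546d003e271747351a40692ea/blogpost.yaml
{code}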
I realize 10GB of data is barely enough to exercise a 1280MB sstable size, as
the data will only go from L0 to L1, so for the next run I'm going to use a
100GB data size on this hardware (keeping everything else the same) and see how
the numbers compare.
> Re-evaluate the default 160MB sstable_size_in_mb choice in LCS
> --------------------------------------------------------------
>
> Key: CASSANDRA-12591
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12591
> Project: Cassandra
> Issue Type: Improvement
> Components: Compaction
> Reporter: Wei Deng
> Labels: lcs
>
> There has been some effort in CASSANDRA-5727 to benchmark and evaluate the
> best max_sstable_size for LeveledCompactionStrategy, and the conclusion from
> that effort was to use 160MB as the optimal size for both throughput (i.e.
> the time spent on compaction, the smaller the better) and the amount of bytes
> compacted (to avoid write amplification, the less the better).
> However, when I read more into that test report (the short
> [comment|https://issues.apache.org/jira/browse/CASSANDRA-5727?focusedCommentId=13722571&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13722571]
> describing the tests), I realized it was conducted on hardware with the
> following configuration: "a single rackspace node with 2GB of ram." I'm not
> sure whether that was an acceptable hardware configuration for a production
> Cassandra deployment at the time (mid-2013), but it is definitely far below
> today's hardware standards.
> Given that we now have compaction-stress, which can generate SSTables based
> on a user-defined stress profile with a user-defined table schema and
> compaction parameters (compatible with cassandra-stress), it would be a
> useful effort to revisit this number on a more realistic hardware
> configuration and see whether 160MB is still the optimal choice. It might
> also affect our perceived "practical" node density with LCS nodes if it turns
> out that a bigger max_sstable_size actually works better, as it would allow
> fewer SSTables (and hence fewer levels and less write amplification) per node
> at higher density.