[ https://issues.apache.org/jira/browse/CASSANDRA-15379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17088266#comment-17088266 ]

Joey Lynch edited comment on CASSANDRA-15379 at 4/21/20, 11:21 PM:
-------------------------------------------------------------------

*Zstd Defaults Benchmark:*
 * Load pattern: 1.2k wps and 1.2k rps at LOCAL_ONE consistency with a random 
load pattern.
 * Data sizing: ~100 million partitions with 2 rows each of 10 columns, for a 
total size per partition of about 4 KiB of random data; ~120 GiB of data per 
node (replicated 6 ways). A rough schema sketch follows after this list.
 * Compaction settings: LCS with size=320MiB, fanout=20
 * Compression: Zstd with 16 KiB block size
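
For concreteness, here is a minimal CQL sketch of a table matching this shape 
and these settings; the keyspace, table, and column names are invented for 
illustration and the exact schema is an assumption, not taken from the actual 
benchmark:
{noformat}
-- hypothetical schema: 2 clustering rows per partition, 10 ~200 byte value
-- columns, i.e. roughly 4 KiB of random data per partition as described above
CREATE TABLE test_ks.test_table (
    pkey text,
    ckey int,
    c0 blob, c1 blob, c2 blob, c3 blob, c4 blob,
    c5 blob, c6 blob, c7 blob, c8 blob, c9 blob,
    PRIMARY KEY (pkey, ckey)
) WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': '320', 'fanout_size': '20'}
  AND compression = {'class': 'org.apache.cassandra.io.compress.ZstdCompressor', 'chunk_length_in_kb': '16'};
{noformat}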

I had to tweak some settings to make compaction a smaller share of the overall 
trace (it was 50% or more of the traces, which was hiding the flush behavior). 
Specifically, I increased the size of the memtable before flush by raising the 
{{memtable_cleanup_threshold}} setting from 0.11 to 0.5, which allowed flushes 
to get up to 1.4 GiB, and I set compaction to defer as long as possible before 
doing the L0 -> L1 transition:
{noformat}
compaction = {'class': 'LeveledCompactionStrategy', 'fanout_size': '20', 'max_threshold': '128', 'min_threshold': '32', 'sstable_size_in_mb': '320'}
compression = {'chunk_length_in_kb': '16', 'class': 'org.apache.cassandra.io.compress.ZstdCompressor'}
{noformat}
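The memtable change above is a node-level cassandra.yaml setting rather than a 
table option; the override used for this benchmark is simply:
{noformat}
# cassandra.yaml -- let memtables grow larger before flushing
# (the previous value in this environment was 0.11)
memtable_cleanup_threshold: 0.5
{noformat}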
I would have preferred to raise fanout_size even further to defer compactions 
more, but with the larger memtable, larger sstable size, and higher fanout I 
was able to reduce the compaction load to the point where the cluster was 
stable (pending compactions not growing without bound) on both baseline and 
candidate.

*Zstd Defaults Benchmark Results*:

Candidate flushes were spaced about 4 minutes apart and took about 8 seconds to 
flush 1.4 GiB. Flamegraphs show ~50% of on-CPU time in the flush writer and 
~45% in compression. [^15379_candidate_flush_trace.png]

Baseline flushes were spaced about 4 minutes apart and took about 22 seconds to 
flush 1.4 GiB. Flamegraphs show ~20% of on-CPU time in the flush writer and 
~75% in compression. [^15379_baseline_flush_trace.png]

No significant change in coordinator-level or replica-level latency, or in 
system metrics. Some latencies were better on the candidate, some worse. 
[^15379_system_zstd_defaults.png] [^15379_coordinator_zstd_defaults.png] 
[^15379_replica_zstd_defaults.png]

I think the main finding here is that even with the cheapest Zstd level we are 
already running closer to the flush interval than I'd like (if a flush takes 
longer than the time until the next flush, it's bad news bears for the 
cluster), and this is with a relatively small number of writes per second (~400 
coordinator writes per second per node).
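
Back-of-envelope numbers from the measurements above (simple arithmetic, not 
additional measurements):
{noformat}
candidate: 1.4 GiB / 8 s  ~= 180 MiB/s effective flush throughput
baseline:  1.4 GiB / 22 s ~=  65 MiB/s effective flush throughput
flush interval ~= 240 s, so the baseline already spends ~9% of each interval
flushing at only ~400 coordinator writes/s per node; heavier Zstd levels or
higher write rates eat into that margin quickly.
{noformat}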

*Next steps:*

I've published a final squashed commit to:
||trunk||
|[657c39d4|https://github.com/jolynch/cassandra/commit/657c39d4aba0888c6db6a46d1b1febf899de9578]|
|[branch|https://github.com/apache/cassandra/compare/trunk...jolynch:CASSANDRA-15379-final]|
|[!https://circleci.com/gh/jolynch/cassandra/tree/CASSANDRA-15379-final.png?circle-token=1102a59698d04899ec971dd36e925928f7b521f5!|https://circleci.com/gh/jolynch/cassandra/tree/CASSANDRA-15379-final]|

There appear to be a lot of failures in the Java 8 runs that I'm pretty sure 
are unrelated to my change (unit tests and in-jvm dtests passed, along with the 
long unit tests). I'll look into all the failures and make sure they're 
unrelated (on a related note, I'm :( that trunk is so red again).

I am now running a test with Zstd compression set to a block size of 256 KiB 
and level 10, which is how we typically run it in production for write-mostly, 
read-rarely datasets such as trace data (for the significant reduction in disk 
space).
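
For reference, that configuration looks roughly like the following (the table 
name is illustrative; the level is set via ZstdCompressor's compression_level 
option from CASSANDRA-14482):
{noformat}
ALTER TABLE test_ks.trace_data
    WITH compression = {'class': 'org.apache.cassandra.io.compress.ZstdCompressor',
                        'chunk_length_in_kb': '256', 'compression_level': '10'};
{noformat}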



> Make it possible to flush with a different compression strategy than we 
> compact with
> ------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15379
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15379
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local/Compaction, Local/Config, Local/Memtable
>            Reporter: Joey Lynch
>            Assignee: Joey Lynch
>            Priority: Normal
>             Fix For: 4.0-alpha
>
>         Attachments: 15379_baseline_flush_trace.png, 
> 15379_candidate_flush_trace.png, 15379_coordinator_defaults.png, 
> 15379_coordinator_zstd_defaults.png, 15379_replica_defaults.png, 
> 15379_replica_zstd_defaults.png, 15379_system_defaults.png, 
> 15379_system_zstd_defaults.png
>
>
> [~josnyder] and I have been testing out CASSANDRA-14482 (Zstd compression) on 
> some of our most dense clusters and have been observing close to 50% 
> reduction in footprint with Zstd on some of our workloads! Unfortunately 
> though we have been running into an issue where the flush might take so long 
> (Zstd is slower to compress than LZ4) that we can actually block the next 
> flush and cause instability.
> Internally we are working around this with a very simple patch which flushes 
> SSTables with the default compression strategy (LZ4) regardless of the table 
> params. This is a simple solution, but I think the ideal solution might be 
> for the flush compression strategy to be configurable separately from the 
> table compression strategy (while defaulting to the same thing). Instead of 
> adding yet another compression option to the yaml (like hints and commitlog) 
> I was thinking of just adding it to the table parameters and then adding a 
> {{default_table_parameters}} yaml option like:
> {noformat}
> # Default table properties to apply on freshly created tables. The currently supported defaults are:
> # * compression       : How are SSTables compressed in general (flush, compaction, etc ...)
> # * flush_compression : How are SSTables compressed as they flush
> default_table_parameters:
>   compression:
>     class_name: 'LZ4Compressor'
>     parameters:
>       chunk_length_in_kb: 16
>   flush_compression:
>     class_name: 'LZ4Compressor'
>     parameters:
>       chunk_length_in_kb: 4
> {noformat}
> This would also have the nice effect of giving our configuration a path 
> forward for providing user-specified defaults for table creation (so, e.g., 
> if a particular user wanted to use a different default chunk_length_in_kb 
> they could do that).
> So the proposed (~mandatory) scope is:
> * Flush with a faster compression strategy
> I'd like to implement the following at the same time:
> * Per table flush compression configuration
> * Ability to default the table flush and compaction compression in the yaml.


