[ https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16656028#comment-16656028 ]
Ariel Weisberg commented on CASSANDRA-13241:
--------------------------------------------

For those who were asking about the performance impact of block size on compression, I wrote a microbenchmark.

https://pastebin.com/RHDNLGdC

[java] Benchmark                                               Mode   Cnt          Score          Error  Units
[java] CompactIntegerSequenceBench.benchCompressLZ4Fast16k     thrpt   15  331190055.685 ±  8079758.044  ops/s
[java] CompactIntegerSequenceBench.benchCompressLZ4Fast32k     thrpt   15  353024925.655 ±  7980400.003  ops/s
[java] CompactIntegerSequenceBench.benchCompressLZ4Fast64k     thrpt   15  365664477.654 ± 10083336.038  ops/s
[java] CompactIntegerSequenceBench.benchCompressLZ4Fast8k      thrpt   15  305518114.172 ± 11043705.883  ops/s
[java] CompactIntegerSequenceBench.benchDecompressLZ4Fast16k   thrpt   15  688369529.911 ± 25620873.933  ops/s
[java] CompactIntegerSequenceBench.benchDecompressLZ4Fast32k   thrpt   15  703635848.895 ±  5296941.704  ops/s
[java] CompactIntegerSequenceBench.benchDecompressLZ4Fast64k   thrpt   15  695537044.676 ± 17400763.731  ops/s
[java] CompactIntegerSequenceBench.benchDecompressLZ4Fast8k    thrpt   15  727725713.128 ±  4252436.331  ops/s

To summarize, compression is 8.5% slower and decompression is 1% faster. This measures only the cost of compression and decompression themselves, not the huge impact that would occur if we decompressed data we don't need less often.

I didn't test decompression of Snappy and LZ4 high, but I did test compression.
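
As a rough cross-check of the LZ4 fast figures above, the relative throughput at each chunk size can be worked out from the reported scores. This is only a sketch: the scores are copied from the benchmark output, but the 64k baseline and the helper name percent_change are assumptions, since the summary does not say which pair of chunk sizes it compares.

```python
# Throughput scores (ops/s) copied from the LZ4 fast benchmark output above,
# keyed by chunk size in KB.
COMPRESS = {8: 305518114.172, 16: 331190055.685, 32: 353024925.655, 64: 365664477.654}
DECOMPRESS = {8: 727725713.128, 16: 688369529.911, 32: 703635848.895, 64: 695537044.676}


def percent_change(scores, chunk_kb, baseline_kb=64):
    """Percent change in throughput at chunk_kb relative to baseline_kb.

    The 64k default baseline is an assumption (it is the current default
    chunk size named in the issue title), not something the comment states.
    """
    return (scores[chunk_kb] / scores[baseline_kb] - 1.0) * 100.0


if __name__ == "__main__":
    for kb in (8, 16, 32):
        print(
            f"{kb:>2}k vs 64k: "
            f"compress {percent_change(COMPRESS, kb):+.1f}%, "
            f"decompress {percent_change(DECOMPRESS, kb):+.1f}%"
        )
```

Negative values mean lower throughput than the baseline. The error column in the output is not taken into account here, so small differences (on the order of a few percent) may be within noise.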
Snappy:

[java] CompactIntegerSequenceBench.benchCompressSnappy16k  thrpt    2  196574766.116  ops/s
[java] CompactIntegerSequenceBench.benchCompressSnappy32k  thrpt    2  198538643.844  ops/s
[java] CompactIntegerSequenceBench.benchCompressSnappy64k  thrpt    2  194600497.613  ops/s
[java] CompactIntegerSequenceBench.benchCompressSnappy8k   thrpt    2  186040175.059  ops/s

LZ4 high compressor:

[java] CompactIntegerSequenceBench.bench16k  thrpt    2  20822947.578  ops/s
[java] CompactIntegerSequenceBench.bench32k  thrpt    2  12037342.253  ops/s
[java] CompactIntegerSequenceBench.bench64k  thrpt    2   6782534.469  ops/s
[java] CompactIntegerSequenceBench.bench8k   thrpt    2  32254619.594  ops/s

LZ4 high is the one instance where block size mattered a lot. It's a bit suspicious, really, when you look at the ratio of performance to block size being close to 1:1. I couldn't spot a bug in the benchmark though.

Compression ratios with LZ4 fast for the text of Alice in Wonderland were:

Chunk size 8192, ratio 0.709473
Chunk size 16384, ratio 0.667236
Chunk size 32768, ratio 0.634735
Chunk size 65536, ratio 0.607208

By way of comparison, I also ran deflate with maximum compression:

Chunk size 8192, ratio 0.426434
Chunk size 16384, ratio 0.402423
Chunk size 32768, ratio 0.381627
Chunk size 65536, ratio 0.364865

> Lower default chunk_length_in_kb from 64kb to 4kb
> -------------------------------------------------
>
>                 Key: CASSANDRA-13241
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13241
>             Project: Cassandra
>          Issue Type: Wish
>          Components: Core
>            Reporter: Benjamin Roth
>            Assignee: Ariel Weisberg
>            Priority: Major
>         Attachments: CompactIntegerSequence.java, CompactIntegerSequenceBench.java, CompactSummingIntegerSequence.java
>
>
> Too low a chunk size may result in some wasted disk space. Too high a chunk size may lead to massive overreads and may have a critical impact on overall system performance.
> In my case, the default chunk size led to peak read IOs of up to 1GB/s and avg reads of 200MB/s.
> After lowering the chunk size (aligned with read ahead, of course), the avg read IO went below 20 MB/s, to roughly 10-15MB/s.
> The risk of (physical) overreads increases as the (page cache size) / (total data size) ratio gets lower.
> High chunk sizes are mostly appropriate for bigger payloads per request, but if the model consists rather of small rows or small result sets, the read overhead with a 64kb chunk size is insanely high. This applies, for example, to (small) skinny rows.
> Please also see here:
> https://groups.google.com/forum/#!topic/scylladb-dev/j_qXSP-6-gY
> To give you some insight into what a difference it can make (460GB data, 128GB RAM):
> - Latency of a quite large CF: https://cl.ly/1r3e0W0S393L
> - Disk throughput: https://cl.ly/2a0Z250S1M3c
> - This shows that the request distribution remained the same, so no "dynamic snitch magic": https://cl.ly/3E0t1T1z2c0J

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org