>
> The predominant phrase used in that thread was 'feature freeze'.

At the risk of hijacking this thread, when are we going to transition from
"no new features, change whatever else you want including refactoring and
changing years-old defaults" to "ok, we think we have something that's
stable, time to start testing"?

Right now, if the community starts aggressively testing 4.0 with all the
changes still in flight, there's likely going to be a lot of wasted effort.
I think the root of the disconnect was that when we discussed "freeze" on
the mailing list, it was in the context of getting everyone engaged in
testing 4.0.

On Fri, Oct 19, 2018 at 9:46 AM Ariel Weisberg <ar...@weisberg.ws> wrote:

> Hi,
>
> I ran some benchmarks on my laptop
>
> https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=16656821&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16656821
>
> For a random read workload, varying chunk size:
> Chunk size      Time
>        64k     25:20
>        64k     25:33
>        32k     20:01
>        16k     19:19
>        16k     19:14
>         8k     16:51
>         4k     15:39
>
> Ariel
> On Thu, Oct 18, 2018, at 2:55 PM, Ariel Weisberg wrote:
> > Hi,
> >
> > For those who were asking about the performance impact of block size on
> > compression I wrote a microbenchmark.
> >
> > https://pastebin.com/RHDNLGdC
> >
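> > For reference, this is roughly the shape such a JMH benchmark can take (a
> > cut-down sketch, not the exact pastebin code: it parameterizes the chunk
> > size instead of having one method per size, only the compress side is
> > shown, the input data is a synthetic placeholder, and it assumes lz4-java
> > on the classpath):
> >
> > import java.util.concurrent.TimeUnit;
> > import net.jpountz.lz4.LZ4Compressor;
> > import net.jpountz.lz4.LZ4Factory;
> > import org.openjdk.jmh.annotations.*;
> >
> > @State(Scope.Thread)
> > @BenchmarkMode(Mode.Throughput)
> > @OutputTimeUnit(TimeUnit.SECONDS)
> > public class CompressChunkSizeBench {
> >     @Param({"8192", "16384", "32768", "65536"})
> >     int chunkSize;
> >
> >     byte[] src;
> >     byte[] dest;
> >     LZ4Compressor compressor;
> >
> >     @Setup
> >     public void setup() {
> >         compressor = LZ4Factory.fastestInstance().fastCompressor();
> >         src = new byte[chunkSize];
> >         for (int i = 0; i < src.length; i++)
> >             src[i] = (byte) (i % 64); // mildly compressible synthetic data
> >         dest = new byte[compressor.maxCompressedLength(chunkSize)];
> >     }
> >
> >     @Benchmark
> >     public int compressOneChunk() {
> >         // Compress a single chunk of chunkSize bytes per invocation.
> >         return compressor.compress(src, 0, chunkSize, dest, 0);
> >     }
> > }
> >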
> >      [java] Benchmark                                               Mode  Cnt          Score          Error  Units
> >      [java] CompactIntegerSequenceBench.benchCompressLZ4Fast16k    thrpt   15  331190055.685 ±  8079758.044  ops/s
> >      [java] CompactIntegerSequenceBench.benchCompressLZ4Fast32k    thrpt   15  353024925.655 ±  7980400.003  ops/s
> >      [java] CompactIntegerSequenceBench.benchCompressLZ4Fast64k    thrpt   15  365664477.654 ± 10083336.038  ops/s
> >      [java] CompactIntegerSequenceBench.benchCompressLZ4Fast8k     thrpt   15  305518114.172 ± 11043705.883  ops/s
> >      [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast16k  thrpt   15  688369529.911 ± 25620873.933  ops/s
> >      [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast32k  thrpt   15  703635848.895 ±  5296941.704  ops/s
> >      [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast64k  thrpt   15  695537044.676 ± 17400763.731  ops/s
> >      [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast8k   thrpt   15  727725713.128 ±  4252436.331  ops/s
> >
> > To summarize, compression is 8.5% slower and decompression is 1% faster.
> > This only measures the cost of compressing/decompressing the blocks
> > themselves, not the much larger win from decompressing data we don't need
> > less often (a small single-row read still decompresses a whole chunk, so a
> > 16k chunk wastes a quarter of what a 64k chunk does).
> >
> > I didn't test decompression of Snappy and LZ4 high, but I did test
> compression.
> >
> > Snappy:
> >      [java] CompactIntegerSequenceBench.benchCompressSnappy16k   thrpt    2  196574766.116          ops/s
> >      [java] CompactIntegerSequenceBench.benchCompressSnappy32k   thrpt    2  198538643.844          ops/s
> >      [java] CompactIntegerSequenceBench.benchCompressSnappy64k   thrpt    2  194600497.613          ops/s
> >      [java] CompactIntegerSequenceBench.benchCompressSnappy8k    thrpt    2  186040175.059          ops/s
> >
> > LZ4 high compressor:
> >      [java] CompactIntegerSequenceBench.bench16k  thrpt    2   20822947.578          ops/s
> >      [java] CompactIntegerSequenceBench.bench32k  thrpt    2   12037342.253          ops/s
> >      [java] CompactIntegerSequenceBench.bench64k  thrpt    2    6782534.469          ops/s
> >      [java] CompactIntegerSequenceBench.bench8k   thrpt    2   32254619.594          ops/s
> >
> > LZ4 high is the one instance where block size mattered a lot. It's a bit
> > suspicious that throughput scales close to 1:1 (inversely) with block size,
> > but I couldn't spot a bug in the benchmark.
> >
> > Compression ratios with LZ4 fast for the text of Alice in Wonderland were:
> >
> > Chunk size 8192, ratio 0.709473
> > Chunk size 16384, ratio 0.667236
> > Chunk size 32768, ratio 0.634735
> > Chunk size 65536, ratio 0.607208
> >
> > By way of comparison I also ran deflate with maximum compression:
> >
> > Chunk size 8192, ratio 0.426434
> > Chunk size 16384, ratio 0.402423
> > Chunk size 32768, ratio 0.381627
> > Chunk size 65536, ratio 0.364865
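> >
> > For anyone who wants to reproduce those ratios, a minimal sketch (not the
> > exact benchmark code, which is in the pastebin above) that chunks a text
> > file and compresses each chunk independently with lz4-java and the JDK
> > Deflater could look like this; the input file path is whatever sample text
> > you use:
> >
> > import java.nio.file.Files;
> > import java.nio.file.Paths;
> > import java.util.zip.Deflater;
> > import net.jpountz.lz4.LZ4Compressor;
> > import net.jpountz.lz4.LZ4Factory;
> >
> > public class ChunkRatioSketch {
> >     public static void main(String[] args) throws Exception {
> >         byte[] text = Files.readAllBytes(Paths.get(args[0])); // e.g. alice.txt
> >         LZ4Compressor lz4 = LZ4Factory.fastestInstance().fastCompressor();
> >         for (int chunk : new int[] { 8192, 16384, 32768, 65536 }) {
> >             long in = 0, lz4Out = 0, deflateOut = 0;
> >             byte[] lz4Buf = new byte[lz4.maxCompressedLength(chunk)];
> >             byte[] defBuf = new byte[chunk * 2];
> >             for (int off = 0; off + chunk <= text.length; off += chunk) {
> >                 in += chunk;
> >                 // Each chunk is compressed on its own, like an sstable chunk.
> >                 lz4Out += lz4.compress(text, off, chunk, lz4Buf, 0);
> >                 Deflater d = new Deflater(Deflater.BEST_COMPRESSION);
> >                 d.setInput(text, off, chunk);
> >                 d.finish();
> >                 while (!d.finished())
> >                     deflateOut += d.deflate(defBuf);
> >                 d.end();
> >             }
> >             System.out.printf("Chunk size %d, LZ4 ratio %f, deflate ratio %f%n",
> >                     chunk, (double) lz4Out / in, (double) deflateOut / in);
> >         }
> >     }
> > }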
> >
> > Ariel
> >
> > On Thu, Oct 18, 2018, at 5:32 AM, Benedict Elliott Smith wrote:
> > > FWIW, I’m not -0, just think that long after the freeze date a change
> > > like this needs a strong mandate from the community. I think the change
> > > is a good one.
> > >
> > >
> > >
> > >
> > >
> > > > On 17 Oct 2018, at 22:09, Ariel Weisberg <ar...@weisberg.ws> wrote:
> > > >
> > > > Hi,
> > > >
> > > > It's really not appreciably slower compared to the decompression we are
> > > > going to do, which is going to take several microseconds. Decompression is
> > > > also going to be faster overall because we will do less unnecessary
> > > > decompression, and the decompression itself may be faster since a smaller
> > > > chunk may fit in a higher-level cache better. I ran a microbenchmark
> > > > comparing them.
> > > >
> > > >
> https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=16653988&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16653988
> > > >
> > > > Fetching a long from memory:       56 nanoseconds
> > > > Compact integer sequence   :       80 nanoseconds
> > > > Summing integer sequence   :      165 nanoseconds
> > > >
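> > > > To make the trade-off concrete, here is an illustrative sketch of a
> > > > compact offset encoding (not the actual representation attached to the
> > > > JIRA, just the general idea of trading one extra array read and add per
> > > > lookup for fewer bytes per chunk offset):
> > > >
> > > > public class CompactChunkOffsets {
> > > >     private static final int GROUP = 16;
> > > >     private final long[] bases;  // full offset for every GROUP-th chunk
> > > >     private final int[] deltas;  // offset - base for every chunk
> > > >
> > > >     public CompactChunkOffsets(long[] offsets) {
> > > >         bases = new long[(offsets.length + GROUP - 1) / GROUP];
> > > >         deltas = new int[offsets.length];
> > > >         for (int i = 0; i < offsets.length; i++) {
> > > >             if (i % GROUP == 0)
> > > >                 bases[i / GROUP] = offsets[i];
> > > >             long delta = offsets[i] - bases[i / GROUP];
> > > >             if (delta < 0 || delta > Integer.MAX_VALUE)
> > > >                 throw new IllegalArgumentException("delta out of range");
> > > >             deltas[i] = (int) delta;
> > > >         }
> > > >     }
> > > >
> > > >     public long offset(int chunkIndex) {
> > > >         // One extra read and add compared to a plain long[] lookup,
> > > >         // which is the sort of overhead the 56ns vs 80ns numbers show.
> > > >         return bases[chunkIndex / GROUP] + deltas[chunkIndex];
> > > >     }
> > > > }
> > > >
> > > > This toy layout uses about 4.25 bytes per chunk instead of 8; the
> > > > representation mentioned later in this thread (37% of today's memory) is
> > > > more compact than that.
> > > >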
> > > > Currently we have one +1 from Kurt to change the representation and
> > > > possibly a -0 from Benedict. That's not really enough to make an exception
> > > > to the code freeze. If you want it to happen (or not), you need to speak
> > > > up; otherwise only the default will change.
> > > >
> > > > Regards,
> > > > Ariel
> > > >
> > > > On Wed, Oct 17, 2018, at 6:40 AM, kurt greaves wrote:
> > > >> I think if we're going to drop it to 16k, we should invest in the compact
> > > >> sequencing as well. Just lowering it to 16k will potentially have a painful
> > > >> impact on anyone running low-memory nodes, but if we can do it without the
> > > >> memory impact I don't think there's any reason to wait another major
> > > >> version to implement it.
> > > >>
> > > >> Having said that, we should probably benchmark the two representations
> > > >> Ariel has come up with.
> > > >>
> > > >> On Wed, 17 Oct 2018 at 20:17, Alain RODRIGUEZ <arodr...@gmail.com>
> wrote:
> > > >>
> > > >>> +1
> > > >>>
> > > >>> I would guess a lot of C* clusters/tables have this option set to the
> > > >>> default value, and not many of them actually need to read such big chunks
> > > >>> of data.
> > > >>> I believe this will greatly limit disk overreads for a fair amount (a big
> > > >>> majority?) of new users. It seems fair enough to change this default
> > > >>> value, and I also think 4.0 is a nice place to do this.
> > > >>>
> > > >>> Thanks for taking care of this Ariel and for making sure there is a
> > > >>> consensus here as well,
> > > >>>
> > > >>> C*heers,
> > > >>> -----------------------
> > > >>> Alain Rodriguez - al...@thelastpickle.com
> > > >>> France / Spain
> > > >>>
> > > >>> The Last Pickle - Apache Cassandra Consulting
> > > >>> http://www.thelastpickle.com
> > > >>>
> > > >>> Le sam. 13 oct. 2018 à 08:52, Ariel Weisberg <ar...@weisberg.ws>
> a écrit :
> > > >>>
> > > >>>> Hi,
> > > >>>>
> > > >>>> This would only impact new tables; existing tables would get their
> > > >>>> chunk_length_in_kb from the existing schema. It's something we record in
> > > >>>> a system table.
> > > >>>>
> > > >>>> I have an implementation of a compact integer sequence that only requires
> > > >>>> 37% of the memory required today. So we could do this with only slightly
> > > >>>> more than doubling the memory used. I'll post that to the JIRA soon.
> > > >>>>
> > > >>>> Ariel
> > > >>>>
> > > >>>> On Fri, Oct 12, 2018, at 1:56 AM, Jeff Jirsa wrote:
> > > >>>>>
> > > >>>>>
> > > >>>>> I think 16k is a better default, but it should only affect new tables.
> > > >>>>> Whoever changes it, please make sure you think about the upgrade path.
> > > >>>>>
> > > >>>>>
> > > >>>>>> On Oct 12, 2018, at 2:31 AM, Ben Bromhead <b...@instaclustr.com>
> > > >>> wrote:
> > > >>>>>>
> > > >>>>>> This is something that's bugged me for ages; tbh the performance gain
> > > >>>>>> for most use cases far outweighs the increase in memory usage, and I
> > > >>>>>> would even be in favor of changing the default now and optimizing the
> > > >>>>>> storage cost later (if it's found to be worth it).
> > > >>>>>>
> > > >>>>>> For some anecdotal evidence:
> > > >>>>>> 4kb is usually what we end up setting it to. 16kb feels more reasonable
> > > >>>>>> given the memory impact, but what would be the point if, practically,
> > > >>>>>> most folks set it to 4kb anyway?
> > > >>>>>>
> > > >>>>>> Note that chunk_length will largely be dependent on your read sizes,
> > > >>>>>> but 4k is the floor for most physical devices in terms of their block
> > > >>>>>> size.
> > > >>>>>>
> > > >>>>>> +1 for making this change in 4.0 given the small size and the large
> > > >>>>>> improvement to new users' experience (as long as we are explicit in the
> > > >>>>>> documentation about memory consumption).
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>> On Thu, Oct 11, 2018 at 7:11 PM Ariel Weisberg <
> ar...@weisberg.ws>
> > > >>>> wrote:
> > > >>>>>>>
> > > >>>>>>> Hi,
> > > >>>>>>>
> > > >>>>>>> This is regarding
> > > >>>> https://issues.apache.org/jira/browse/CASSANDRA-13241
> > > >>>>>>>
> > > >>>>>>> This ticket has languished for a while. IMO it's too late in 4.0 to
> > > >>>>>>> implement a more memory-efficient representation for compressed chunk
> > > >>>>>>> offsets. However, I don't think we should put out another release with
> > > >>>>>>> the current 64k default, as it's pretty unreasonable.
> > > >>>>>>>
> > > >>>>>>> I propose that we lower the value to 16kb. 4k might never be the
> > > >>>>>>> correct default anyway, as there is a cost to compression, and 16k
> > > >>>>>>> will still be a large improvement.
> > > >>>>>>>
> > > >>>>>>> Benedict and Jon Haddad are both +1 on making this change for 4.0.
> > > >>>>>>> In the past there has been some consensus about reducing this value,
> > > >>>>>>> although maybe with more memory efficiency.
> > > >>>>>>>
> > > >>>>>>> The napkin math for what this costs is:
> > > >>>>>>> "If you have 1TB of uncompressed data, with 64k chunks that's 16M
> > > >>>>>>> chunks at 8 bytes each (128MB).
> > > >>>>>>> With 16k chunks, that's 512MB.
> > > >>>>>>> With 4k chunks, it's 2G.
> > > >>>>>>> Per terabyte of data (pre-compression)."
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=15886621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15886621
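> > > >>>>>>>
> > > >>>>>>> (Spelling that arithmetic out: 1TB / 64KB is roughly 16.8M chunks x 8
> > > >>>>>>> bytes = ~128MB; 1TB / 16KB is ~67M chunks = ~512MB; 1TB / 4KB is ~268M
> > > >>>>>>> chunks = ~2GB.)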
> > > >>>>>>>
> > > >>>>>>> By way of comparison, memory mapping the files has a similar cost of
> > > >>>>>>> 8 bytes per 4k page. Multiple mappings make this more expensive. With a
> > > >>>>>>> default of 16kb this would be 4x less expensive than memory mapping a
> > > >>>>>>> file. I only mention this to give a sense of the costs we are already
> > > >>>>>>> paying; I am not saying they are directly related.
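> > > >>>>>>>
> > > >>>>>>> (That is, 8 bytes per 4KB mapped page is ~2GB per TB, versus ~512MB
> > > >>>>>>> per TB for 16KB chunk offsets, hence the 4x.)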
> > > >>>>>>>
> > > >>>>>>> I'll wait a week for discussion and, if there is consensus, make the
> > > >>>>>>> change.
> > > >>>>>>>
> > > >>>>>>> Regards,
> > > >>>>>>> Ariel
> > > >>>>>>>
> > > >>>>>>>
> > > >>>
> > > >>>>>>>
> > > >>>>>>> --
> > > >>>>>> Ben Bromhead
> > > >>>>>> CTO | Instaclustr <https://www.instaclustr.com/>
> > > >>>>>> +1 650 284 9692
> > > >>>>>> Reliability at Scale
> > > >>>>>> Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>
> > > >
> > > >
> > >
> > >
> > >
> >
> >
>
>
>
