[jira] [Commented] (CASSANDRA-10995) Consider disabling sstable compression by default in 3.x

2016-01-26 Thread Jim Witschey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15117606#comment-15117606
 ] 

Jim Witschey commented on CASSANDRA-10995:
--

[~benedict] Thanks for your comments.

bq. But only if the population is very small, since you would need for the data 
to occur multiple times on a single page.

I've been assuming that compression happened per-sstable -- am I wrong about 
that? Is this behavior documented somewhere?

bq. Realistically a dictionary generator should be added, which is not very 
hard, and was on my todo list for a long time. That or a weighted random byte 
generator, that is more likely to produce certain bytes (or byte sequences) 
than others, which would avoid the necessity of a dictionary while providing 
the same benefit.

Good idea. [~tjake]: If it's as simple as Benedict indicates, how soon could 
you put together a basic dictionary generator? If a usable version were on a 
branch somewhere in the next few days, it'd be useful for this benchmark. As 
discussed, though, I can work around it if you aren't able.

> Consider disabling sstable compression by default in 3.x
> 
>
> Key: CASSANDRA-10995
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10995
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Aleksey Yeschenko
> Assignee: Jim Witschey
>
> With the new sstable format introduced in CASSANDRA-8099, it's very likely 
> that enabled sstable compression is no longer the right default option.
> [~slebresne]'s [blog post|http://www.datastax.com/2015/12/storage-engine-30] 
> on the new storage engine has some comparison numbers for 2.2/3.0, with and 
> without compression, which show that in many cases compression no longer has a 
> significant effect on sstable sizes, all while still consuming extra 
> resources for both writes (compression) and reads (decompression).
> We should run a comprehensive set of benchmarks to determine whether or not 
> compression should be switched to 'off' now in 3.x.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10995) Consider disabling sstable compression by default in 3.x

2016-01-26 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15117244#comment-15117244
 ] 

Benedict commented on CASSANDRA-10995:
--

You probably want to use a larger dataset.  I suspect the current one is 
fitting entirely into RAM.  Turning off compression may yield larger dividends 
for on-disk performance with small rows, since fewer sectors need to be touched.

As far as compressible data is concerned, yes, narrowing the population size 
_for each column_ in the yaml will increase compressibility.  But only if the 
population is very small, since you would need for the data to occur multiple 
times on a single page.  Realistically a dictionary generator should be added, 
which is not very hard, and was on my todo list for a long time.  That or a 
weighted random byte generator, that is more likely to produce certain bytes 
(or byte sequences) than others, which would avoid the necessity of a 
dictionary while providing the same benefit.
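
The weighted random byte generator described above might look roughly like the following. This is only a sketch, not code from {{cassandra-stress}}: the function names, the Zipf-like weighting, and the {{skew}} parameter are all assumptions made for illustration, and Python's {{zlib}} stands in for an sstable compressor.

```python
import random
import zlib


def weighted_random_bytes(n, rng, skew=1.2):
    """Return n bytes where byte value b is drawn with weight 1 / (b + 1) ** skew.

    The weighting scheme is an assumption; any non-uniform distribution
    would make some bytes more likely than others.
    """
    weights = [1.0 / (b + 1) ** skew for b in range(256)]
    return bytes(rng.choices(range(256), weights=weights, k=n))


def uniform_random_bytes(n, rng):
    """Return n bytes drawn uniformly, like purely random stress data."""
    return bytes(rng.getrandbits(8) for _ in range(n))


def compressed_ratio(data):
    """Compressed size divided by raw size (lower means more compressible)."""
    return len(zlib.compress(data)) / len(data)


rng = random.Random(10995)  # fixed seed so runs are repeatable
uniform = uniform_random_bytes(64 * 1024, rng)
weighted = weighted_random_bytes(64 * 1024, rng)

print(f"uniform bytes:  compressed/raw = {compressed_ratio(uniform):.2f}")
print(f"weighted bytes: compressed/raw = {compressed_ratio(weighted):.2f}")
```

The skewed output should compress noticeably better than the uniform output without needing any dictionary of values, which is the benefit claimed for this approach.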



[jira] [Commented] (CASSANDRA-10995) Consider disabling sstable compression by default in 3.x

2016-01-22 Thread Jim Witschey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113334#comment-15113334
 ] 

Jim Witschey commented on CASSANDRA-10995:
--

I started with this workload from [~enigmacurry], which he says he uses as a 
go-to starting point:

http://cstar.datastax.com/tests/id/a4963d82-a596-11e5-8573-0256e416528f

I ran the workload with each SSTable compressor and with no compression:

* http://cstar.datastax.com/tests/id/872be204-c073-11e5-b8b1-0256e416528f
* http://cstar.datastax.com/tests/id/3aa9a452-c055-11e5-8c22-0256e416528f
* http://cstar.datastax.com/tests/id/82bcb414-bfb0-11e5-8c22-0256e416528f
* http://cstar.datastax.com/tests/id/0ef49fb8-bf94-11e5-8c22-0256e416528f

Here are stress' summary statistics:

{code}
Write
===
                             Deflate       LZ4    Snappy  no compression
latency 95th percentile          3.5       3.3       3.3             3.4
latency 99th percentile          5.3       4.8       4.7             5.3
latency 99.9th percentile      103.5      86.1      87.7            86.7
latency max                   9357.7     513.3     471.7           397.9
op rate                     146818.0  226499.0  227101.0        227818.0
partition rate              146818.0  226499.0  227101.0        227818.0
row rate                    146818.0  226499.0  227101.0        227818.0
latency mean                     3.4       2.2       2.2             2.2
latency median                   1.5       1.5       1.5             1.5

Read
===
                             Deflate       LZ4    Snappy  no compression
latency 95th percentile         11.6       4.2       4.5             3.5
latency 99th percentile         27.3       6.4       7.0             5.1
latency 99.9th percentile       56.3      48.5      49.1            48.4
latency max                    363.6     403.1     385.9           469.0
op rate                          6.0  204419.0  197231.0        229806.0
partition rate                   6.0  204419.0  197231.0        229806.0
row rate                         6.0  204419.0  197231.0        229806.0
latency mean                     7.1       2.4       2.5             2.1
latency median                   6.1       1.8       1.9             1.6

Mixed Read/Write
===
                             Deflate       LZ4    Snappy  no compression
latency 95th percentile         12.0       4.9       5.1             3.5
latency 99th percentile         25.2       9.4       9.0             5.0
latency 99.9th percentile       61.5      59.2      58.8            57.6
latency max                    261.8    7436.9    6741.0          3443.8
op rate                      76038.0  181384.0  177463.0        217650.0
partition rate               76038.0  181384.0  177463.0        217650.0
row rate                     76038.0  181384.0  177463.0        217650.0
latency mean                     6.5       2.7       2.8             2.3
latency median                   5.4       1.7       1.8             1.6
{code}

(For future reference, I generated this chart using the data and IPython 
notebook posted here: https://gist.github.com/mambocab/7bf14e0ff23e0f807f67)

I would want to re-run this a couple of times before drawing conclusions, but 
so far, no compression does better than every compressor on most read 
metrics.
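
To put rough numbers on that comparison, the op rates from the summary above can be expressed relative to the uncompressed run. The sketch below only does arithmetic on figures already reported in this comment; the dictionary layout and the function name are illustrative. The Deflate read op rate is left out because its value appears truncated in the archived table.

```python
# Op rates copied from the stress summary statistics in this comment.
op_rate = {
    "write": {"Deflate": 146818.0, "LZ4": 226499.0, "Snappy": 227101.0, "none": 227818.0},
    "read": {"LZ4": 204419.0, "Snappy": 197231.0, "none": 229806.0},
    "mixed": {"Deflate": 76038.0, "LZ4": 181384.0, "Snappy": 177463.0, "none": 217650.0},
}


def relative_to_uncompressed(rates):
    """Percent change in op rate for each compressor versus no compression."""
    baseline = rates["none"]
    return {
        name: 100.0 * (rate - baseline) / baseline
        for name, rate in rates.items()
        if name != "none"
    }


relative = {workload: relative_to_uncompressed(rates) for workload, rates in op_rate.items()}

for workload, changes in relative.items():
    for name, pct in changes.items():
        print(f"{workload:>5} {name:>7}: {pct:+.1f}% vs. no compression")
```

Since these figures come from a single run per configuration, the percentages are indicative only.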

Any particular requests for follow-up? My thought was:

* at the very least, more runs of this same workload
* probably runs of some of the small workloads we use for daily regressions

I also have access to Windows machines, so if this benchmark is good, I can run 
on that cluster as well.

It may also be worth the time to confirm that turning off compression doesn't 
negatively impact, e.g., MVs and 2Is, and to test a larger variety of datasets. 
I'm not sure exactly what information we need to make this decision.


[jira] [Commented] (CASSANDRA-10995) Consider disabling sstable compression by default in 3.x

2016-01-13 Thread Jim Witschey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15096640#comment-15096640
 ] 

Jim Witschey commented on CASSANDRA-10995:
--

[~iamaleksey] Jake had a good suggestion for getting more compressible sstables 
out of {{cassandra-stress}}: decrease the size of the population from which to 
insert. I'm working on determining whether data generated like that actually 
compresses more than stress's default randomly-generated data, but if it does, 
do you think that would be a reasonable proxy for a normal dataset w.r.t. 
compression?
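
One way to sanity-check the idea outside of Cassandra is to draw values from pools of different sizes and compress the result in independent chunks. This is a rough sketch under several assumptions: {{zlib}} stands in for an sstable compressor, the 64 KB chunk size and 32-byte value width are picked for illustration, and the helper names are hypothetical rather than part of {{cassandra-stress}}.

```python
import random
import zlib

CHUNK_BYTES = 64 * 1024  # assumed chunk size, chosen for illustration


def generate_values(population, count, rng, width=32):
    """Draw `count` values from a pool of `population` random `width`-byte values."""
    pool = [rng.getrandbits(8 * width).to_bytes(width, "big") for _ in range(population)]
    return b"".join(rng.choice(pool) for _ in range(count))


def chunked_ratio(data, chunk_bytes=CHUNK_BYTES):
    """Compress each chunk on its own and return compressed size / raw size."""
    compressed = sum(
        len(zlib.compress(data[start:start + chunk_bytes]))
        for start in range(0, len(data), chunk_bytes)
    )
    return compressed / len(data)


rng = random.Random(10995)  # fixed seed so runs are repeatable
ratios = {
    population: chunked_ratio(generate_values(population, 20_000, rng))
    for population in (16, 1_000, 100_000)
}

for population, ratio in ratios.items():
    print(f"population {population:>7}: compressed/raw = {ratio:.2f}")
```

Because each chunk is compressed on its own, a smaller population only helps when values repeat within a chunk, so the gain should shrink quickly as the population grows.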



[jira] [Commented] (CASSANDRA-10995) Consider disabling sstable compression by default in 3.x

2016-01-11 Thread Joshua McKenzie (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092859#comment-15092859
 ] 

Joshua McKenzie commented on CASSANDRA-10995:
-

(You're going to love me for this): We need to make sure we benchmark this on 
Windows as well, since mmap vs. buffered performance has historically had a 
larger disparity on that platform than on Linux. That being said, 
uncompressed was always considerably faster when tested on Windows, but we do 
want to make sure we cover that platform.



[jira] [Commented] (CASSANDRA-10995) Consider disabling sstable compression by default in 3.x

2016-01-11 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092780#comment-15092780
 ] 

Aleksey Yeschenko commented on CASSANDRA-10995:
---

We don't want to measure just on-disk size. I want to see numbers for reads 
even more.

To make a decision as big as this, having representative compressible datasets 
(alongside non-compressible ones) is pretty important. But we could start with 
what we can generate easily and go from there.



[jira] [Commented] (CASSANDRA-10995) Consider disabling sstable compression by default in 3.x

2016-01-11 Thread Jim Witschey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092770#comment-15092770
 ] 

Jim Witschey commented on CASSANDRA-10995:
--

One problem we currently have with benchmarking on-disk data size, in 
particular w.r.t. compression, is this: we don't have tools that will generate 
representative, compressible data. It's easy to generate random data 
({{UUID}}s, random strings from {{cassandra-stress}}).

[~iamaleksey] How important is it that we use such a dataset? You'd know better 
than I, but I don't imagine compressibility would affect resource utilization 
other than disk much.
