RE: GC/CPU increase after upgrading to 3.0.14 (from 2.1.18)

2017-09-26 Thread Steinmaurer, Thomas
Hi Alex,

we tested with larger new gen sizes, up to ¼ of the max heap, but the m4.xlarge
looks too weak to deal with a larger new gen. The result was that we then got
many more GCInspector-related logs, but perhaps we need to re-test.

Right, we are using batches extensively; unlogged/non-atomic. We are aware that
multi partition batches should be avoided where possible. For test purposes we
built a flag into our application to switch from multi partition batches to
strictly single partition batches. We have not seen any measurable high-level
improvement (e.g. decreased CPU, GC suspension …) on the Cassandra side with
single partition batches. Naturally, this resulted in many more requests executed
by our application against the Cassandra cluster, with the effect that we saw a
significant GC/CPU increase on our own application/server, caused by the DataStax
driver now executing a factor of X more requests. So, with no visible gain on the
Cassandra side but a negative impact on our application/server, we don't strictly
execute single partition batches.
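For reference, a minimal sketch of the single-partition grouping described above,
written against the DataStax Java driver 3.1 mentioned later in the thread; the
Measurement holder, the prepared insert and its bind order are made up for
illustration only:

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SinglePartitionBatchSketch {

    // Hypothetical row holder; field names are illustrative only.
    static class Measurement {
        String partitionKey;
        long timestamp;
        double value;
    }

    // Writes one UNLOGGED batch per partition key, so every batch touches exactly
    // one partition and the coordinator never has to fan a single batch out to
    // several replica sets.
    static void writeGrouped(Session session, PreparedStatement insert, List<Measurement> rows) {
        Map<String, List<Measurement>> byPartition =
                rows.stream().collect(Collectors.groupingBy(m -> m.partitionKey));

        for (List<Measurement> partitionRows : byPartition.values()) {
            BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
            for (Measurement m : partitionRows) {
                batch.add(insert.bind(m.partitionKey, m.timestamp, m.value));
            }
            session.execute(batch);
        }
    }
}

The trade-off is exactly the one described above: one request per partition
instead of one request per group of partitions, so the driver-side request count
(and the client GC/CPU) goes up by the same factor.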

As said on the ticket (https://issues.apache.org/jira/browse/CASSANDRA-13900),
nothing except the Cassandra binaries has changed in our loadtest environment.


Thanks,
Thomas



RE: GC/CPU increase after upgrading to 3.0.14 (from 2.1.18)

2017-09-26 Thread Steinmaurer, Thomas
Hi,

In our experience, CMS is doing much better with smaller heaps.
Regards,
Thomas


Re: GC/CPU increase after upgrading to 3.0.14 (from 2.1.18)

2017-09-26 Thread Alexander Dejanovski
Hi Thomas,

I wouldn't move to G1GC with small heaps (<24GB), but just looking at your
ticket I think that your new gen is way too small.
I get that it worked better in 2.1 in your case though, which would suggest
that the memory footprint is different between 2.1 and 3.0. It looks like
you're using batches extensively.
Hopefully you're aware that multi-partition batches are discouraged because
they indeed create heap pressure and high coordination costs (on top of
batchlog writes/deletions), leading to more GC pauses.
With a 400MB new gen, you're very likely to have a lot of premature
promotions (especially with the default max tenuring threshold), which will
fill the old gen faster than necessary and is likely to trigger major GCs.

I'd suggest you re-run those tests with a 2GB new gen and compare results.
Know that with Cassandra you can easily go up to 40%-50% of your heap for
the new gen.
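To put rough numbers on that (simple arithmetic, assuming the 8 GB heap on an
m4.xlarge, i.e. 4 vCPUs, from the ticket): the stock cassandra-env.sh sizes the
new gen as min(100 MB x number of cores, 1/4 of the heap), which on 4 cores gives
the 400 MB mentioned above. A 2 GB new gen is 25% of an 8 GB heap, and the
40%-50% range would be roughly 3-4 GB, often combined with raising the
MaxTenuringThreshold of 1 that cassandra-env.sh sets by default, so short-lived
request garbage can die young instead of being promoted.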

Cheers,


Re: GC/CPU increase after upgrading to 3.0.14 (from 2.1.18)

2017-09-26 Thread Matope Ono
Hi. We ran into a similar situation after upgrading from 2.1.14 to 3.11 in our
production environment.

Have you already tried G1GC instead of CMS? Our timeouts were mitigated
after replacing CMS with G1GC.

Thanks.


RE: GC/CPU increase after upgrading to 3.0.14 (from 2.1.18)

2017-09-25 Thread Steinmaurer, Thomas
Hello,

I now have some concrete numbers from our 9-node loadtest cluster with constant
load, on the same infrastructure, after upgrading from 2.1.18 to 3.0.14.

We see doubled GC suspension time plus a correlating CPU increase. In short,
3.0.14 is not able to handle the same load.

I have created https://issues.apache.org/jira/browse/CASSANDRA-13900. Feel free
to request any further information on the ticket.

Unfortunately this is a real show-stopper for us upgrading to 3.0.

Thanks for your attention.

Thomas


RE: GC/CPU increase after upgrading to 3.0.14 (from 2.1.18)

2017-09-18 Thread Steinmaurer, Thomas
Hello again,

I dug a bit further, comparing 1-hour flight recording sessions for both 2.1 and
3.0 under the same simulated load from our loadtest environment.

We are much more write-bound than read-bound in this environment/scenario, and
there is a noticeable/measurable difference in 3.0 in what happens underneath
org.apache.cassandra.cql3.statements.BatchStatement.execute, in both the JFR/JMC
Code and Memory (allocation rate / object churn) views.

E.g. for org.apache.cassandra.cql3.statements.BatchStatement.execute, JFR reports
a total TLAB size of 59,35 GB for the 1-hour session on 2.1, versus 246,12 GB on
Cassandra 3.0 - so, if this is trustworthy, roughly a 4 times higher allocation
rate in the BatchStatement.execute code path, which would explain the increased
GC suspension since upgrading.
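Taking those JFR totals at face value (simple arithmetic on the figures above, no
additional measurements): 59,35 GB over one hour is roughly 17 MB/s of allocation,
while 246,12 GB is roughly 70 MB/s, i.e. about a 4,1x increase. Against a new gen
of only a few hundred MB, an extra ~50 MB/s from this one code path alone would
plausibly force noticeably more frequent young collections, which lines up with
the doubled GC suspension reported earlier in the thread.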

Is anybody aware of write-bound benchmarks of the 3.0 storage engine that focus
on CPU/GC rather than disk savings?

Thanks,
Thomas



RE: GC/CPU increase after upgrading to 3.0.14 (from 2.1.18)

2017-09-15 Thread Steinmaurer, Thomas
Hi Jeff,

we are using the native protocol (CQL3) via the DataStax Java driver (3.1). We
also have OpsCenter running (to be removed soon), which uses Thrift, if I
remember correctly.

As said, the write request latency for our keyspace hasn't really changed, so
perhaps another keyspace (system related, OpsCenter …?) is affected, or perhaps
the JMX metric is reporting something differently now. ☺ So hopefully not a real
issue for now; it just popped up in our monitoring and we are wondering what it
may be.

Regarding the compression metadata memory usage drop: right, the storage engine
re-write could be a reason. Thanks.

Still wondering about the GC/CPU increase.

Thanks!

Thomas




Re: GC/CPU increase after upgrading to 3.0.14 (from 2.1.18)

2017-09-15 Thread Jeff Jirsa
Most people find 3.0 slightly slower than 2.1. The only thing that really
stands out in your email is the huge change in 95% latency - that's atypical.
Are you using Thrift or native (9042)? The decrease in compression metadata
offheap usage is likely due to the increased storage efficiency of the storage
engine (see CASSANDRA-8099).


-- 
Jeff Jirsa



GC/CPU increase after upgrading to 3.0.14 (from 2.1.18)

2017-09-15 Thread Steinmaurer, Thomas
Hello,

we have a test (regression) environment hosted in AWS, which is used to
auto-deploy our software on a daily basis and to apply constant load across all
deployments, basically to allow us to detect any regressions in our software
daily.

On the Cassandra side, this is a single node in AWS: m4.xlarge, EBS gp2, 8G heap,
CMS. The environment was upgraded from Cassandra 2.1.18 to 3.0.14 at a certain
point in time, without running upgradesstables so far. We have not made any
additional JVM/GC configuration changes when going from 2.1.18 to 3.0.14, thus
any self-made configuration changes (e.g. new gen heap size) made for 2.1.18 are
also in place with 3.0.14.

What we see after a time frame of ~7 days (so it should not be caused by some
sort of spiky compaction pattern) is an average increase in GC/CPU (most likely
correlated):

* CPU: ~ 12% => ~ 17%

* GC Suspension: ~ 1,7% => 3,29%

In this environment this is not a big deal, but relatively we have a CPU increase
of ~50% (with increased GC most likely contributing). This is something we will
have to deal with before going into production (moving to larger, multi-node
loadtest environments first, though).

Besides the CPU/GC shift, we also see the following noticeable changes (we don't
know whether they somehow correlate with the CPU/GC shift above):

* Increased average write client request latency (95th percentile),
org.apache.cassandra.metrics.ClientRequest.Latency.Write: 6,05ms => 29,2ms, but
almost constant (no change in) write client request latency for our particular
keyspace, org.apache.cassandra.metrics.Keyspace.ruxitdb.WriteLatency

* Compression metadata memory usage drop,
org.apache.cassandra.metrics.Keyspace.XXX.CompressionMetadataOffHeapMemoryUsed:
~218MB => ~105MB. Good or bad? Known?
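For anyone who wants to spot-check the two latency metrics above directly, a
minimal JMX read sketch (plain javax.management; the dotted names above map to
the MBeans below as exposed by Cassandra's JMX metrics reporter; it assumes the
default unauthenticated JMX port 7199 and uses a placeholder keyspace name):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ReadWriteLatencyMetrics {
    public static void main(String[] args) throws Exception {
        // Assumes the default unauthenticated JMX port 7199 on the local node.
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // Coordinator-level write latency, 95th percentile.
            ObjectName clientWrite = new ObjectName(
                    "org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency");
            Object p95 = mbs.getAttribute(clientWrite, "95thPercentile");

            // Per-keyspace write latency; "my_keyspace" is a placeholder.
            ObjectName keyspaceWrite = new ObjectName(
                    "org.apache.cassandra.metrics:type=Keyspace,keyspace=my_keyspace,name=WriteLatency");
            Object ksMean = mbs.getAttribute(keyspaceWrite, "Mean");

            System.out.println("ClientRequest Write 95thPercentile: " + p95);
            System.out.println("Keyspace Write Mean: " + ksMean);
        } finally {
            connector.close();
        }
    }
}

This is only meant for ad-hoc spot checks; the long-term numbers obviously come
from the monitoring mentioned above.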

I know this all looks a bit vague, but perhaps someone else has seen something
similar when upgrading to 3.0.14 and can share their thoughts/ideas. Especially
the (relative) CPU/GC increase is something we are curious about.

Thanks a lot.

Thomas