[
https://issues.apache.org/jira/browse/CASSANDRA-13900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Thomas Steinmaurer updated CASSANDRA-13900:
-------------------------------------------
Description:
In short: After upgrading from 2.1.18 to 3.0.14, we aren't able to process the
same incoming write load on the same infrastructure anymore.
We have a loadtest environment running 24x7, testing our software with
Cassandra as the backend. Both loadtest and production are hosted in AWS and
have the same per-node spec on the Cassandra side, namely:
* 9x m4.xlarge
* 8G heap
* CMS (400MB newgen)
* 2TB EBS gp2
* Client requests are entirely CQL
With constant, simulated load running against the cluster, we have had a solid
baseline in loadtest of ~ 60% average cluster CPU, using Cassandra 2.1 for more
than 2 years now. The heap/GC settings behind this spec are sketched below.
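For reference, a minimal sketch of the heap/GC settings listed above as they
would appear in cassandra-env.sh (exact placement differs between 2.1.18 and
3.0.14, which also has conf/jvm.options); this illustrates the spec, it is not a
verbatim copy of the deployed config:
{noformat}
# cassandra-env.sh (sketch): 8G heap, CMS with a 400MB new generation
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="400M"
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
{noformat}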
Recently we started to upgrade this 9-node loadtest environment to 3.0.14, and
basically 3.0.14 isn't able to cope with the load anymore. There are no special
tweaks or memory settings/changes; everything is the same as with 2.1.18. We
also haven't upgraded sstables yet, so the increase shown in the screenshot is
not related to any manually triggered maintenance operation after upgrading to
3.0.14.
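One way to double-check the "sstables not upgraded yet" point is the format
token in the sstable file names (2.1 writes the ka format, 3.0 writes ma/mb/mc);
the data path below is the default and may need adjusting:
{noformat}
# list sstable data file names with their format version token
# ("ka" = written by 2.1, "m*" = written/compacted under 3.0)
find /var/lib/cassandra/data -name '*Data.db' -printf '%f\n' | sort
{noformat}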
According to our monitoring, with 3.0.14 we see a *GC suspension time increase
by a factor of > 2*, directly correlating with a CPU increase to > 80%.
See the attached screenshot "cassandra2118_vs_3014.jpg".
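To cross-check the monitoring numbers independently of our tooling, the
suspension time can also be taken straight from the JVM with the standard
Java 8 GC-logging flags; a sketch only, the log path is a placeholder and the
awk parsing assumes the Java 8 wording of the stopped-time line:
{noformat}
# append to JVM_OPTS in cassandra-env.sh, then restart the node
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"

# rough total stop-the-world time over the logged interval
grep 'Total time for which application threads were stopped' /var/log/cassandra/gc.log \
  | awk '{for (i = 1; i <= NF; i++) if ($i == "stopped:") sum += $(i+1)} END {print sum, "seconds stopped"}'
{noformat}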
This all means that the incoming load 2.1.18 handles is something 3.0.14 can't
handle. So we would need to either scale up (e.g. m4.xlarge => m4.2xlarge) or
scale out to be able to handle the same load, which is not an option cost-wise.
Unfortunately I do not have Java Flight Recorder runs for 2.1.18 at the
mentioned load, but I can provide a JFR session for our current 3.0.14 setup, if
needed.
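For completeness, a sketch of how such a JFR session can be produced on a
running node (pid, recording name and output path are placeholders; on Oracle
JDK 8 the JVM additionally has to be started with -XX:+UnlockCommercialFeatures
-XX:+FlightRecorder):
{noformat}
# start a 5 minute Java Flight Recorder recording against the Cassandra JVM
jcmd <cassandra-pid> JFR.start name=c3014 settings=profile duration=5m filename=/tmp/cassandra_3014_5min.jfr

# verify the recording is running / has finished
jcmd <cassandra-pid> JFR.check
{noformat}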
> Massive GC suspension increase after updating to 3.0.14 from 2.1.18
> -------------------------------------------------------------------
>
> Key: CASSANDRA-13900
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13900
> Project: Cassandra
> Issue Type: Bug
> Reporter: Thomas Steinmaurer
> Priority: Blocker
> Attachments: cassandra2118_vs_3014.jpg, cassandra3014_jfr_5min.jpg
>
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)