[ https://issues.apache.org/jira/browse/CASSANDRA-13900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thomas Steinmaurer updated CASSANDRA-13900:
-------------------------------------------
Resolution: Duplicate
Status: Resolved (was: Open)
DUP of CASSANDRA-16201
> Massive GC suspension increase after updating to 3.0.14 from 2.1.18
> -------------------------------------------------------------------
>
> Key: CASSANDRA-13900
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13900
> Project: Cassandra
> Issue Type: Bug
> Components: Legacy/Core
> Reporter: Thomas Steinmaurer
> Priority: Urgent
> Attachments: cassandra2118_vs_3014.jpg, cassandra3014_jfr_5min.jpg,
> cassandra_3.11.0_min_memory_utilization.jpg
>
>
> In short: After upgrading to 3.0.14 (from 2.1.18), we aren't able to process
> the same incoming write load on the same infrastructure anymore.
> We have a loadtest environment running 24x7, testing our software with
> Cassandra as the backend. Both loadtest and production are hosted in AWS
> and have the same spec on the Cassandra side, namely:
> * 9x m4.xlarge
> * 8G heap (per node)
> * CMS (400MB newgen)
> * 2TB EBS gp2 (per node)
> * Client requests are entirely CQL
> We have had a solid, constant baseline in loadtest at ~60% cluster-average
> CPU with constant, simulated load running against our cluster, using
> Cassandra 2.1 this way for more than 2 years now.
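> For reference, the heap/GC setup above corresponds roughly to JVM options of
> the following shape (a sketch of typical cassandra-env.sh style settings,
> not a verbatim copy of our files):
> -Xms8G -Xmx8G
> -Xmn400M
> -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC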
> Recently we started upgrading this 9-node loadtest environment to 3.0.14,
> and basically, 3.0.14 isn't able to cope with the load anymore. There are no
> special tweaks or memory settings/changes; everything is the same as with
> 2.1.18. We also haven't upgraded sstables yet, so the increase shown in the
> screenshot is not related to any manually triggered maintenance operation
> after upgrading to 3.0.14.
> According to our monitoring, with 3.0.14 we see a *GC suspension time
> increase by a factor of > 2*, of course directly correlating with a CPU
> increase > 80%. See the attached screenshot "cassandra2118_vs_3014.jpg".
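> To make "GC suspension" concrete: our monitoring essentially tracks the
> accumulated stop-the-world GC time per sampling interval. A minimal,
> hypothetical Java sketch of that kind of sampling via the standard JMX
> beans (not our actual monitoring agent, which attaches remotely) would be:
>
> import java.lang.management.GarbageCollectorMXBean;
> import java.lang.management.ManagementFactory;
>
> public class GcSuspensionSampler {
>     public static void main(String[] args) throws InterruptedException {
>         long last = 0;
>         while (true) {
>             long total = 0;
>             // Sum cumulative collection time (ms) over all collectors,
>             // e.g. ParNew + ConcurrentMarkSweep with our CMS setup.
>             for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans())
>                 total += gc.getCollectionTime();
>             // The delta per interval approximates GC suspension in that interval.
>             System.out.printf("GC time in last 10s: %d ms%n", total - last);
>             last = total;
>             Thread.sleep(10_000);
>         }
>     }
> }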
> This all means that 3.0.14 can't handle the incoming load that 2.1.18
> handled. We would need to either scale up (e.g. m4.xlarge => m4.2xlarge) or
> scale out to be able to handle the same load, which is not an option
> cost-wise.
> Unfortunately I do not have Java Flight Recorder runs for 2.1.18 at the
> mentioned load, but I can provide a JFR session for our current 3.0.14
> setup. The attached 5-minute JFR memory allocation view
> (cassandra3014_jfr_5min.jpg) shows compaction as the top contributor for
> the captured time frame. It could be coincidence that compaction dominated
> this particular 5-minute window (although the mentioned simulated client
> load was attached), but according to the stack traces we see new classes
> from 3.0, e.g. BTreeRow.searchIterator(), popping up as top contributors,
> so the new classes / data structures are possibly causing much more object
> churn now.
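> For anyone who wants to capture an equivalent recording: a 5-minute JFR
> allocation recording like the attached one can be taken with something
> along these lines (the recording name and path are placeholders; on Oracle
> JDK 8 the Cassandra JVM also needs -XX:+UnlockCommercialFeatures
> -XX:+FlightRecorder):
> jcmd <cassandra-pid> JFR.start name=alloc settings=profile duration=5m filename=/tmp/cassandra3014.jfr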