C. Scott Andreas updated CASSANDRA-13900:
    Component/s: Core

> Massive GC suspension increase after updating to 3.0.14 from 2.1.18
> -------------------------------------------------------------------
>                 Key: CASSANDRA-13900
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13900
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Thomas Steinmaurer
>            Priority: Blocker
>         Attachments: cassandra2118_vs_3014.jpg, cassandra3014_jfr_5min.jpg, 
> cassandra_3.11.0_min_memory_utilization.jpg
> In short: After upgrading to 3.0.14 (from 2.1.18), we aren't able to process 
> the same incoming write load on the same infrastructure anymore.
> We have a loadtest environment running 24x7, testing our software with 
> Cassandra as the backend. Both loadtest and production are hosted in AWS and 
> have the same spec on the Cassandra side, namely:
> * 9x m4.xlarge
> * 8G heap
> * CMS (400MB newgen)
> * 2TB EBS gp2
> * Client requests are entirely CQL
> per node. We have a solid, constant baseline in loadtest at ~60% CPU 
> (cluster average) with constant, simulated load running against our cluster, 
> which has been on Cassandra 2.1 for more than 2 years.
> Recently we started upgrading this 9-node loadtest environment to 3.0.14, 
> and 3.0.14 simply isn't able to cope with the load anymore. We applied no 
> special tweaks or changes to memory settings etc.; everything is the same as 
> in 2.1.18. We also haven't upgraded sstables yet, so the increase shown in 
> the screenshot is not related to any manually triggered maintenance 
> operation after upgrading to 3.0.14.
> According to our monitoring, with 3.0.14 we see a *GC suspension time 
> increase by a factor of > 2*, which directly correlates with a CPU 
> increase to > 80%. See the attached screenshot "cassandra2118_vs_3014.jpg".
> This all means that 3.0.14 can't handle the incoming load that 2.1.18 
> handles. So we would need to either scale up (e.g. m4.xlarge => 
> m4.2xlarge) or scale out to handle the same load, which is not an option 
> cost-wise.
> Unfortunately I do not have Java Flight Recorder runs for 2.1.18 at the 
> mentioned load, but I can provide a JFR session for our current 3.0.14 
> setup. The attached 5min JFR memory allocation area 
> (cassandra3014_jfr_5min.jpg) shows compaction as the top contributor for 
> the captured 5min time frame. It could be by "accident" that the 5min 
> window happened to catch compaction as the top contributor (although the 
> mentioned simulated client load was running), but according to the stack 
> traces, we see new classes from 3.0, e.g. BTreeRow.searchIterator(), 
> popping up as top contributors, so possibly the new classes / data 
> structures are causing much more object churn now.
