[ https://issues.apache.org/jira/browse/CASSANDRA-15400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973109#comment-16973109 ]

Thomas Steinmaurer commented on CASSANDRA-15400:
------------------------------------------------

 [~bdeggleston], thanks for the follow-up. Yes, this is causing quite some pain 
in prod at the moment; e.g. yesterday evening we came close to running OOM again.
!oldgen_increase_nov12.jpg!

> Cassandra 3.0.18 went OOM several hours after joining a cluster
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-15400
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15400
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/SSTable
>            Reporter: Thomas Steinmaurer
>            Assignee: Blake Eggleston
>            Priority: Normal
>             Fix For: 3.0.20, 3.11.6, 4.0
>
>         Attachments: cassandra_hprof_bigtablereader_statsmetadata.png, 
> cassandra_hprof_dominator_classes.png, cassandra_hprof_statsmetadata.png, 
> cassandra_jvm_metrics.png, cassandra_operationcount.png, 
> cassandra_sstables_pending_compactions.png, image.png, 
> oldgen_increase_nov12.jpg
>
>
> We have been moving from Cassandra 2.1.18 to Cassandra 3.0.18 and have faced 
> an OOM twice with 3.0.18 on newly added nodes, several hours after they had 
> successfully bootstrapped into an existing cluster.
> Running in AWS:
> * m5.2xlarge, EBS SSD (gp2)
> * Xms/Xmx12G, Xmn3G, CMS GC, OpenJDK8u222
> * 4 compaction threads, throttling set to 32 MB/s
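> For reference, the above roughly corresponds to the following settings (a 
> sketch only; the exact cassandra-env.sh / cassandra.yaml entries on our nodes 
> may differ):
> {noformat}
> # cassandra-env.sh (CMS collector, 12G heap, 3G new gen)
> MAX_HEAP_SIZE="12G"
> HEAP_NEWSIZE="3G"
> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>
> # cassandra.yaml (compaction settings)
> concurrent_compactors: 4
> compaction_throughput_mb_per_sec: 32
> {noformat}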
> What we see is a steady increase in the OLD gen over many hours.
> !cassandra_jvm_metrics.png!
> * The node started to join / auto-bootstrap the cluster on Oct 30 ~ 12:00
> * It basically finished joining the cluster (UJ => UN) ~ 19hrs later on Oct 
> 31 ~ 07:00, at which point it also started serving client read requests
> !cassandra_operationcount.png!
> Memory-wise (on-heap) it didn't look that bad at that time, but old gen usage 
> constantly increased.
> We see a correlation with an increased number of SSTables and pending compactions.
> !cassandra_sstables_pending_compactions.png!
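> (A quick way to spot-check the same numbers on a node, as a pointer only and 
> not necessarily the exact source of the chart above:)
> {noformat}
> nodetool compactionstats          # pending compaction tasks
> nodetool cfstats <keyspace>       # per-table "SSTable count" (3.0; "tablestats" in newer versions)
> {noformat}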
> This continued until we hit the OOM during the night of Nov 1. After a 
> Cassandra restart (the metric gap in the chart above), the number of SSTables 
> + pending compactions is still high, but we have not faced memory troubles since.
> This correlation is confirmed by the auto-generated heap dump, which contains 
> e.g. ~5K BigTableReader instances with ~8.7 GByte retained heap in total.
> !cassandra_hprof_dominator_classes.png!
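> (In case it helps: the list above is from MAT's dominator tree; an equivalent 
> way to pull the same instances out of the hprof would be an OQL query roughly 
> along these lines, assuming the 3.0 class name:)
> {noformat}
> SELECT r, r.@retainedHeapSize
> FROM org.apache.cassandra.io.sstable.format.big.BigTableReader r
> {noformat}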
> Having a closer look at a single object instance, it seems each instance is 
> ~2 MByte in size, with 2 pre-allocated byte buffers at ~1 MByte each 
> (highlighted in the screenshot below).
> !cassandra_hprof_bigtablereader_statsmetadata.png!
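> (Back-of-the-envelope, just to connect the numbers: 2 buffers x ~1 MByte ≈ 
> 2 MByte per reader, and ~5K readers x ~2 MByte ≈ 10 GByte, which is in the 
> same ballpark as the ~8.7 GByte retained heap reported by MAT above.)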
> We have been running 2.1.18 for > 3 years and I can't remember dealing with 
> such an OOM in the context of extending a cluster.
> While the MAT screenshots above are from our production cluster, we can partly 
> reproduce this behavior in our load-test environment (although it does not go 
> full OOM there), so I might be able to share an hprof from this non-prod 
> environment if needed.
> Thanks a lot.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
