[ https://issues.apache.org/jira/browse/CASSANDRA-15400 ]
Thomas Steinmaurer updated CASSANDRA-15400:
-------------------------------------------
    Attachment: cassandra_hprof_statsmetadata.png

> Cassandra 3.0.18 went OOM several hours after joining a cluster
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-15400
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15400
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Thomas Steinmaurer
>            Assignee: Blake Eggleston
>            Priority: Normal
>         Attachments: cassandra_hprof_bigtablereader_statsmetadata.png, cassandra_hprof_dominator_classes.png, cassandra_hprof_statsmetadata.png, cassandra_jvm_metrics.png, cassandra_operationcount.png, cassandra_sstables_pending_compactions.png
>
> We have been moving from Cassandra 2.1.18 to Cassandra 3.0.18 and have twice hit an OOM with 3.0.18 on newly added nodes, several hours after they successfully bootstrapped into an existing cluster.
>
> Running in AWS:
> * m5.2xlarge, EBS SSD (gp2)
> * Xms/Xmx 12G, Xmn 3G, CMS GC
> * 4 compaction threads, compaction throughput throttled to 32 MB/s
>
> What we see is a steady increase in old-gen usage over many hours.
> !cassandra_jvm_metrics.png!
> * The node started to join / auto-bootstrap the cluster on Oct 30 at ~12:00.
> * It essentially finished joining the cluster (UJ => UN) ~19 hours later, on Oct 31 at ~07:00, at which point it also started serving client read requests.
> !cassandra_operationcount.png!
> On-heap memory usage did not look too bad at that point, but old-gen usage kept increasing. We see a correlation with the increased number of SSTables and pending compactions.
> !cassandra_sstables_pending_compactions.png!
> This continued until the node went OOM during the night of Nov 1. After a Cassandra restart (the metric gap in the chart above), the number of SSTables and pending compactions was still high, but we have not had memory trouble since.
> The auto-generated heap dump confirms this correlation: it contains ~5K BigTableReader instances with ~8.7 GB retained heap in total.
> !cassandra_hprof_dominator_classes.png!
> Looking at a single object instance, each appears to retain ~2 MB,
> !cassandra_hprof_bigtablereader_statsmetadata.png!
> with two pre-allocated byte buffers (highlighted in the screenshot above) of ~1 MB each. A back-of-envelope check of these numbers follows at the end of this message.
> We ran 2.1.18 for more than three years and I can't remember dealing with such an OOM while extending a cluster.
> While the MAT screenshots above are from our production cluster, we can partly reproduce this behavior in our load-test environment (although without going full OOM there), so I might be able to share an hprof from that non-production environment if needed.
> Thanks a lot.
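As a rough sanity check of the heap-dump numbers, a minimal sketch: the instance count, buffer count, and buffer size below are the approximate values read off the MAT screenshots (~5K BigTableReader instances, two ~1 MB buffers each), and the class name is purely illustrative.

{code:java}
// Back-of-envelope estimate of the heap retained by BigTableReader
// instances, using approximate values from the MAT screenshots above.
// The exact counts and sizes are assumptions taken from the report,
// not measured values.
public final class RetainedHeapEstimate
{
    public static void main(String[] args)
    {
        long readers = 5_000L;          // ~5K BigTableReader instances in the dump
        long buffersPerReader = 2L;     // two pre-allocated byte buffers per instance
        long bufferBytes = 1L << 20;    // ~1 MiB per buffer

        long totalBytes = readers * buffersPerReader * bufferBytes;

        // Prints ~9.8 GiB, the same order of magnitude as the ~8.7 GB
        // retained heap MAT reports for the BigTableReader dominator set.
        System.out.printf("estimated retained heap: %.1f GiB%n",
                          totalBytes / (double) (1L << 30));
    }
}
{code}

Under these assumptions the two pre-allocated buffers alone account for essentially all of the ~2 MB retained per instance, and roughly all of the ~8.7 GB total.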