[ 
https://issues.apache.org/jira/browse/CASSANDRA-15400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968762#comment-16968762
 ] 

Thomas Steinmaurer edited comment on CASSANDRA-15400 at 11/6/19 10:18 PM:
--------------------------------------------------------------------------

[~marcuse], the data model has evolved over time: we started with Astyanax/Thrift and moved over 
to pure CQL3 access (without a real data migration), but we still use our own 
application-side serializer framework working with byte buffers, hence BLOBs on 
the data model side.

Our high-volume (usually > 1 TByte per node, RF=3) CF/table looks like the following; 
according to our per-CF JMX-based self-monitoring, it also accounts for the majority 
of the increasing number of pending compaction tasks:
{noformat}
CREATE TABLE ks.cf1 (
    k blob,
    n blob,
    v blob,
    PRIMARY KEY (k, n)
) WITH COMPACT STORAGE
...
;
{noformat}
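For reference, a minimal sketch of how such a per-CF pending-compactions gauge can be read over JMX (this is not our actual monitoring code, just an illustration). The ObjectName below is an assumption: 3.0 exposes per-table metrics under {{type=Table}} while 2.1 used {{type=ColumnFamily}}, and the endpoint/credentials depend on the node setup.
{noformat}
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class PendingCompactionsProbe {
    public static void main(String[] args) throws Exception {
        // Assumed local JMX endpoint (Cassandra's default JMX port 7199, no auth).
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            // 3.0-style per-table metric; 2.1 registers it under type=ColumnFamily instead.
            ObjectName gauge = new ObjectName(
                    "org.apache.cassandra.metrics:type=Table,keyspace=ks,scope=cf1,name=PendingCompactions");
            Object pending = mbs.getAttribute(gauge, "Value");
            System.out.println("PendingCompactions(ks.cf1) = " + pending);
        } finally {
            jmxc.close();
        }
    }
}
{noformat}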
We also tend to have single partitions in the area of > 100 MByte, e.g. visible in the 
corresponding compaction entries in the Cassandra log, but none of that has been a 
real problem in practice with a heap of Xms/Xmx=12G resp. Xmn=3G on Cassandra 2.1.

A few additional thoughts:
 * Likely the Cassandra node is utilizing most of the compaction threads (4 in 
this scenario with the m5.2xlarge instance type) for larger compactions on 
streamed data, leaving less room for compacting live data / actual writes 
while the node is in UJ. This results in many more small SSTables (looks 
like we have/had plenty in the area of 10-50 MByte) being hit once the node goes 
UN and starts to serve read requests
 * Is there anything known in Cassandra 3.0 that might result in streaming more 
data from other nodes than 2.1 did, thus increasing the compaction work 
for newly joined nodes?
 * Is there anything known in Cassandra 3.0 that results in more frequent memtable 
flushes compared to 2.1, again increasing the compaction work?
 * Talking about a single {{BigTableReader}} instance again: did anything 
change in regard to the 1 MByte byte buffer pre-allocation in {{StatsMetadata}} 
per {{minClusteringValues}} / {{maxClusteringValues}} data member, as shown in 
the hprof? Looks to me like we potentially waste quite some on-heap memory here 
(see the sketch after this list)
 !cassandra_hprof_statsmetadata.png|width=800!
 * Is {{StatsMetadata}} purely on-heap, or is it somehow pulled from off-heap 
first, resulting in the 1 MByte allocation? This reminds me a bit of the NIO cache 
buffer bug 
(https://support.datastax.com/hc/en-us/articles/360000863663-JVM-OOM-direct-buffer-errors-affected-by-unlimited-java-nio-cache),
 where the recommendation was to set -Djdk.nio.maxCachedBufferSize=1048576, i.e. 
exactly the number we see in the hprof for the on-heap byte buffer (a quick way 
to watch the direct buffer pools is sketched at the end of this comment)
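To illustrate what I mean by wasted on-heap memory, a back-of-the-envelope sketch (this is not the actual Cassandra code) using the numbers from this ticket: ~5K {{BigTableReader}} instances, 2 buffers each, 1 MByte pre-allocated per buffer, and an assumed typical clustering value size of 64 bytes.
{noformat}
import java.nio.ByteBuffer;

public class ClusteringBufferWaste {
    public static void main(String[] args) {
        int readers = 5_000;              // ~5K BigTableReader instances seen in the heap dump
        int buffersPerReader = 2;         // minClusteringValues + maxClusteringValues
        int preallocated = 1 << 20;       // 1 MByte, as seen in the hprof
        int actualSize = 64;              // assumed typical clustering value size in bytes

        // Fixed-size pre-allocation: the full capacity stays retained, however little is used.
        ByteBuffer fixed = ByteBuffer.allocate(preallocated);
        fixed.put(new byte[actualSize]);

        // Right-sized allocation: capacity matches the data actually stored.
        ByteBuffer sized = ByteBuffer.allocate(actualSize);
        sized.put(new byte[actualSize]);

        long retainedFixed = (long) readers * buffersPerReader * fixed.capacity();
        long retainedSized = (long) readers * buffersPerReader * sized.capacity();
        System.out.printf("pre-allocated 1M buffers: ~%d MByte retained%n", retainedFixed >> 20);
        System.out.printf("right-sized buffers     : ~%d KByte retained%n", retainedSized >> 10);
    }
}
{noformat}
With those assumptions the pre-allocated variant retains roughly 10 GByte, which is in the same ballpark as the ~8.7 GByte retained heap seen in the dominator view.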

The number of compaction threads and the compaction throttling were left unchanged 
during the upgrade from 2.1 to 3.0, and if memory serves me well, 3.0 should even 
show improved compaction throughput with the same throttling settings anyway.
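
Regarding the off-heap/NIO question in the list above: a minimal sketch (plain JDK API, nothing Cassandra-specific) of how the direct buffer pools can be watched from inside the JVM, to see whether off-heap buffer usage grows alongside the on-heap {{StatsMetadata}} buffers. The same numbers are also exposed over JMX under {{java.nio:type=BufferPool}}.
{noformat}
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

public class DirectBufferPoolCheck {
    public static void main(String[] args) {
        // The JDK exposes the "direct" and "mapped" buffer pools as platform MXBeans.
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            System.out.printf("%-8s count=%d used=%d bytes capacity=%d bytes%n",
                    pool.getName(), pool.getCount(),
                    pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}
{noformat}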


> Cassandra 3.0.18 went OOM several hours after joining a cluster
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-15400
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15400
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Thomas Steinmaurer
>            Assignee: Blake Eggleston
>            Priority: Normal
>         Attachments: cassandra_hprof_bigtablereader_statsmetadata.png, 
> cassandra_hprof_dominator_classes.png, cassandra_hprof_statsmetadata.png, 
> cassandra_jvm_metrics.png, cassandra_operationcount.png, 
> cassandra_sstables_pending_compactions.png
>
>
> We have been moving from Cassandra 2.1.18 to Cassandra 3.0.18 and have twice 
> faced an OOM with 3.0.18 on newly added nodes, several hours after they had 
> successfully bootstrapped into an existing cluster.
> Running in AWS:
> * m5.2xlarge, EBS SSD (gp2)
> * Xms/Xmx12G, Xmn3G, CMS GC
> * 4 compaction threads, throttling set to 32 MB/s
> What we see is a steady increase in the OLD gen over many hours.
> !cassandra_jvm_metrics.png!
> * The node started to join / auto-bootstrap the cluster on Oct 30 ~ 12:00
> * It basically finished joining the cluster (UJ => UN) ~ 19 hrs later, on Oct 
> 31 ~ 07:00, and also started serving client read requests
> !cassandra_operationcount.png!
> Memory-wise (on-heap) it didn't look that bad at that time, but old gen usage 
> increased constantly.
> We see a correlation with an increased number of SSTables and pending compactions.
> !cassandra_sstables_pending_compactions.png!
> This went on until we hit the OOM somewhere in the night of Nov 1. After a 
> Cassandra restart (metric gap in the chart above), the number of SSTables + 
> pending compactions is still high, but we have not faced memory troubles since then.
> This correlation is confirmed by the auto-generated heap dump with e.g. ~ 5K 
> BigTableReader instances with ~ 8.7GByte retained heap in total.
> !cassandra_hprof_dominator_classes.png!
> Having a closer look at a single object instance, it seems each instance is 
> ~ 2 MByte in size.
> !cassandra_hprof_bigtablereader_statsmetadata.png!
> This comes down to 2 pre-allocated byte buffers (highlighted in the screenshot 
> above) at 1 MByte each.
> We have been running 2.1.18 for > 3 years and I can't remember dealing with 
> such an OOM in the context of extending a cluster.
> While the MAT screenshots above are from our production cluster, we can partly 
> reproduce this behavior in our loadtest environment (although it doesn't go 
> full OOM there), so I might be able to share a hprof from this non-prod 
> environment if needed.
> Thanks a lot.


