RE: Cassandra 3.0.18 went OOM several hours after joining a cluster

Steinmaurer, Thomas Wed, 06 Nov 2019 11:43:11 -0800

Reid,

thanks for your thoughts.


I agree with your last comment, and I’m fairly convinced that the increasing 
number of SSTables is causing the issue. I’m not sure whether the trigger is 
compaction, read requests (after the node flipped from UJ to UN), or both, but 
I lean towards client read requests touching a large number of SSTables, which 
results in ~2 MB of on-heap usage per BigTableReader instance, with ~5K such 
instances on the heap.
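
As a rough sanity check on those numbers, here is a minimal Python sketch. It 
takes the ~2 MB and ~5K figures above at face value and ignores everything else 
on the heap; the function name is only illustrative.

```python
MIB = 1024 * 1024
GIB = 1024 ** 3


def estimated_reader_heap_bytes(reader_count, bytes_per_reader):
    """Approximate on-heap footprint if every reader retains the same amount."""
    return reader_count * bytes_per_reader


# Figures quoted above: ~5K BigTableReader instances at ~2 MB each.
total = estimated_reader_heap_bytes(5_000, 2 * MIB)
print(f"{total / GIB:.2f} GiB")  # prints "9.77 GiB"
```

So the readers alone would account for roughly 10 GB, if both estimates hold.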

The big question for us is why this starts to show up with Cassandra 3.0, when 
we never saw it with 2.1 in more than 3 years of production use.

To avoid duplicate work, I will continue to provide additional information and 
thoughts on the Cassandra ticket.

Regards,
Thomas

From: Reid Pinchback <rpinchb...@tripadvisor.com>
Sent: Wednesday, 06 November 2019 18:28
To: user@cassandra.apache.org
Subject: Re: Cassandra 3.0.18 went OOM several hours after joining a cluster

The other thing that comes to mind is that the increase in pending compactions 
suggests back pressure on compaction activity.  GC is only one possible source 
of that.  Between your throughput setting and how your disk I/O is set up, 
maybe you’re being throttled to the point where new compaction work arrives 
faster than compactions complete.

In fact, the more that I think about it, I wonder about that a lot.

If you can’t keep up with compactions, then operations have to span more and 
more SSTables over time.  You’ll keep holding on to what you read, as you read 
more of them, until eventually…pop.
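
A minimal Python sketch of that dynamic, with purely hypothetical rates (none 
of these numbers come from the cluster in question, and the function name is 
only illustrative): whenever SSTables are created faster than compaction 
retires them, the backlog grows without bound.

```python
def sstable_backlog(created_per_hour, compacted_per_hour, hours, start=0):
    """Return the SSTable backlog at the end of each hour.

    Both rates are hypothetical inputs; the backlog can never go below zero.
    """
    backlog = start
    history = []
    for _ in range(hours):
        backlog = max(0, backlog + created_per_hour - compacted_per_hour)
        history.append(backlog)
    return history


# Compaction falling behind: the backlog climbs every hour.
print(sstable_backlog(12, 10, 5))  # [2, 4, 6, 8, 10]

# Compaction keeping up: the backlog stays at zero.
print(sstable_backlog(8, 10, 3))   # [0, 0, 0]
```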


From: Reid Pinchback <rpinchb...@tripadvisor.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, November 6, 2019 at 12:11 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Cassandra 3.0.18 went OOM several hours after joining a cluster

My first thought was that you were running into the merkle tree depth problem, 
but the details on the ticket don’t seem to confirm that.

It does look like eden is too small.  C* lives in Java’s GC pain point: a lot 
of medium-lifetime objects.  If you haven’t already done so, you’ll want to 
configure as many things to be off-heap as you can, but I’d definitely look at 
improving the ratio of eden to old gen, and see if you can get the young gen GC 
activity to be more successful at sweeping away the medium-lived objects.

The only other thing that comes to mind is whether you’re getting to a point 
where GC isn’t coping.  That can sometimes be hard to spot on metrics with 
coarse granularity.  Per-second metrics might show CPU cores getting pegged.

I’m not sure that GC tuning eliminates this problem, but if it isn’t being 
caused by that, GC tuning may at least improve the visibility of the underlying 
problem.

From: "Steinmaurer, Thomas" <thomas.steinmau...@dynatrace.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, November 6, 2019 at 11:27 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Cassandra 3.0.18 went OOM several hours after joining a cluster

Hello,

after moving from 2.1.18 to 3.0.18, we are facing OOM situations several hours 
after a node has successfully joined a cluster (via auto-bootstrap).

I have created the following ticket describing the situation, including 
hprof / MAT screenshots: 
https://issues.apache.org/jira/browse/CASSANDRA-15400

Would be great if someone could have a look.

Thanks a lot.

Thomas
The contents of this e-mail are intended for the named addressee only. It 
contains information that may be confidential. Unless you are the named 
addressee or an authorized designee, you may not copy or use it, or disclose it 
to anyone else. If you received it in error please notify us immediately and 
then destroy it. Dynatrace Austria GmbH (registration number FN 91482h) is a 
company registered in Linz whose registered office is at 4040 Linz, Austria, 
Freistädterstraße 313
