[
https://issues.apache.org/jira/browse/CASSANDRA-10150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706772#comment-14706772
]
Jorge Rodriguez commented on CASSANDRA-10150:
---------------------------------------------
We came across this thread from Benedict on the jmx-dev mailing list yesterday
and implemented the workaround he recommends there:
http://mail.openjdk.java.net/pipermail/jmx-dev/2014-February/000585.html
The workaround is to enable the "CMSClassUnloadingEnabled" JVM flag.
Since enabling this flag yesterday we are no longer seeing the memory leak,
and so far performance does not appear to have been affected.
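For reference, one way to set the flag is via conf/cassandra-env.sh (a minimal sketch;
the exact file and JVM_OPTS pattern may differ per install):
{noformat}
# conf/cassandra-env.sh -- append the flag to the options Cassandra starts with
JVM_OPTS="$JVM_OPTS -XX:+CMSClassUnloadingEnabled"
{noformat}
The flag only has an effect when CMS is the collector in use (Cassandra's default
-XX:+UseConcMarkSweepGC), and a rolling restart is needed for it to take effect.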
> Cassandra read latency potentially caused by memory leak
> --------------------------------------------------------
>
> Key: CASSANDRA-10150
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10150
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Environment: cassandra 2.0.12
> Reporter: Cheng Ren
>
> We are currently migrating to a new Cassandra cluster which is multi-region
> on EC2. Our previous cluster was also on EC2 but only in the east region.
> In addition we have upgraded from Cassandra 2.0.4 to 2.0.12 and from Ubuntu
> 12 to 14.
> We are investigating a Cassandra latency problem on our new cluster. The
> symptom is that over a long period of time (12-16 hours) the TP90-95 read
> latency degrades to the point of being well above our SLAs. During normal
> operation our TP95 for a 50-key lookup is 75ms; when fully degraded, we are
> facing 300ms TP95 latencies. Doing a rolling restart resolves the problem.
> We are noticing a high correlation between the Old Gen heap usage (and how
> much of it is freed) and the high latencies. We are running with a max heap
> size of 12GB and a max new-gen size of 2GB.
> Below is a chart of the heap usage over a 24-hour period. Right below it is
> a chart of TP95 latencies (a mixed workload of 50-key and single-key
> lookups), and the third image is a look at CMS Old Gen memory usage:
> Overall heap usage over 24 hrs:
> !https://dl.dropboxusercontent.com/u/303980955/1.png|height=300,width=500!
> TP95 latencies over 24 hours:
> !https://dl.dropboxusercontent.com/u/303980955/2.png|height=300,width=500!
> OldGen memory usage over 24 hours:
> !https://dl.dropboxusercontent.com/u/303980955/3.png|height=300,width=500!
> You can see from this that the old gen section of the heap is what is using
> up the majority of the heap space. We cannot figure out why this memory is
> not being collected during a full GC. For reference, in our old Cassandra
> cluster, the behavior is that a full GC clears up the majority of the heap
> space. See the image below from an old production node operating normally:
> !https://dl.dropboxusercontent.com/u/303980955/4.png|height=300,width=500!
> From a heap dump we found that most of the memory is consumed by unreachable
> objects. With further analysis we were able to see that those objects are
> RMIConnectionImpl$CombinedClassLoader$ClassLoaderWrapper (holding 4GB of
> memory) and java.security.ProtectionDomain (holding 2GB). The only place we
> know Cassandra uses RMI is JMX, but does anyone have any clue as to where
> else those objects are used, and why they take so much memory?
> It would also be great if someone could offer further debugging tips on the
> latency or GC issue.
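As one possible debugging step for the class-loader question above, here is a minimal
Java sketch (the class name and host argument are hypothetical) that connects to a
node's JMX port (7199 by default) and samples the JVM's class-loading counters. A
loaded-class count that only ever grows while the unloaded count stays at zero would
be consistent with class loaders never being collected. Note that connecting over JMX
itself goes through RMI, so each check contributes a little to the very objects being
investigated.
{code:java}
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ClassUnloadingCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical node address; Cassandra exposes JMX on port 7199 by default.
        String host = args.length > 0 ? args[0] : "localhost";
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            ClassLoadingMXBean classLoading = ManagementFactory.newPlatformMXBeanProxy(
                    mbsc, ManagementFactory.CLASS_LOADING_MXBEAN_NAME, ClassLoadingMXBean.class);
            // If "unloaded" never moves while "loaded" keeps climbing, classes
            // (and the class loaders holding them) are not being collected.
            System.out.printf("loaded=%d unloaded=%d totalLoaded=%d%n",
                    classLoading.getLoadedClassCount(),
                    classLoading.getUnloadedClassCount(),
                    classLoading.getTotalLoadedClassCount());
        }
    }
}
{code}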
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)