[ https://issues.apache.org/jira/browse/CASSANDRA-10150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706772#comment-14706772 ]

Jorge Rodriguez commented on CASSANDRA-10150:
---------------------------------------------

We came across this thread from Benedict on the jmx-dev mailing list yesterday 
and implemented the workaround he recommends there: 
http://mail.openjdk.java.net/pipermail/jmx-dev/2014-February/000585.html
which is to enable the "CMSClassUnloadingEnabled" flag.

Since enabling this flag yesterday we have not seen the memory leak, and 
performance does not appear to have been impacted so far either.
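
For anyone else hitting this: it is a standard HotSpot option, passed as 
-XX:+CMSClassUnloadingEnabled (for Cassandra that typically means appending it 
to JVM_OPTS in cassandra-env.sh, though exactly where you set it depends on 
your packaging; that part is an assumption about your setup). A minimal sketch, 
assuming a HotSpot JVM, to confirm at runtime that the option actually took 
effect:

{code:java}
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;

import java.lang.management.ManagementFactory;

// Minimal sketch: confirm at runtime that CMSClassUnloadingEnabled is in effect.
// Assumes a HotSpot JVM (the com.sun.management beans are HotSpot-specific).
public class CheckClassUnloadingFlag {
    public static void main(String[] args) {
        HotSpotDiagnosticMXBean diag =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        VMOption option = diag.getVMOption("CMSClassUnloadingEnabled");
        // getOrigin() reports whether the value was set explicitly or is the default.
        System.out.println("CMSClassUnloadingEnabled = " + option.getValue()
                + " (origin: " + option.getOrigin() + ")");
    }
}
{code}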

> Cassandra read latency potentially caused by memory leak
> --------------------------------------------------------
>
>                 Key: CASSANDRA-10150
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10150
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: cassandra 2.0.12
>            Reporter: Cheng Ren
>
>   We are currently migrating to a new Cassandra cluster that is multi-region 
> on EC2.  Our previous cluster was also on EC2 but only in the east region.  
> In addition, we have upgraded from Cassandra 2.0.4 to 2.0.12 and from Ubuntu 
> 12 to 14.
>   We are investigating a Cassandra latency problem on our new cluster.  The 
> symptom is that over a long period of time (12-16 hours) the TP90-95 read 
> latency degrades to the point of being well above our SLAs.  During normal 
> operation our TP95 for a 50-key lookup is 75ms; when fully degraded, we are 
> facing 300ms TP95 latencies.  Doing a rolling restart resolves the problem.
> We are noticing a high correlation between Old Gen heap usage (and how 
> much of it is freed) and the high latencies.  We are running with a max heap 
> size of 12GB and a max new-gen size of 2GB.
> Below is a chart of heap usage over a 24 hour period.  Right below it is 
> a chart of TP95 latencies (from a mixed workload of 50-key and single-key 
> lookups), and the third image is a look at CMS Old Gen memory usage:
> Overall heap usage over 24 hrs:
> !https://dl.dropboxusercontent.com/u/303980955/1.png|height=300,width=500!
> TP95 latencies over 24 hours:
> !https://dl.dropboxusercontent.com/u/303980955/2.png|height=300,width=500!
> OldGen memory usage over 24 hours:
> !https://dl.dropboxusercontent.com/u/303980955/3.png|height=300,width=500!
>  You can see from this that the Old Gen section of our heap is using up 
> the majority of the heap space.  We cannot figure out why that memory is 
> not being collected during a full GC.  For reference, in our old Cassandra 
> cluster a full GC clears up the majority of the heap space.  See the image 
> below from an old production node operating normally:
> !https://dl.dropboxusercontent.com/u/303980955/4.png|height=300,width=500!
> From the heap dump file we found that most memory is consumed by unreachable 
> objects.  With further analysis we were able to see that those objects are 
> RMIConnectionImpl$CombinedClassLoader$ClassLoaderWrapper (holding 4GB of 
> memory) and java.security.ProtectionDomain (holding 2GB).  The only place we 
> know Cassandra uses RMI is in JMX, but does anyone have any clue as to where 
> else those objects are used?  And why do they take up so much memory?
> It would also be great if someone could offer any further debugging tips on 
> the latency or GC issue.
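
Regarding the RMI/JMX angle in the quoted description: as far as we can tell 
from the linked thread, the ClassLoaderWrapper instances are created on the 
server side by RMIConnectionImpl for remote JMX connections, so an external 
monitoring agent that repeatedly connects, polls, and disconnects keeps 
producing them, and with CMS they are only reclaimed once class unloading is 
enabled.  Below is a rough sketch of that polling pattern, which also doubles 
as a way to watch Old Gen usage from outside the node.  It assumes the default 
Cassandra JMX port 7199 on localhost and no JMX authentication, so adjust for 
your environment.

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Rough sketch: poll the Old Gen memory pool over JMX the way an external
// monitoring agent would. Each connect/close cycle goes through the server-side
// RMIConnectionImpl path where the wrapper class loaders seen in the heap dump
// are created. Assumes Cassandra's default JMX port (7199) on localhost and no
// JMX authentication; both are assumptions about the environment.
public class JmxOldGenPoller {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        for (int i = 0; i < 1000; i++) {
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection conn = connector.getMBeanServerConnection();
                for (MemoryPoolMXBean pool : ManagementFactory.getPlatformMXBeans(
                        conn, MemoryPoolMXBean.class)) {
                    if (pool.getName().contains("Old Gen")) {
                        System.out.printf("%s used: %d MB%n",
                                pool.getName(), pool.getUsage().getUsed() >> 20);
                    }
                }
            } finally {
                connector.close();
            }
            Thread.sleep(1000);
        }
    }
}
{code}

Reusing one long-lived JMX connection instead of reconnecting on every poll 
should also sidestep the per-connection class loaders, independent of the GC 
flag.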



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
