Ivar Thorson created CASSANDRA-9549:
---------------------------------------

             Summary: Memory leak 
                 Key: CASSANDRA-9549
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9549
             Project: Cassandra
          Issue Type: Bug
          Components: Core
         Environment: Cassandra 2.1.5. 9-node cluster in EC2 (m1.large nodes, 2 
cores, 7.5G memory, 800G platter for Cassandra data; root partition and commit 
log are on SSD EBS with sufficient IOPS), 3 nodes/availability zone, 1 
replica/zone

JVM: /usr/java/jdk1.8.0_40/jre/bin/java 
JVM Flags besides CP: -ea -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar 
-XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities 
-XX:ThreadPriorityPolicy=42 -Xms2G -Xmx2G -Xmn200M 
-XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled 
-XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly 
-XX:+UseTLAB -XX:CompileCommandFile=/etc/cassandra/conf/hotspot_compiler 
-XX:CMSWaitDuration=10000 -XX:+CMSParallelInitialMarkEnabled 
-XX:+CMSEdenChunksRecordAlways -XX:CMSWaitDuration=10000 -XX:+UseCondCardMark 
-Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=7199 
-Dcom.sun.management.jmxremote.rmi.port=7199 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dlogback.configurationFile=logback.xml -Dcassandra.logdir=/var/log/cassandra 
-Dcassandra.storagedir= -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid 

Kernel: Linux 2.6.32-504.16.2.el6.x86_64 #1 SMP x86_64 x86_64 x86_64 GNU/Linux


            Reporter: Ivar Thorson
            Priority: Critical
         Attachments: cpu-load.png, memoryuse.png, suspect.png, two-loads.png

We have been experiencing a severe memory leak with Cassandra 2.1.5 that, over 
a couple of days, consumes all of the available JVM heap space. The JVM then 
ends up in GC hell, where it keeps attempting CMS collections but cannot free 
any heap space. This happens on every node in our cluster and requires rolling 
Cassandra restarts just to keep the cluster running. We upgraded the cluster 
from the 2.0 branch per the DataStax docs a couple of months ago, and we had 
been using the data in this cluster for more than a year without problems.

As the heap fills up with objects that cannot be garbage collected, the CPU/OS 
load average grows along with it. Heap dumps reveal an increasing number of 
java.util.concurrent.ConcurrentLinkedQueue$Node objects. We took heap dumps 
over a 2-day period and watched the number of Node objects go from 4M, to 19M, 
to 36M, and eventually to about 65M before the node stopped responding. The 
screen capture of our heap dump is from the 19M measurement.
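
For anyone who wants to track the same growth without opening a full heap dump, 
a class histogram can be sampled periodically instead. The following is a 
minimal sketch, not something taken from this cluster: it assumes the text 
output of `jmap -histo <pid>` on JDK 8 (columns: rank, instance count, bytes, 
class name). The helper name `count_instances` and the sample rows are 
hypothetical and only illustrate the row format.

```python
import re

# Matches one data row of a JDK 8 `jmap -histo` table (assumed layout), e.g.
#    1:       4000000       96000000  java.util.concurrent.ConcurrentLinkedQueue$Node
HISTO_ROW = re.compile(r"^\s*\d+:\s+(\d+)\s+(\d+)\s+(\S+)")


def count_instances(histo_text, class_name):
    """Return the instance count for class_name, or 0 if it is not listed."""
    for line in histo_text.splitlines():
        match = HISTO_ROW.match(line)
        if match and match.group(3) == class_name:
            return int(match.group(1))
    return 0


# Illustrative input only; these counts are made up to show the row format.
sample = """\
 num     #instances         #bytes  class name
----------------------------------------------
   1:       4000000       96000000  java.util.concurrent.ConcurrentLinkedQueue$Node
   2:        120000        5760000  [B
Total       4120000      101760000
"""

node_count = count_instances(
    sample, "java.util.concurrent.ConcurrentLinkedQueue$Node"
)
print(node_count)
```

Running this against histograms captured at intervals would give the same kind 
of series as the heap dumps above, at a much lower cost per sample.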

Load on the cluster is minimal. We see this effect even with only a handful of 
writes per second (see the attached OpsCenter snapshots taken during very light 
and heavier loads), and even with only 5 reads per second.

Log files show repeated errors at Ref.java:181 and Ref.java:279, including 
"LEAK DETECTED" messages:

1. CompactionExecutor error

ERROR [CompactionExecutor:557] 2015-06-01 18:27:36,978 Ref.java:279 - Error 
when closing class 
org.apache.cassandra.io.sstable.SSTableReader$InstanceTidier@1302301946:/data1/data/ourtablegoeshere-ka-1150
java.util.concurrent.RejectedExecutionException: Task 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@32680b31 
rejected from 
org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor@573464d6[Terminated,
 pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1644]

ERROR [Reference-Reaper:1] 2015-06-01 18:27:37,083 Ref.java:181 - LEAK 
DETECTED: a reference 
(org.apache.cassandra.utils.concurrent.Ref$State@74b5df92) to class 
org.apache.cassandra.io.sstable.SSTableReader$DescriptorTypeTidy@2054303604:/data2/data/ourtablegoeshere-ka-1151
 was not released before the reference was garbage collected
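
To see which SSTables are reported as leaked and how often, the log can be 
scanned for these messages. This is a rough sketch under stated assumptions: 
each log entry is a single line in the actual log file (the entries above are 
wrapped by email), and the message layout matches the Reference-Reaper entry 
shown. The function name `leaked_sstables` is hypothetical.

```python
import re
from collections import Counter

# Assumed layout of the Ref.java:181 message, based on the entry quoted above:
#   LEAK DETECTED: a reference (<state>) to class <tidier>@<id>:<path> was not released ...
LEAK = re.compile(
    r"LEAK DETECTED: a reference \(\S+\) to class "
    r"(?P<cls>[^@\s]+)@\d+:(?P<path>\S+) was not released"
)


def leaked_sstables(log_lines):
    """Count LEAK DETECTED messages per SSTable path."""
    counts = Counter()
    for line in log_lines:
        match = LEAK.search(line)
        if match:
            counts[match.group("path")] += 1
    return counts


# The Reference-Reaper entry from this report, joined back into one line.
sample_line = (
    "ERROR [Reference-Reaper:1] 2015-06-01 18:27:37,083 Ref.java:181 - LEAK "
    "DETECTED: a reference "
    "(org.apache.cassandra.utils.concurrent.Ref$State@74b5df92) to class "
    "org.apache.cassandra.io.sstable.SSTableReader$DescriptorTypeTidy@2054303604:"
    "/data2/data/ourtablegoeshere-ka-1151"
    " was not released before the reference was garbage collected"
)

leaks = leaked_sstables([sample_line])
print(leaks)
```

In practice the input would be the lines of the Cassandra system log rather 
than a single sample string.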




