Ivar Thorson created CASSANDRA-9549:
---------------------------------------
Summary: Memory leak
Key: CASSANDRA-9549
URL: https://issues.apache.org/jira/browse/CASSANDRA-9549
Project: Cassandra
Issue Type: Bug
Components: Core
Environment: Cassandra 2.1.5. 9-node cluster in EC2 (m1.large nodes, 2
cores, 7.5GB memory, 800GB platter for Cassandra data; root partition and
commit log are on SSD EBS with sufficient IOPS), 3 nodes/availability zone, 1
replica/zone
JVM: /usr/java/jdk1.8.0_40/jre/bin/java
JVM Flags besides CP: -ea -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar
-XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities
-XX:ThreadPriorityPolicy=42 -Xms2G -Xmx2G -Xmn200M
-XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseTLAB -XX:CompileCommandFile=/etc/cassandra/conf/hotspot_compiler
-XX:CMSWaitDuration=10000 -XX:+CMSParallelInitialMarkEnabled
-XX:+CMSEdenChunksRecordAlways -XX:CMSWaitDuration=10000 -XX:+UseCondCardMark
-Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=7199
-Dcom.sun.management.jmxremote.rmi.port=7199
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dlogback.configurationFile=logback.xml -Dcassandra.logdir=/var/log/cassandra
-Dcassandra.storagedir= -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid
Kernel: Linux 2.6.32-504.16.2.el6.x86_64 #1 SMP x86_64 x86_64 x86_64 GNU/Linux
Reporter: Ivar Thorson
Priority: Critical
Attachments: cpu-load.png, memoryuse.png, suspect.png, two-loads.png
We have been experiencing a severe memory leak with Cassandra 2.1.5 that, over
the course of a couple of days, eventually consumes all of the available JVM
heap space, putting the JVM into GC hell where it keeps attempting CMS
collection but can't free up any heap space. This pattern happens on every node
in our cluster and requires rolling Cassandra restarts just to keep the cluster
running. We upgraded the cluster from the 2.0 branch a couple of months ago,
following the DataStax docs, and have been using the data from this cluster for
more than a year without problems.
As the heap fills up with non-GC-able objects, the CPU/OS load average grows
along with it. Heap dumps reveal an increasing number of
java.util.concurrent.ConcurrentLinkedQueue$Node objects. We took heap dumps
over a 2-day period and watched the number of Node objects grow from 4M to 19M
to 36M, and eventually to about 65M, before the node stopped responding. The
attached screen capture of our heap dump is from the 19M measurement.
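For context, a minimal, self-contained Java sketch of the generic JDK mechanism
behind growing ConcurrentLinkedQueue$Node counts (illustrative only, and not a
claim about which Cassandra queue is involved): each offer() links a new
internal Node, so a queue that is written to but never drained grows its Node
population without bound.

{code:java}
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative only: shows the generic JDK mechanism behind growing
// ConcurrentLinkedQueue$Node counts, not Cassandra's actual queue usage.
public class NodeGrowthSketch {
    public static void main(String[] args) {
        ConcurrentLinkedQueue<Object> queue = new ConcurrentLinkedQueue<>();
        long offered = 0;
        for (int round = 1; round <= 4; round++) {
            for (int i = 0; i < 1_000_000; i++) {
                queue.offer(new Object());   // each offer allocates one CLQ.Node
                offered++;
            }
            // A heap histogram taken here (e.g. jmap -histo <pid>) shows the
            // Node count tracking the number of un-consumed elements.
            System.out.printf("round %d: %,d elements queued, never polled%n",
                              round, offered);
        }
    }
}
{code}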
Load on the cluster is minimal. We can see this effect even with only a
handful of writes per second (see the attached OpsCenter snapshots taken during
very light and heavier loads). Even with only 5 reads per second we see this
behavior.
Log files show repeated errors at Ref.java:181 and Ref.java:279, along with
"LEAK DETECTED" messages:
1. CompactionExecutor error
ERROR [CompactionExecutor:557] 2015-06-01 18:27:36,978 Ref.java:279 - Error
when closing class
org.apache.cassandra.io.sstable.SSTableReader$InstanceTidier@1302301946:/data1/data/ourtablegoeshere-ka-1150
java.util.concurrent.RejectedExecutionException: Task
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@32680b31
rejected from
org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor@573464d6[Terminated,
pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1644]
2. Reference-Reaper LEAK DETECTED error
ERROR [Reference-Reaper:1] 2015-06-01 18:27:37,083 Ref.java:181 - LEAK
DETECTED: a reference
(org.apache.cassandra.utils.concurrent.Ref$State@74b5df92) to class
org.apache.cassandra.io.sstable.SSTableReader$DescriptorTypeTidy@2054303604:/data2/data/ourtablegoeshere-ka-1151
was not released before the reference was garbage collected
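For reference, the RejectedExecutionException in the first excerpt matches the
standard JDK behavior when a task is scheduled on an executor that has already
been shut down; a minimal sketch of that generic behavior (not Cassandra code)
follows.

{code:java}
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Generic JDK behavior, not Cassandra code: scheduling work on a shut-down
// ScheduledThreadPoolExecutor throws RejectedExecutionException, so whatever
// cleanup the rejected task was supposed to perform never runs.
public class RejectedCleanupSketch {
    public static void main(String[] args) {
        ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);
        executor.shutdown();   // executor now refuses newly submitted tasks
        try {
            executor.schedule(() -> System.out.println("cleanup"), 1, TimeUnit.SECONDS);
        } catch (RejectedExecutionException e) {
            System.out.println("rejected: " + e);
        }
    }
}
{code}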
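The "LEAK DETECTED" message in the second excerpt indicates that a tracked
resource was garbage collected before its release hook ran. Below is a rough
sketch of the general phantom-reference reaper pattern, with hypothetical
names; Cassandra's org.apache.cassandra.utils.concurrent.Ref differs in detail.

{code:java}
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;

// General phantom-reference leak-detection pattern; class and method names are
// hypothetical, and this is not Cassandra's Ref implementation.
public class LeakReaperSketch {
    static final ReferenceQueue<Object> QUEUE = new ReferenceQueue<>();

    static class Tracked extends PhantomReference<Object> {
        volatile boolean released;                 // set by an explicit release()
        Tracked(Object resource) { super(resource, QUEUE); }
        void release() { released = true; }
    }

    public static void main(String[] args) throws InterruptedException {
        Tracked ref = new Tracked(new Object());   // resource is never released
        System.gc();                               // make the dead resource collectable
        // Reaper: references enqueued without release() indicate a leak.
        Tracked reaped = (Tracked) QUEUE.remove(10_000);
        if (reaped == ref && !reaped.released) {
            System.out.println("LEAK DETECTED: resource was GC'd before release()");
        }
    }
}
{code}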