[ https://issues.apache.org/jira/browse/CASSANDRA-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769406#comment-16769406 ]

Jonas Borgström commented on CASSANDRA-15006:
---------------------------------------------

Hi [~benedict]. Sorry, I just missed your reply when I wrote my own reply 
yesterday.
h4. About the graphs

All graphed values are from standard java/cassandra JMX metrics. The name of 
each graph should explain exactly which MBean is used and which attribute of 
that MBean is plotted.

So, for example, the "java.nio:type=BufferPool,name=direct/MemoryUsed" graph 
plots the "MemoryUsed" attribute of the JMX MBean with objectName 
"java.nio:type=BufferPool,name=direct". Let me know if I should explain any of 
the other graphs in more detail, but they all follow the same naming pattern.
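
For reference, reading such a value programmatically via JMX boils down to 
something like the minimal sketch below (the class name is made up, and 
whatever exporter feeds Grafana presumably does the equivalent over a remote 
JMX connection rather than in-process):

{code:java}
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class DirectBufferPoolProbe {
    public static void main(String[] args) throws Exception {
        // Platform MBeanServer of the local JVM; a remote JMX connection
        // would expose the same MBean with the same objectName.
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName direct = new ObjectName("java.nio:type=BufferPool,name=direct");

        // "MemoryUsed" is the attribute plotted in the graph above;
        // "Count" and "TotalCapacity" are the other attributes of this MBean.
        long memoryUsed = (Long) server.getAttribute(direct, "MemoryUsed");
        long count = (Long) server.getAttribute(direct, "Count");

        System.out.printf("direct buffers: %d, memory used: %d bytes%n", count, memoryUsed);
    }
}
{code}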
h5. Memory limit

I run cassandra using kubernetes. Kubernetes lets you define hard memory limits 
that are enforced using cgroups. When I initially ran into this issue I had a 
hard memory limit of 3GiB, but in order to see if this would eventually level 
out I changed that to 4GiB when I started the current set of nodes.

How did you determine that the process has "much more than this (3GiB)" 
committed to it?

If we can assume that all "DirectByteBufferR" usage refers to mapped memory and 
not actual off-heap allocations, I think the process still stays under 4GiB 
for now. It just does not level out the way I expect.
h5. Log files and configuration

I'm afraid the log entries from the node startup have already been deleted, but 
I've uploaded my cassandra.yaml and the command line used to start java.
h5. Table truncation

I did truncate a large table yesterday. As expected, the 
"java.nio:type=BufferPool,name=mapped/MemoryUsed" graph shows that the amount 
of mapped memory immediately decreased from 6GiB to 0.6GiB, corresponding well 
with the amount of data left on the system. This value also seems mostly 
unchanged now, 24 hours later, probably because not much new data has been 
written since yesterday.

The "java.nio:type=BufferPool,name=direct/MemoryUsed" graph also immediately 
dropped from 800+MiB to around 500MiB. Interestingly this value has now almost 
24h later almost made it's way back up to over 750MiB without that much more 
data being written to cassandra since yesterday.

I forgot one detail. In addition to a fairly standard CQL load (mostly selects 
and some inserts), I've also been running "while true; do nodetool repair --full 
&& sleep 1h; done" since day 1.
h5. Heap Dump

I've saved a number of heap dumps during the test (both before and after I 
truncated the table, for example), so I'll try to work out a way to make them 
downloadable.
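
For reference, a heap dump can also be triggered programmatically through the 
HotSpotDiagnostic MXBean. A minimal sketch follows (not necessarily how these 
particular dumps were taken; the class name and output path are just examples):

{code:java}
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDumper {
    public static void main(String[] args) throws Exception {
        // com.sun.management:type=HotSpotDiagnostic is also exposed over JMX,
        // so the same operation can be invoked remotely against a running node.
        HotSpotDiagnosticMXBean diag = ManagementFactory.getPlatformMXBean(
                HotSpotDiagnosticMXBean.class);

        // "true" limits the dump to live (reachable) objects; the path is an example.
        diag.dumpHeap("/tmp/cassandra-heap.hprof", true);
    }
}
{code}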

> Possible java.nio.DirectByteBuffer leak
> ---------------------------------------
>
>                 Key: CASSANDRA-15006
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15006
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: cassandra: 3.11.3
> jre: openjdk version "1.8.0_181"
> heap size: 2GB
> memory limit: 3GB (cgroup)
> I started one of the nodes with "-Djdk.nio.maxCachedBufferSize=262144" but 
> that did not seem to make any difference.
>            Reporter: Jonas Borgström
>            Priority: Major
>         Attachments: CASSANDRA-15006-reference-chains.png, 
> Screenshot_2019-02-04 Grafana - Cassandra.png, Screenshot_2019-02-14 Grafana 
> - Cassandra(1).png, Screenshot_2019-02-14 Grafana - Cassandra.png, 
> Screenshot_2019-02-15 Grafana - Cassandra.png, cassandra.yaml, cmdline.txt
>
>
> While testing a 3 node 3.11.3 cluster I noticed that the nodes were suddenly 
> killed by the Linux OOM killer after running without issues for 4-5 weeks.
> After enabling more metrics and leaving the nodes running for 12 days it sure 
> looks like the
> "java.nio:type=BufferPool,name=direct" Mbean shows a very linear growth 
> (approx 15MiB/24h, see attached screenshot). Is this expected to keep growing 
> linearly after 12 days with a constant load?
>  
> In my setup the growth/leak is about 15MiB/day so I guess in most setups it 
> would take quite a few days until it becomes noticeable. I'm able to see the 
> same type of slow growth in other production clusters even though the graph 
> data is more noisy.


