[ https://issues.apache.org/jira/browse/CASSANDRA-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16777009#comment-16777009 ]

Jonas Borgström commented on CASSANDRA-15006:
---------------------------------------------

Yay, I think I finally found a way to get the direct memory usage to stop 
growing and level out!

I've just attached an updated screenshot, and as you can see, the linear growth 
of "java.nio:type=BufferPool,name=direct/MemoryUsed" has finally stopped!
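For anyone wanting to track the same figure without a full JMX/Grafana setup: the numbers behind that MBean are also exposed in-process through the standard BufferPoolMXBean API. A minimal sketch of mine (not part of Cassandra) that prints them:

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class DirectPoolProbe {
    public static void main(String[] args) {
        // Same data as the java.nio:type=BufferPool,name=direct MBean over JMX.
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                System.out.println("direct buffers: " + pool.getCount()
                        + ", memory used: " + pool.getMemoryUsed() + " bytes"
                        + ", total capacity: " + pool.getTotalCapacity());
            }
        }
    }
}
```

Note that this MBean only counts buffers allocated through java.nio; native allocations made directly via Unsafe are invisible to it.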

It appears that as soon as I disabled the hourly "nodetool repair --full" last 
Friday, direct memory usage first dropped 10-20% and then leveled out.

But I don't really understand how running "repair --full" every hour can cause 
the direct memory usage to grow linearly. Each "repair --full" invocation only 
takes a couple of minutes, so the GC should have plenty of time to reclaim any 
resources used long before the next "repair --full" job starts.
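One mechanism that can defeat exactly this reasoning (and the one jdk.nio.maxCachedBufferSize was added to bound) is the JDK's per-thread temporary direct-buffer cache: when a heap ByteBuffer is written to a channel, the JDK copies it through a temporary direct buffer that is then parked in a thread-local cache for reuse, so the GC never sees it as garbage while the thread lives. A hedged demonstration of my own (temp-file name arbitrary), assuming OpenJDK 8 behavior:

```java
import java.io.IOException;
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ThreadLocalDirectCacheDemo {
    static long directMemoryUsed() {
        for (BufferPoolMXBean p :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(p.getName())) return p.getMemoryUsed();
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("direct-cache-demo", ".bin");
        long before = directMemoryUsed();
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            // Heap (non-direct) buffer: the JDK copies it through a temporary
            // direct buffer, which is then cached in a thread-local for reuse.
            ch.write(ByteBuffer.allocate(1 << 20)); // 1 MiB
        }
        // The cached direct buffer survives the channel being closed.
        long after = directMemoryUsed();
        System.out.println("direct memory before=" + before + " after=" + after);
        Files.delete(tmp);
    }
}
```

If repair spins up short-lived threads that each do this kind of I/O, each thread pins its own cached buffer until the thread itself dies and its thread-locals become collectible, which could plausibly look like slow linear growth under hourly repairs.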

Could this perhaps be related to CASSANDRA-14096?

 

Also, note that I did not run "repair --full" every hour when I triggered the 
OOM error the first time. That setup only ran an incremental repair every 5 
days; I only later started running repairs more frequently to try to reproduce 
the problem faster.

 

> Possible java.nio.DirectByteBuffer leak
> ---------------------------------------
>
>                 Key: CASSANDRA-15006
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15006
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: cassandra: 3.11.3
> jre: openjdk version "1.8.0_181"
> heap size: 2GB
> memory limit: 3GB (cgroup)
> I started one of the nodes with "-Djdk.nio.maxCachedBufferSize=262144" but 
> that did not seem to make any difference.
>            Reporter: Jonas Borgström
>            Priority: Major
>         Attachments: CASSANDRA-15006-reference-chains.png, 
> Screenshot_2019-02-04 Grafana - Cassandra.png, Screenshot_2019-02-14 Grafana 
> - Cassandra(1).png, Screenshot_2019-02-14 Grafana - Cassandra.png, 
> Screenshot_2019-02-15 Grafana - Cassandra.png, Screenshot_2019-02-22 Grafana 
> - Cassandra.png, Screenshot_2019-02-25 Grafana - 
> Cassandra.png.2019_02_25_16_21_30.0.svg, cassandra.yaml, cmdline.txt
>
>
> While testing a 3 node 3.11.3 cluster I noticed that the nodes were suddenly 
> killed by the Linux OOM killer after running without issues for 4-5 weeks.
> After enabling more metrics and leaving the nodes running for 12 days it sure 
> looks like the
> "java.nio:type=BufferPool,name=direct" MBean shows a very linear growth 
> (approx 15MiB/24h, see attached screenshot). Is this expected to keep growing 
> linearly after 12 days with a constant load?
>  
> In my setup the growth/leak is about 15MiB/day so I guess in most setups it 
> would take quite a few days until it becomes noticeable. I'm able to see the 
> same type of slow growth in other production clusters even though the graph 
> data is more noisy.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
