[ https://issues.apache.org/jira/browse/CASSANDRA-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781779#comment-16781779 ]

Jonas Borgström commented on CASSANDRA-15006:
---------------------------------------------

Thanks [~benedict]! Awesome work, your analysis sounds very reasonable!

I checked the logs and unfortunately these servers only keep logs for 5 days, so 
the logs from the node startups have long since been lost.

But I did find a bunch of "INFO Maximum memory usage reached (531628032), cannot 
allocate chunk of 1048576" log entries, pretty much one every hour on the hour, 
which probably corresponds to the time of the hourly Cassandra snapshots taken 
on each node.
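
If I'm reading those numbers right (my own arithmetic, not something taken from 
the logs):

    531628032 bytes = 507 MiB cap, 1048576 bytes = 1 MiB per chunk,
    so the pool hits the cap after at most 507 chunks.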

Do you have any idea what the source of these "objects with arbitrary 
lifetimes" is, and why it (at least in my tests) appears to increase linearly 
forever? If they are related to repairs somehow, I would assume that they would 
not increase from one repair to the next?

Also, about your proposed workaround for 3.11.x of lowering the chunk cache and 
buffer pool settings: would that "fix" the problem, or simply buy some more time 
until the process runs out of memory?
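
Just so I'm sure I understand which knobs you mean (this is my guess at the 
relevant settings and the values are only placeholders, please correct me if 
these are the wrong ones):

    # cassandra.yaml (3.11.x): size of the off-heap chunk cache, in MB
    file_cache_size_in_mb: 256

    # conf/jvm.options: hard cap on all direct (off-heap NIO) allocations
    -XX:MaxDirectMemorySize=1G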

I guess that, instead of lowering these two settings, simply raising the 
configured memory limit from 3 GiB to 4 or 5 GiB without changing the heap size 
setting would work equally well?

I have no problem with raising my (rather low) memory limit if I knew that I 
would end up with a setup that will not run out of memory no matter how long it 
runs.
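
For reference, this is roughly what I'm graphing. A minimal in-process sketch 
using the standard BufferPoolMXBean API; in my setup the same counters are 
scraped from the "java.nio:type=BufferPool,name=direct" MBean over JMX, so the 
class name and output format below are just illustrative:

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;

    public class DirectBufferPoolCheck {
        public static void main(String[] args) {
            // Same counters exposed by the java.nio:type=BufferPool,name=direct MBean
            for (BufferPoolMXBean pool :
                    ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
                System.out.printf("%s: count=%d, used=%d bytes, capacity=%d bytes%n",
                        pool.getName(), pool.getCount(),
                        pool.getMemoryUsed(), pool.getTotalCapacity());
            }
        }
    }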

Again, thanks for your help!

> Possible java.nio.DirectByteBuffer leak
> ---------------------------------------
>
>                 Key: CASSANDRA-15006
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15006
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: cassandra: 3.11.3
> jre: openjdk version "1.8.0_181"
> heap size: 2GB
> memory limit: 3GB (cgroup)
> I started one of the nodes with "-Djdk.nio.maxCachedBufferSize=262144" but 
> that did not seem to make any difference.
>            Reporter: Jonas Borgström
>            Priority: Major
>         Attachments: CASSANDRA-15006-reference-chains.png, 
> Screenshot_2019-02-04 Grafana - Cassandra.png, Screenshot_2019-02-14 Grafana 
> - Cassandra(1).png, Screenshot_2019-02-14 Grafana - Cassandra.png, 
> Screenshot_2019-02-15 Grafana - Cassandra.png, Screenshot_2019-02-22 Grafana 
> - Cassandra.png, Screenshot_2019-02-25 Grafana - Cassandra.png, 
> cassandra.yaml, cmdline.txt
>
>
> While testing a 3 node 3.11.3 cluster I noticed that the nodes were suddenly 
> killed by the Linux OOM killer after running without issues for 4-5 weeks.
> After enabling more metrics and leaving the nodes running for 12 days, it sure 
> looks like the "java.nio:type=BufferPool,name=direct" MBean shows very linear 
> growth (approx 15 MiB/24h, see attached screenshot). Is this expected to keep 
> growing linearly after 12 days with a constant load?
>  
> In my setup the growth/leak is about 15MiB/day so I guess in most setups it 
> would take quite a few days until it becomes noticeable. I'm able to see the 
> same type of slow growth in other production clusters even though the graph 
> data is more noisy.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
