[ https://issues.apache.org/jira/browse/CASSANDRA-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781779#comment-16781779 ]
Jonas Borgström commented on CASSANDRA-15006:
---------------------------------------------

Thanks [~benedict]! Awesome work, your analysis sounds very reasonable!

I checked the logs, but unfortunately these servers only keep logs for 5 days, so the logs from the node startups are long since lost. I did, however, find a bunch of "INFO Maximum memory usage reached (531628032), cannot allocate chunk of 1048576" log entries, pretty much one every hour on the hour, which probably corresponds to the hourly Cassandra snapshots taken on each node.

Do you have any idea what the source of these "objects with arbitrary lifetimes" is? And why does it (at least in my tests) appear to increase linearly forever? If they are somehow related to repairs, I would assume they would not increase from one repair to the next?

Also, regarding your proposed workaround for 3.11.x of lowering the chunk cache and buffer pool settings: would that "fix" the problem, or simply buy some more time until the process runs out of memory? I guess that instead of lowering those two settings, simply raising the configured memory limit from 3 GiB to 4 or 5 GiB without changing the heap size setting would work equally well? I have no problem with raising my (rather low) memory limit if I knew I would end up with a setup that will not run out of memory no matter how long it keeps running.

Again, thanks for your help!

> Possible java.nio.DirectByteBuffer leak
> ---------------------------------------
>
> Key: CASSANDRA-15006
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15006
> Project: Cassandra
> Issue Type: Bug
> Environment: cassandra: 3.11.3
> jre: openjdk version "1.8.0_181"
> heap size: 2GB
> memory limit: 3GB (cgroup)
> I started one of the nodes with "-Djdk.nio.maxCachedBufferSize=262144" but
> that did not seem to make any difference.
> Reporter: Jonas Borgström
> Priority: Major
> Attachments: CASSANDRA-15006-reference-chains.png,
> Screenshot_2019-02-04 Grafana - Cassandra.png, Screenshot_2019-02-14 Grafana
> - Cassandra(1).png, Screenshot_2019-02-14 Grafana - Cassandra.png,
> Screenshot_2019-02-15 Grafana - Cassandra.png, Screenshot_2019-02-22 Grafana
> - Cassandra.png, Screenshot_2019-02-25 Grafana - Cassandra.png,
> cassandra.yaml, cmdline.txt
>
> While testing a 3-node 3.11.3 cluster I noticed that the nodes were suddenly
> killed by the Linux OOM killer after running without issues for 4-5 weeks.
> After enabling more metrics and leaving the nodes running for 12 days, it
> sure looks like the "java.nio:type=BufferPool,name=direct" MBean shows very
> linear growth (approx 15 MiB/24h, see attached screenshot). Is this expected
> to keep growing linearly after 12 days with a constant load?
>
> In my setup the growth/leak is about 15 MiB/day, so I guess in most setups
> it would take quite a few days until it becomes noticeable. I'm able to see
> the same type of slow growth in other production clusters, even though the
> graph data is noisier.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
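For context on the "Maximum memory usage reached (531628032)" log line: 531628032 bytes is roughly 512 MiB, which matches the default chunk cache size in 3.11, controlled by file_cache_size_in_mb in cassandra.yaml. A minimal sketch of the lowered-settings workaround discussed above, assuming the 3.11 setting names; the values are illustrative examples, not recommendations:

```yaml
# Hypothetical example values -- tune against your own heap size and
# cgroup memory limit budget.
# Off-heap chunk cache; 3.11 defaults to min(512, 1/4 * heap) MiB.
file_cache_size_in_mb: 256
# If the off-heap buffer pool is exhausted, fall back to on-heap
# allocation instead of failing (3.11 default is true).
buffer_pool_use_heap_if_exhausted: true
```

Whether lowering these caps "fixes" the growth or merely delays the OOM depends on whether the leaked buffers are accounted against these pools, which is exactly the open question in this thread.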
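The linear growth described above was observed through the java.nio:type=BufferPool,name=direct MBean; the same counters can be read in-process via the platform BufferPoolMXBean. A minimal, self-contained sketch (the class name and single-sample approach are illustrative, not part of Cassandra):

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class DirectBufferMonitor {
    /**
     * Returns the bytes currently used by the "direct" buffer pool
     * (the same value exposed by java.nio:type=BufferPool,name=direct),
     * or -1 if the pool is not found.
     */
    static long directBufferBytes() {
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Single sample; to spot a slow leak like the ~15 MiB/day here,
        // poll this periodically and graph the trend over days.
        System.out.println("direct buffer pool bytes: " + directBufferBytes());
    }
}
```

Polling this value over time (e.g. from a metrics exporter) is how a slow, linear direct-memory leak becomes visible long before the cgroup limit is hit.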