[ https://issues.apache.org/jira/browse/CASSANDRA-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781711#comment-16781711 ]

Benedict commented on CASSANDRA-15006:
--------------------------------------

Thanks [~jborgstrom].

After some painful VisualVM usage (OQL is powerful but horrible), it looks like 
my initial thoughts were on the money (a sketch of the helper predicates used in 
these queries follows the list):
 # {{sum(heap.objects('java.nio.DirectByteBuffer', 'true', '!isFileBacked(it) && isOwnerOfMemory(it)'), 'it.capacity')}}
 ** Total DirectByteBuffer capacity where the buffer is not a slice of another buffer and is not backed by a file descriptor
 ** 25th: 5.8236923E8 (~555 MiB)
 ** 29th: 6.39534433E8 (~610 MiB)
 # {{sum(heap.objects('java.nio.DirectByteBuffer', 'true', '!isFileBacked(it) && isOwnerOfMemory(it) && isInNettyPool(it)'), 'it.capacity')}}
 ** Total DirectByteBuffer capacity where the buffer is in Netty's pool
 ** 25th: 3.3554432E7 (32 MiB)
 ** 29th: 3.3554432E7 (32 MiB)
 # {{sum(heap.objects('java.nio.DirectByteBuffer', 'true', '!isFileBacked(it) && isOwnerOfMemory(it) && !isInChunkCache(it) && isHintsBuffer(it)'), 'it.capacity')}}
 ** Total DirectByteBuffer capacity where the buffer is used for Hints
 ** 25th: 3.3554432E7 (32 MiB)
 ** 29th: 3.3554432E7 (32 MiB)
 # {{sum(heap.objects('java.nio.DirectByteBuffer', 'true', '!isFileBacked(it) && isOwnerOfMemory(it) && isMaybeMacroChunk(it)'), 'it.capacity')}}
 ** Total DirectByteBuffer capacity where the buffer is very likely a {{BufferPool}} macro chunk
 ** 25th: 5.14756608E8 (~491 MiB)
 ** 29th: 5.33704704E8 (~509 MiB)
 # {{sum(heap.objects('java.nio.DirectByteBuffer', 'true', '!isFileBacked(it) && isOwnerOfMemory(it) && isInChunkCache(it)'), 'it.capacity')}}
 ** Total DirectByteBuffer capacity where the buffer is in the chunk cache but is not managed by the {{BufferPool}}
 ** 25th: 0
 ** 29th: 3.8076416E7 (~36 MiB)
 # {{sum(heap.objects('java.nio.DirectByteBuffer', 'true', '!isFileBacked(it) && isOwnerOfMemory(it) && !isMaybeMacroChunk(it) && !isInNettyPoolOrChunkCache(it) && !isHintsBuffer(it)'), 'it.capacity')}}
 ** Total DirectByteBuffer capacity that is not explained by any of the above and is not used for hints (which use a stable 32 MiB)
 ** 25th: 503758.0 (~492 KiB)
 ** 29th: 644449.0 (~629 KiB)
 # {{sum(heap.objects('java.nio.DirectByteBuffer', 'true', '!isFileBacked(it) && !isOwnerOfMemory(it) && isInChunkCache(it)'), 'it.capacity')}}
 ** Total DirectByteBuffer capacity where the buffer is in the chunk cache and _is_ managed by the {{BufferPool}}
 ** 25th: 4.72383488E8 (~450 MiB)
 ** 29th: 4.10779648E8 (~392 MiB)
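
For anyone wanting to repeat this on their own dump: the predicates above ({{isFileBacked}}, {{isOwnerOfMemory}}, {{isInNettyPool}}, etc.) are not OQL built-ins; they are helper functions written as plain JavaScript in the OQL console before the query expression. Below is only a minimal sketch of how such predicates can be written, not the exact definitions behind the numbers above: the field checks rely on {{java.nio.DirectByteBuffer}}'s {{fd}}, {{att}} and {{cleaner}} fields in a JDK 8 dump, and the referrer-based checks are rough approximations.

{code:javascript}
// Illustrative helper predicates for the VisualVM OQL console (plain JavaScript).
// Field names assume a JDK 8 heap dump; the referrer-based checks below are
// approximations, not the exact definitions used for the numbers above.

// Backed by a file descriptor, i.e. a mapped buffer rather than malloc'd memory.
function isFileBacked(buf) {
  return buf.fd != null;
}

// Owns its native memory: slices/duplicates carry no Cleaner and instead
// reference their source buffer through the "att" field.
function isOwnerOfMemory(buf) {
  return buf.cleaner != null && buf.att == null;
}

// Rough membership test: does any direct referrer's class name contain the
// given substring? (A more faithful check would walk the reference chain further.)
function referredFrom(buf, needle) {
  var found = false;
  forEachReferrer(function (ref) {
    if (('' + classof(ref).name).indexOf(needle) != -1) { found = true; }
  }, buf);
  return found;
}

function isInNettyPool(buf)  { return referredFrom(buf, 'io.netty.buffer'); }
function isInChunkCache(buf) { return referredFrom(buf, 'ChunkCache'); }
function isHintsBuffer(buf)  { return referredFrom(buf, 'hints'); }

// Example use, mirroring query 1 above:
sum(heap.objects('java.nio.DirectByteBuffer', 'true',
                 '!isFileBacked(it) && isOwnerOfMemory(it)'),
    'it.capacity')
{code}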

So, basically, the ChunkCache is beginning to allocate memory directly because 
the BufferPool has run out of space.  It has run out of space because it was 
never intended to be used for objects with arbitrary lifetimes.
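
If anyone wants to sanity-check that conclusion on their own dump, a query along these lines (again only a sketch, not one of the queries above) lists each self-owned, non-file-backed buffer together with the class names of its direct referrers, so the extra capacity can be eyeballed against chunk-cache-related classes (under {{org.apache.cassandra.cache}}) rather than {{BufferPool}} chunks:

{code:javascript}
// For every DirectByteBuffer that owns its memory and is not file-backed,
// print its capacity together with the class names of its direct referrers.
map(
  filter(heap.objects('java.nio.DirectByteBuffer', 'true'),
         'it.fd == null && it.cleaner != null'),
  function (buf) {
    var names = [];
    forEachReferrer(function (ref) { names.push('' + classof(ref).name); }, buf);
    return buf.capacity + ' : ' + names.join(', ');
  })
{code}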

This was already on my radar as something to address, but I expect it won't be 
addressed for a couple of months, and I don't know which versions a fix will 
target.  The 3.0.x line should not have this problem.  If you have yet to go 
live, I would recommend using 3.0.x.  Otherwise, lower your chunk cache and 
buffer pool settings.
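
For reference, on 3.11 the setting that bounds this memory is (as far as I can tell) {{file_cache_size_in_mb}} in cassandra.yaml; it defaults to the smaller of 1/4 of the heap and 512MiB, and both the chunk cache and the buffer pool draw from it. An illustrative value only, not a recommendation:

{code}
# cassandra.yaml -- illustrative only; pick a value that fits your memory budget.
# Default is the smaller of 1/4 of the heap and 512MiB.
file_cache_size_in_mb: 256
{code}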

> Possible java.nio.DirectByteBuffer leak
> ---------------------------------------
>
>                 Key: CASSANDRA-15006
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15006
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: cassandra: 3.11.3
> jre: openjdk version "1.8.0_181"
> heap size: 2GB
> memory limit: 3GB (cgroup)
> I started one of the nodes with "-Djdk.nio.maxCachedBufferSize=262144" but 
> that did not seem to make any difference.
>            Reporter: Jonas Borgström
>            Priority: Major
>         Attachments: CASSANDRA-15006-reference-chains.png, 
> Screenshot_2019-02-04 Grafana - Cassandra.png, Screenshot_2019-02-14 Grafana 
> - Cassandra(1).png, Screenshot_2019-02-14 Grafana - Cassandra.png, 
> Screenshot_2019-02-15 Grafana - Cassandra.png, Screenshot_2019-02-22 Grafana 
> - Cassandra.png, Screenshot_2019-02-25 Grafana - Cassandra.png, 
> cassandra.yaml, cmdline.txt
>
>
> While testing a 3 node 3.11.3 cluster I noticed that the nodes were suddenly 
> killed by the Linux OOM killer after running without issues for 4-5 weeks.
> After enabling more metrics and leaving the nodes running for 12 days it sure 
> looks like the
> "java.nio:type=BufferPool,name=direct" Mbean shows a very linear growth 
> (approx 15MiB/24h, see attached screenshot). Is this expected to keep growing 
> linearly after 12 days with a constant load?
>  
> In my setup the growth/leak is about 15MiB/day so I guess in most setups it 
> would take quite a few days until it becomes noticeable. I'm able to see the 
> same type of slow growth in other production clusters even though the graph 
> data is more noisy.


