Brad Schoening created CASSANDRA-18762:
------------------------------------------
Summary: Repair triggers OOM with direct buffer memory
Key: CASSANDRA-18762
URL: https://issues.apache.org/jira/browse/CASSANDRA-18762
Project: Cassandra
Issue Type: Bug
Reporter: Brad Schoening
Attachments: Cluster-dm-metrics-1.PNG
We are seeing repeated failures of nodes with 16GB of heap and the same size
(16GB) for direct memory, which by default is derived from the maximum heap
size (-Xmx). This seems to be related to
[CASSANDRA-15202|https://issues.apache.org/jira/browse/CASSANDRA-15202], which
moved Merkle trees off-heap in 4.0.
{noformat}
2023-08-09 04:30:57,470 [INFO ] [AntiEntropyStage:1] cluster_id=101
ip_address=169.0.0.1 RepairSession.java:202 - [repair
#5e55a3b0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_a from
/169.102.200.241:7000
2023-08-09 04:30:57,567 [INFO ] [AntiEntropyStage:1] cluster_id=101
ip_address=169.0.0.1 RepairSession.java:202 - [repair
#5e0d2900-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from
/169.93.192.29:7000
2023-08-09 04:30:57,568 [INFO ] [AntiEntropyStage:1] cluster_id=101
ip_address=169.0.0.1 RepairSession.java:202 - [repair
#5e1dcad0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_c from
/169.104.171.134:7000
2023-08-09 04:30:57,591 [INFO ] [AntiEntropyStage:1] cluster_id=101
ip_address=169.0.0.1 RepairSession.java:202 - [repair
#5e69a0e0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from
/169.79.232.67:7000
2023-08-09 04:30:57,876 [INFO ] [Service Thread] cluster_id=101
ip_address=169.0.0.1 GCInspector.java:294 - G1 Old Generation GC in 282ms.
Compressed Class Space: 8444560 -> 8372152; G1 Eden Space: 7809794048 -> 0; G1
Old Gen: 1453478400 -> 820942800; G1 Survivor Space: 419430400 -> 0; Metaspace:
80411136 -> 80176528
2023-08-09 04:30:58,387 [ERROR] [AntiEntropyStage:1] cluster_id=101
ip_address=169.0.0.1 JVMStabilityInspector.java:102 - OutOfMemory error letting
the JVM handle the error:
java.lang.OutOfMemoryError: Direct buffer memory
at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:318)
at org.apache.cassandra.utils.MerkleTree.allocate(MerkleTree.java:742)
at org.apache.cassandra.utils.MerkleTree.deserializeOffHeap(MerkleTree.java:780)
at org.apache.cassandra.utils.MerkleTree.deserializeTree(MerkleTree.java:751)
at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:720)
at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:698)
at org.apache.cassandra.utils.MerkleTrees$MerkleTreesSerializer.deserialize(MerkleTrees.java:416)
at org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:100)
at org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:84)
at org.apache.cassandra.net.Message$Serializer.deserializePost40(Message.java:782)
at org.apache.cassandra.net.Message$Serializer.deserialize(Message.java:642)
at org.apache.cassandra.net.InboundMessageHandler$LargeMessage.deserialize(InboundMessageHandler.java:364)
at org.apache.cassandra.net.InboundMessageHandler$LargeMessage.access$1100(InboundMessageHandler.java:317)
at org.apache.cassandra.net.InboundMessageHandler$ProcessLargeMessage.provideMessage(InboundMessageHandler.java:504)
at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:429)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:834)
{noformat}
-XX:+AlwaysPreTouch
-XX:+CrashOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ParallelRefProcEnabled
-XX:+PerfDisableSharedMem
-XX:+ResizeTLAB
-XX:+UseG1GC
-XX:+UseNUMA
-XX:+UseTLAB
-XX:+UseThreadPriorities
-XX:-UseBiasedLocking
-XX:CompileCommandFile=/opt/nosql/clusters/cassandra-101/conf/hotspot_compiler
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:G1ReservePercent=20
-XX:HeapDumpPath=/opt/nosql/data/cluster_101/cassandra-1691623098-pid2804737.hprof
-XX:InitiatingHeapOccupancyPercent=70
-XX:MaxGCPauseMillis=200
-XX:StringTableSize=60013
-Xlog:gc*:file=/opt/nosql/clusters/cassandra-101/logs/gc.log:time,uptime:filecount=10,filesize=10485760
-Xms16G
-Xmx16G
-Xss256k
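The flag list above contains no -XX:MaxDirectMemorySize, which would explain why the direct memory ceiling matches the 16GB heap: as far as I know, HotSpot then falls back to roughly Runtime.maxMemory(). The following is a minimal sketch for confirming this on a running JVM. The class name DirectMemoryLimit and the helper explicitLimit are hypothetical and are not part of Cassandra.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper, not part of Cassandra: reports whether an explicit
// direct memory limit was passed to the JVM.
class DirectMemoryLimit {
    static final String FLAG = "-XX:MaxDirectMemorySize=";

    /** Returns the value given to -XX:MaxDirectMemorySize, or null if the flag is absent. */
    static String explicitLimit(List<String> jvmArgs) {
        String value = null;
        for (String arg : jvmArgs) {
            if (arg.startsWith(FLAG)) {
                value = arg.substring(FLAG.length()); // keep the last occurrence
            }
        }
        return value;
    }

    public static void main(String[] args) {
        // Arguments the JVM was started with, e.g. the -XX and -X flags listed above.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        String explicit = explicitLimit(jvmArgs);
        if (explicit != null) {
            System.out.println("Explicit direct memory limit: " + explicit);
        } else {
            // Assumption: without the flag, the ceiling defaults to about the max heap size.
            System.out.println("No explicit limit; default ceiling is about "
                    + Runtime.getRuntime().maxMemory() + " bytes (Runtime.maxMemory())");
        }
    }
}
```

With the flags shown in this report, the sketch would be expected to take the second branch.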
From our Prometheus metrics, direct buffer memory ramps up until it reaches
the maximum and then causes an OOM. It appears that direct memory is never
released by the JVM until it is exhausted.
!Cluster-dm-metrics.PNG!
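For anyone who wants to reproduce the graph without Prometheus, the same figure can be read from the JVM's "direct" buffer pool through the standard BufferPoolMXBean. This is only a sketch: the class name DirectBufferProbe is hypothetical, and I am assuming (not confirming) that our exporter reads the same MXBean.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

// Hypothetical probe, not part of Cassandra: reads the JVM's direct buffer pool usage.
class DirectBufferProbe {
    /** Returns bytes currently used by the "direct" buffer pool, or -1 if the pool is not found. */
    static long directMemoryUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directMemoryUsed();
        // Allocate 1 MiB off-heap, the same kind of allocation MerkleTree.allocate performs.
        ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20);
        long after = directMemoryUsed();
        System.out.println("direct pool before: " + before + " bytes");
        System.out.println("direct pool after:  " + after + " bytes");
        // Use the buffer after measuring so it stays reachable during the reading.
        System.out.println("buffer capacity:    " + buf.capacity() + " bytes");
    }
}
```

The pool only shrinks once the owning ByteBuffer objects are garbage collected and their cleaners run, so sampling this value during a repair should show whether the Merkle tree buffers are actually being freed.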