[jira] [Commented] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2024-06-05 Thread Brad Schoening (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17852522#comment-17852522
 ] 

Brad Schoening commented on CASSANDRA-18762:


[~brandon.williams] we haven't deployed 4.1.5 to production yet, which is where 
we saw the issues, but it looks promising.

> Repair triggers OOM with direct buffer memory
> -
>
> Key: CASSANDRA-18762
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18762
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Brad Schoening
>Priority: Normal
>  Labels: OutOfMemoryError
> Attachments: Cluster-dm-metrics-1.PNG, 
> image-2023-12-06-15-28-05-459.png, image-2023-12-06-15-29-31-491.png, 
> image-2023-12-06-15-58-55-007.png
>
>
> We are seeing repeated failures of nodes with 16GB of heap on a VM with 32GB 
> of physical RAM due to direct buffer memory exhaustion. This seems to be 
> related to CASSANDRA-15202, which moved Merkle trees off-heap in 4.0. Using 
> Cassandra 4.0.6 with Java 11.
> {noformat}
> 2023-08-09 04:30:57,470 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e55a3b0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_a from 
> /169.102.200.241:7000
> 2023-08-09 04:30:57,567 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e0d2900-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from 
> /169.93.192.29:7000
> 2023-08-09 04:30:57,568 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e1dcad0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_c from 
> /169.104.171.134:7000
> 2023-08-09 04:30:57,591 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e69a0e0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from 
> /169.79.232.67:7000
> 2023-08-09 04:30:57,876 [INFO ] [Service Thread] cluster_id=101 
> ip_address=169.0.0.1 GCInspector.java:294 - G1 Old Generation GC in 282ms. 
> Compressed Class Space: 8444560 -> 8372152; G1 Eden Space: 7809794048 -> 0; 
> G1 Old Gen: 1453478400 -> 820942800; G1 Survivor Space: 419430400 -> 0; 
> Metaspace: 80411136 -> 80176528
> 2023-08-09 04:30:58,387 [ERROR] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 JVMStabilityInspector.java:102 - OutOfMemory error 
> letting the JVM handle the error:
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
> at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
> at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:318)
> at org.apache.cassandra.utils.MerkleTree.allocate(MerkleTree.java:742)
> at 
> org.apache.cassandra.utils.MerkleTree.deserializeOffHeap(MerkleTree.java:780)
> at org.apache.cassandra.utils.MerkleTree.deserializeTree(MerkleTree.java:751)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:720)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:698)
> at 
> org.apache.cassandra.utils.MerkleTrees$MerkleTreesSerializer.deserialize(MerkleTrees.java:416)
> at 
> org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:100)
> at 
> org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:84)
> at 
> org.apache.cassandra.net.Message$Serializer.deserializePost40(Message.java:782)
> at org.apache.cassandra.net.Message$Serializer.deserialize(Message.java:642)
> at 
> org.apache.cassandra.net.InboundMessageHandler$LargeMessage.deserialize(InboundMessageHandler.java:364)
> at 
> org.apache.cassandra.net.InboundMessageHandler$LargeMessage.access$1100(InboundMessageHandler.java:317)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessLargeMessage.provideMessage(InboundMessageHandler.java:504)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:429)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}
>  
> -XX:+AlwaysPreTouch
> -XX:+CrashOnOutOfMemoryError
> -XX:+ExitOnOutOfMemoryError
> -XX:+HeapDumpOnOutOfMemoryError
> -XX:+ParallelRefProcEnabled
> -XX:+PerfDisableSharedMem
> -XX:+ResizeTLAB
> -XX:+UseG1GC
> -XX:+UseNUMA
> -XX:+UseTLAB
> -XX:+UseThreadPriorities

[jira] [Commented] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2024-06-05 Thread Brandon Williams (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17852388#comment-17852388
 ] 

Brandon Williams commented on CASSANDRA-18762:
--

Does CASSANDRA-19336 not solve this?


[jira] [Commented] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2024-06-04 Thread Brad Schoening (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17852270#comment-17852270
 ] 

Brad Schoening commented on CASSANDRA-18762:


Note, using the JVM option 
{code:java}
-Dio.netty.leakDetection.level=advanced{code}
 might help to diagnose this.
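
Alongside leak detection, direct buffer usage can also be watched from inside the JVM. The following is a minimal sketch that assumes only the standard {{java.lang.management}} API; the class name {{DirectBufferUsage}} and its helper method are hypothetical and are not part of Cassandra or Netty.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

// Hypothetical helper (not Cassandra code): reads the JVM's "direct" buffer pool.
final class DirectBufferUsage {

    /** Bytes currently reserved by direct ByteBuffers, or -1 if the pool is not reported. */
    static long directBytesUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            // The platform pool backing ByteBuffer.allocateDirect is named "direct".
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();
        ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20); // 1 MiB off-heap
        long after = directBytesUsed();
        System.out.println("direct pool before=" + before + " after=" + after
                           + " (holding " + buffer.capacity() + " bytes)");
    }
}
```

Logging this value periodically while a repair runs would show whether direct memory climbs as merkle trees arrive.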

 


[jira] [Commented] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2024-03-21 Thread Manish Khandelwal (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829743#comment-17829743
 ] 

Manish Khandelwal commented on CASSANDRA-18762:
---

I think the reason for the OOM here is the same as the one described in 
https://issues.apache.org/jira/browse/CASSANDRA-19336. I applied the patch for 
CASSANDRA-19336 and all full repairs with -pr on the keyspace were successful.

Without this patch, a single repair triggered almost 240 sessions (vnodes: 
256, 11*11 cluster), resulting in 240*6 merkle tree requests for one table. 
For a keyspace with 3 tables this number was an astonishing 240*6*3, 
resulting in a direct byte buffer OOM within a minute of running.

After applying the patch, repairs ran without issue and with no memory 
pressure.
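
To make the numbers above concrete, here is a small sketch of the arithmetic. The figures (240 sessions, a factor of 6, 3 tables) are taken from the comment as given; the comment does not explain what the factor of 6 is, and the class and method names are hypothetical.

```java
// Hypothetical helper: multiplies out the merkle tree request counts quoted above.
final class RepairTreeCount {

    /** Total merkle tree requests for the given number of sessions, trees per session, and tables. */
    static long treeRequests(long sessions, long treesPerSession, long tables) {
        return Math.multiplyExact(Math.multiplyExact(sessions, treesPerSession), tables);
    }

    public static void main(String[] args) {
        System.out.println("one table:    " + treeRequests(240, 6, 1)); // 1440
        System.out.println("three tables: " + treeRequests(240, 6, 3)); // 4320
    }
}
```

Each of those requests yields a tree that is deserialized into a direct buffer on the coordinator, which is why the count matters more than the size of any single tree.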


[jira] [Commented] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2024-03-06 Thread Brad Schoening (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824144#comment-17824144
 ] 

Brad Schoening commented on CASSANDRA-18762:


It seems that if allocate fails in deserializeOffHeap due to a lack of off-heap 
memory, it could fall back to deserializeOnHeap.

 

{{    private static ByteBuffer allocate(int innerNodeCount, IPartitioner partitioner)}}
{{    {}}
{{        int size = offHeapBufferSize(innerNodeCount, partitioner);}}
{{        logger.debug("Allocating direct buffer of size {} for an off-heap merkle tree", size);}}
{{        ByteBuffer buffer = ByteBuffer.allocateDirect(size);}}
{{        if (Ref.DEBUG_ENABLED)}}
{{            MemoryUtil.setAttachment(buffer, new Ref.DirectBufferRef<>(null, null));}}
{{        return buffer;}}
{{    }}}

{{    private static Node deserializeTree(DataInputPlus in, IPartitioner partitioner, int innerNodeCount, boolean offHeapRequested, int version) throws IOException}}
{{    {}}
{{        return shouldUseOffHeapTrees(partitioner, offHeapRequested)}}
{{             ? deserializeOffHeap(in, partitioner, innerNodeCount, version)}}
{{             : OnHeapNode.deserialize(in, partitioner, version);}}
{{    }}}
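
The fallback idea can be sketched in isolation. This is only an illustration of the allocation pattern, not a patch: the class {{FallbackAllocator}} and its methods are hypothetical, and in Cassandra the on-heap and off-heap trees use different node representations, so a real fallback would have to switch deserialization paths rather than just swap the buffer. The direct allocator is passed in as a parameter so that the failure case can be simulated.

```java
import java.nio.ByteBuffer;
import java.util.function.IntFunction;

// Hypothetical sketch (not Cassandra code): try off-heap first, fall back to the heap.
final class FallbackAllocator {

    /** Allocates with the given direct allocator; on OutOfMemoryError returns a heap buffer instead. */
    static ByteBuffer allocateWithFallback(int size, IntFunction<ByteBuffer> directAllocator) {
        try {
            return directAllocator.apply(size);
        } catch (OutOfMemoryError e) {
            // ByteBuffer.allocateDirect throws OutOfMemoryError when the direct memory limit is reached.
            return ByteBuffer.allocate(size);
        }
    }

    /** Convenience overload that uses the real direct allocator. */
    static ByteBuffer allocate(int size) {
        return allocateWithFallback(size, ByteBuffer::allocateDirect);
    }

    public static void main(String[] args) {
        ByteBuffer normal = allocate(1024);
        System.out.println("normal path direct=" + normal.isDirect());

        // Simulate exhausted direct memory to exercise the fallback branch.
        ByteBuffer fallback = allocateWithFallback(1024, n -> {
            throw new OutOfMemoryError("Direct buffer memory");
        });
        System.out.println("fallback path direct=" + fallback.isDirect());
    }
}
```

Catching OutOfMemoryError is usually discouraged; it is arguably tolerable here only because the direct buffer limit is a separate budget from the heap, and the heap must still have room for the fallback allocation.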


[jira] [Commented] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2024-02-26 Thread Brad Schoening (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820761#comment-17820761
 ] 

Brad Schoening commented on CASSANDRA-18762:


[~manmagic3] yes, we are using vnodes where num_tokens = 16.


[jira] [Commented] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2024-02-26 Thread Manish Khandelwal (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820725#comment-17820725
 ] 

Manish Khandelwal commented on CASSANDRA-18762:
---

[~bschoeni] were vnodes enabled on the 4-DC cluster when you ran the parallel 
repair and got the direct buffer OOM? Also, how many vnodes were configured?

> Repair triggers OOM with direct buffer memory
> -
>
> Key: CASSANDRA-18762
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18762
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Brad Schoening
>Priority: Normal
>  Labels: OutOfMemoryError
> Attachments: Cluster-dm-metrics-1.PNG, 
> image-2023-12-06-15-28-05-459.png, image-2023-12-06-15-29-31-491.png, 
> image-2023-12-06-15-58-55-007.png
>
>
> We are seeing repeated failures of nodes with 16GB of heap on a VM with 32GB 
> of physical RAM due to direct memory.  This seems to be related to 
> CASSANDRA-15202 which moved Merkel trees off-heap in 4.0.   Using Cassandra 
> 4.0.6 with Java 11.
> {noformat}
> 2023-08-09 04:30:57,470 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e55a3b0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_a from 
> /169.102.200.241:7000
> 2023-08-09 04:30:57,567 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e0d2900-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from 
> /169.93.192.29:7000
> 2023-08-09 04:30:57,568 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e1dcad0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_c from 
> /169.104.171.134:7000
> 2023-08-09 04:30:57,591 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e69a0e0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from 
> /169.79.232.67:7000
> 2023-08-09 04:30:57,876 [INFO ] [Service Thread] cluster_id=101 
> ip_address=169.0.0.1 GCInspector.java:294 - G1 Old Generation GC in 282ms. 
> Compressed Class Space: 8444560 -> 8372152; G1 Eden Space: 7809794048 -> 0; 
> G1 Old Gen: 1453478400 -> 820942800; G1 Survivor Space: 419430400 -> 0; 
> Metaspace: 80411136 -> 80176528
> 2023-08-09 04:30:58,387 [ERROR] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 JVMStabilityInspector.java:102 - OutOfMemory error 
> letting the JVM handle the error:
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
> at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
> at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:318)
> at org.apache.cassandra.utils.MerkleTree.allocate(MerkleTree.java:742)
> at 
> org.apache.cassandra.utils.MerkleTree.deserializeOffHeap(MerkleTree.java:780)
> at org.apache.cassandra.utils.MerkleTree.deserializeTree(MerkleTree.java:751)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:720)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:698)
> at 
> org.apache.cassandra.utils.MerkleTrees$MerkleTreesSerializer.deserialize(MerkleTrees.java:416)
> at 
> org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:100)
> at 
> org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:84)
> at 
> org.apache.cassandra.net.Message$Serializer.deserializePost40(Message.java:782)
> at org.apache.cassandra.net.Message$Serializer.deserialize(Message.java:642)
> at 
> org.apache.cassandra.net.InboundMessageHandler$LargeMessage.deserialize(InboundMessageHandler.java:364)
> at 
> org.apache.cassandra.net.InboundMessageHandler$LargeMessage.access$1100(InboundMessageHandler.java:317)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessLargeMessage.provideMessage(InboundMessageHandler.java:504)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:429)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}
>  
> -XX:+AlwaysPreTouch
> -XX:+CrashOnOutOfMemoryError
> -XX:+ExitOnOutOfMemoryError
> -XX:+HeapDumpOnOutOfMemoryError
> -XX:+ParallelRefProcEnabled
> 

[jira] [Commented] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2024-02-15 Thread Manish Khandelwal (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817832#comment-17817832
 ] 

Manish Khandelwal commented on CASSANDRA-18762:
---

Another update: raising -XX:MaxDirectMemorySize to 10G (more than the heap, 
which is 8G) allowed repairs to run successfully on multiple nodes, but 
failures are still happening on some nodes. We will evaluate CASSANDRA-19336. 
We ignored it at first because its description refers to repairs run without 
-pr, whereas we are running full repairs with the -pr option.
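
For anyone who wants to watch direct buffer usage while trying a different 
-XX:MaxDirectMemorySize, here is a minimal sketch using the standard 
BufferPoolMXBean. It is not taken from Cassandra; the class and method names 
(DirectMemoryProbe, directBytesUsed) are hypothetical. It only sees buffers 
created through ByteBuffer.allocateDirect, which is the path shown in the stack 
trace in this issue.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Cassandra.
class DirectMemoryProbe {
    /** Returns bytes used by the JVM's "direct" buffer pool, or -1 if no such pool is reported. */
    static long directBytesUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();
        // Allocate 1 MiB off-heap, the same call that fails in the reported OOM.
        ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20);
        long after = directBytesUsed();
        System.out.println("direct pool before=" + before + " after=" + after
                + " (holding " + buffer.capacity() + " bytes)");
    }
}
```

The same bean is exposed over JMX, so the value can also be graphed from a 
running node without any code changes.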

[jira] [Commented] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2024-02-15 Thread Brandon Williams (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817645#comment-17817645
 ] 

Brandon Williams commented on CASSANDRA-18762:
--

bq. Tried setting -XX:MaxDirectMemorySize but results are same

Then this is probably not your issue, and it is more likely something like 
CASSANDRA-19336.

[jira] [Commented] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2024-02-14 Thread Manish Khandelwal (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817590#comment-17817590
 ] 

Manish Khandelwal commented on CASSANDRA-18762:
---

We are also seeing the same issue on a multi-DC setup. With a single DC of 11 
nodes things run fine, but once another DC is added repairs start to fail 
pretty quickly, with the same error as described in this issue. Running 
repair table by table seems to succeed most of the time, but keyspace-level 
repair always fails for one of the keyspaces. That keyspace has three 
tables, all STCS, with one table having almost no data. We tried setting 
*-XX:MaxDirectMemorySize* but the result is the same, i.e., out of memory. We 
are on Java 8 and Cassandra 4.0.10. I think this should be easy to reproduce 
with multi-DC.

[jira] [Commented] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2024-02-14 Thread Brad Schoening (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817421#comment-17817421
 ] 

Brad Schoening commented on CASSANDRA-18762:


It seems setting -XX:MaxDirectMemorySize might be useful to prevent this.
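
To check what a node was actually started with, a small sketch along these 
lines may help. It reads the JVM input arguments and looks for 
-XX:MaxDirectMemorySize. The class and method names are hypothetical, and the 
fallback message rests on my understanding that HotSpot, when the flag is 
absent, uses a limit close to the maximum heap size; please treat that as an 
assumption.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper, not part of Cassandra.
class DirectMemoryLimit {
    static final String FLAG = "-XX:MaxDirectMemorySize=";

    /** Parses a JVM size such as "10G", "512m" or "1048576" into bytes. */
    static long parseSize(String text) {
        String s = text.trim().toLowerCase();
        long multiplier = 1L;
        char last = s.charAt(s.length() - 1);
        if (last == 'k') multiplier = 1L << 10;
        else if (last == 'm') multiplier = 1L << 20;
        else if (last == 'g') multiplier = 1L << 30;
        else if (last == 't') multiplier = 1L << 40;
        if (multiplier != 1L) s = s.substring(0, s.length() - 1);
        return Long.parseLong(s) * multiplier;
    }

    /** Returns the explicitly configured limit in bytes, or -1 if the flag was not passed. */
    static long configuredLimit(List<String> jvmArgs) {
        long limit = -1L;
        for (String arg : jvmArgs) {
            if (arg.startsWith(FLAG)) {
                limit = parseSize(arg.substring(FLAG.length())); // a later occurrence overrides an earlier one
            }
        }
        return limit;
    }

    public static void main(String[] args) {
        long limit = configuredLimit(ManagementFactory.getRuntimeMXBean().getInputArguments());
        if (limit >= 0) {
            System.out.println("MaxDirectMemorySize set to " + limit + " bytes");
        } else {
            // Assumption: without the flag the limit is roughly the max heap size.
            System.out.println("flag not set; max heap is " + Runtime.getRuntime().maxMemory() + " bytes");
        }
    }
}
```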

[jira] [Commented] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2023-12-06 Thread Brad Schoening (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793935#comment-17793935
 ] 

Brad Schoening commented on CASSANDRA-18762:


An update: we are still seeing this occur on a cluster. They have configured 
native_transport_max_threads = 256. A large number of received repair Merkle 
trees precedes the OOM crash.

System: Apache Cassandra 4.0.10, a 16GB heap, 64GB RAM, 8 vCPUs

file_cache_size_in_mb = 4096, G1HeapRegionSize = 16M

!image-2023-12-06-15-58-55-007.png!

The graph above is missing time ticks, but the spike occurs at 06:16:00.

!image-2023-12-06-15-29-31-491.png!

 

Summary of the Cassandra log:

11:17:10,289  [INFO ]  RepairSession.java:202 - [repair 
#838c24c0-935f-11ee-97ba-d79b6a12ccbe] Received merkle tree for table1 from ... 
 [repeated 35 times]
11:17:17,155 [INFO ] [Service Thread] cluster_id=99 ip_address=10.0.0.1  
GCInspector.java:294 - G1 Old Generation GC in 694ms.  G1 Eden Space: 
8925478912 -> 0; G1 Old Gen: 2196360784 -> 1133473904; G1 Survivor Space: 
385875968 -> 0; 
11:17:17,668 [INFO ] [Service Thread] cluster_id=99 ip_address=10.0.0.1  
GCInspector.java[repeated 35 times]:294 - G1 Old Generation GC in 505ms.  G1 
Old Gen: 1133473904 -> 1133526408; 
11:17:22,420 [INFO ] [ScheduledTasks:1] cluster_id=99 ip_address=10.0.0.1  
NoSpamLogger.java:92 - Some operations were slow, details available at debug 
level (debug.log)
11:17:22,417 [INFO ] [Service Thread] cluster_id=99 ip_address=10.0.0.1  
GCInspector.java:294 - G1 Old Generation GC in 787ms.  G1 Eden Space: 16777216 
-> 0; G1 Old Gen: 1133526408 -> 1133545448; 
11:17:23,213 [WARN ] [Service Thread] cluster_id=99 ip_address=10.0.0.1  
GCInspector.java:292 - G1 Old Generation GC in 4742ms.  G1 Old Gen: 1133545448 
-> 1133581144; 
11:17:23,217 [INFO ] [Service Thread] cluster_id=99 ip_address=10.0.0.1  
StatusLogger.java:65  [elided]
11:17:24,114 [INFO ] [Service Thread] cluster_id=99 ip_address=10.0.0.1  
GCInspector.java:294 - G1 Old Generation GC in 853ms.  G1 Eden Space: 117440512 
-> 0; G1 Old Gen: 1133747360 -> 1133758448; 
11:17:24,564 [ERROR] [Messaging-EventLoop-3-5] cluster_id=99 
ip_address=10.0.0.1  JVMStabilityInspector.java:102 - OutOfMemory error letting 
the JVM handle the error:
java.lang.OutOfMemoryError: Direct buffer memory
    at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
    at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
    at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:318)

    ... etc
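
To make the GC numbers above easier to read, here is a small sketch that pulls 
the "G1 Old Gen: before -> after" pair out of a GCInspector line and reports 
how many bytes the collection reclaimed. The class and method names are 
hypothetical, and it assumes each log entry has been joined back onto a single 
line (the entries above are wrapped by the mail client).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Cassandra.
class OldGenDelta {
    private static final Pattern OLD_GEN = Pattern.compile("G1 Old Gen: (\\d+) -> (\\d+)");

    /** Returns bytes reclaimed from G1 Old Gen (negative if it grew), or null if the line has no such entry. */
    static Long reclaimedBytes(String logLine) {
        Matcher m = OLD_GEN.matcher(logLine);
        if (!m.find()) {
            return null;
        }
        return Long.parseLong(m.group(1)) - Long.parseLong(m.group(2));
    }

    public static void main(String[] args) {
        String[] lines = {
            "G1 Old Generation GC in 694ms.  G1 Eden Space: 8925478912 -> 0; "
                + "G1 Old Gen: 2196360784 -> 1133473904; G1 Survivor Space: 385875968 -> 0;",
            "G1 Old Generation GC in 505ms.  G1 Old Gen: 1133473904 -> 1133526408;"
        };
        for (String line : lines) {
            System.out.println(reclaimedBytes(line));
        }
    }
}
```

For the two entries used here, the first collection reclaims 1062886880 bytes 
and the second gives -52504, i.e. old gen grew slightly. The later Old 
Generation collections in the summary likewise free nothing from old gen while 
the heap sits near 1.1GB, which suggests the pressure is not on the heap itself.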

[jira] [Commented] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2023-11-15 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17786596#comment-17786596
 ] 

Paulo Motta commented on CASSANDRA-18762:
-

Thanks for the follow-up. I will close this for now; please re-open if you 
observe the issue after 4.0.10.

> Repair triggers OOM with direct buffer memory
> -
>
> Key: CASSANDRA-18762
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18762
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Brad Schoening
>Priority: Normal
>  Labels: OutOfMemoryError
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
> Attachments: Cluster-dm-metrics-1.PNG
>
>
> We are seeing repeated failures of nodes with 16GB of heap and the same size 
> (16GB) for direct memory (derived from -Xms).  This seems to be related to 
> CASSANDRA-15202 which moved Merkle trees off-heap in 4.0.   Using Cassandra 
> 4.0.6.
> {noformat}
> 2023-08-09 04:30:57,470 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e55a3b0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_a from 
> /169.102.200.241:7000
> 2023-08-09 04:30:57,567 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e0d2900-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from 
> /169.93.192.29:7000
> 2023-08-09 04:30:57,568 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e1dcad0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_c from 
> /169.104.171.134:7000
> 2023-08-09 04:30:57,591 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e69a0e0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from 
> /169.79.232.67:7000
> 2023-08-09 04:30:57,876 [INFO ] [Service Thread] cluster_id=101 
> ip_address=169.0.0.1 GCInspector.java:294 - G1 Old Generation GC in 282ms. 
> Compressed Class Space: 8444560 -> 8372152; G1 Eden Space: 7809794048 -> 0; 
> G1 Old Gen: 1453478400 -> 820942800; G1 Survivor Space: 419430400 -> 0; 
> Metaspace: 80411136 -> 80176528
> 2023-08-09 04:30:58,387 [ERROR] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 JVMStabilityInspector.java:102 - OutOfMemory error 
> letting the JVM handle the error:
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
> at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
> at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:318)
> at org.apache.cassandra.utils.MerkleTree.allocate(MerkleTree.java:742)
> at 
> org.apache.cassandra.utils.MerkleTree.deserializeOffHeap(MerkleTree.java:780)
> at org.apache.cassandra.utils.MerkleTree.deserializeTree(MerkleTree.java:751)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:720)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:698)
> at 
> org.apache.cassandra.utils.MerkleTrees$MerkleTreesSerializer.deserialize(MerkleTrees.java:416)
> at 
> org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:100)
> at 
> org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:84)
> at 
> org.apache.cassandra.net.Message$Serializer.deserializePost40(Message.java:782)
> at org.apache.cassandra.net.Message$Serializer.deserialize(Message.java:642)
> at 
> org.apache.cassandra.net.InboundMessageHandler$LargeMessage.deserialize(InboundMessageHandler.java:364)
> at 
> org.apache.cassandra.net.InboundMessageHandler$LargeMessage.access$1100(InboundMessageHandler.java:317)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessLargeMessage.provideMessage(InboundMessageHandler.java:504)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:429)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}
>  
> -XX:+AlwaysPreTouch
> -XX:+CrashOnOutOfMemoryError
> -XX:+ExitOnOutOfMemoryError
> -XX:+HeapDumpOnOutOfMemoryError
> -XX:+ParallelRefProcEnabled
> -XX:+PerfDisableSharedMem
> -XX:+ResizeTLAB
> -XX:+UseG1GC
> -XX:+UseNUMA
> -XX:+UseTLAB
> -XX:+UseThreadPriorities
> 

[jira] [Commented] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2023-11-15 Thread Brad Schoening (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17786594#comment-17786594
 ] 

Brad Schoening commented on CASSANDRA-18762:


[~paulo] I have not been able to confirm for certain that this resolved the 
issue, because users who hit the problem on 4.0.6 are reluctant to upgrade 
(FUD).  But I have yet to see the issue in the 100+ prod clusters we have on 
4.0.10, and I am inclined to close this after 3 months of successful experience 
with the memory allocation changes between 4.0.6 and 4.0.10.

> Repair triggers OOM with direct buffer memory
> -
>
> Key: CASSANDRA-18762
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18762
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Repair
> Reporter: Brad Schoening
> Priority: Normal
> Labels: OutOfMemoryError
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
> Attachments: Cluster-dm-metrics-1.PNG
>
>
> We are seeing repeated failures of nodes with 16GB of heap and the same size 
> (16GB) for direct memory (derived from -Xms).  This seems to be related to 
> CASSANDRA-15202, which moved Merkle trees off-heap in 4.0.  Using Cassandra 
> 4.0.6.
> {noformat}
> 2023-08-09 04:30:57,470 [INFO ] [AntiEntropyStage:1] cluster_id=101 ip_address=169.0.0.1 RepairSession.java:202 - [repair #5e55a3b0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_a from /169.102.200.241:7000
> 2023-08-09 04:30:57,567 [INFO ] [AntiEntropyStage:1] cluster_id=101 ip_address=169.0.0.1 RepairSession.java:202 - [repair #5e0d2900-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from /169.93.192.29:7000
> 2023-08-09 04:30:57,568 [INFO ] [AntiEntropyStage:1] cluster_id=101 ip_address=169.0.0.1 RepairSession.java:202 - [repair #5e1dcad0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_c from /169.104.171.134:7000
> 2023-08-09 04:30:57,591 [INFO ] [AntiEntropyStage:1] cluster_id=101 ip_address=169.0.0.1 RepairSession.java:202 - [repair #5e69a0e0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from /169.79.232.67:7000
> 2023-08-09 04:30:57,876 [INFO ] [Service Thread] cluster_id=101 ip_address=169.0.0.1 GCInspector.java:294 - G1 Old Generation GC in 282ms. Compressed Class Space: 8444560 -> 8372152; G1 Eden Space: 7809794048 -> 0; G1 Old Gen: 1453478400 -> 820942800; G1 Survivor Space: 419430400 -> 0; Metaspace: 80411136 -> 80176528
> 2023-08-09 04:30:58,387 [ERROR] [AntiEntropyStage:1] cluster_id=101 ip_address=169.0.0.1 JVMStabilityInspector.java:102 - OutOfMemory error letting the JVM handle the error:
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
> at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
> at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:318)
> at org.apache.cassandra.utils.MerkleTree.allocate(MerkleTree.java:742)
> at org.apache.cassandra.utils.MerkleTree.deserializeOffHeap(MerkleTree.java:780)
> at org.apache.cassandra.utils.MerkleTree.deserializeTree(MerkleTree.java:751)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:720)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:698)
> at org.apache.cassandra.utils.MerkleTrees$MerkleTreesSerializer.deserialize(MerkleTrees.java:416)
> at org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:100)
> at org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:84)
> at org.apache.cassandra.net.Message$Serializer.deserializePost40(Message.java:782)
> at org.apache.cassandra.net.Message$Serializer.deserialize(Message.java:642)
> at org.apache.cassandra.net.InboundMessageHandler$LargeMessage.deserialize(InboundMessageHandler.java:364)
> at org.apache.cassandra.net.InboundMessageHandler$LargeMessage.access$1100(InboundMessageHandler.java:317)
> at org.apache.cassandra.net.InboundMessageHandler$ProcessLargeMessage.provideMessage(InboundMessageHandler.java:504)
> at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:429)
> at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}
>  
> -XX:+AlwaysPreTouch
> 
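
The trace in the quoted description fails inside ByteBuffer.allocateDirect, 
which draws from the JVM's "direct" buffer pool. A minimal sketch (not 
Cassandra code) of how that pool can be observed from inside a JVM follows. 
The class name DirectMemoryProbe and the 1 MiB allocation are illustrative 
assumptions; BufferPoolMXBean and Runtime.maxMemory() are standard JDK APIs. 
As far as I know, on HotSpot the direct memory limit defaults to roughly the 
maximum heap size when -XX:MaxDirectMemorySize is not set, which would be 
consistent with the 16GB figure in the description.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

// Hypothetical helper for illustration only; not part of Cassandra.
final class DirectMemoryProbe {

    /** Bytes currently used by the JVM's "direct" buffer pool, or -1 if no such pool is exposed. */
    static long directMemoryUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        // Assumption: with -XX:MaxDirectMemorySize unset, HotSpot uses about this value as the limit.
        long assumedCeiling = Runtime.getRuntime().maxMemory();

        long before = directMemoryUsed();
        // Stand-in for an off-heap allocation such as a deserialized Merkle tree.
        ByteBuffer offHeap = ByteBuffer.allocateDirect(1 << 20);
        long after = directMemoryUsed();

        System.out.printf("assumed direct memory ceiling: %d bytes%n", assumedCeiling);
        System.out.printf("direct pool used before: %d, after: %d (buffer capacity %d)%n",
                          before, after, offHeap.capacity());
    }
}
```

Polling the same pool (for example over JMX) while a repair runs would show 
whether direct memory climbs toward the ceiling before the OutOfMemoryError.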

[jira] [Commented] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2023-11-15 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17786588#comment-17786588
 ] 

Paulo Motta commented on CASSANDRA-18762:

[~bschoeni] Did you confirm CASSANDRA-16681 fixes this issue?


[jira] [Commented] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2023-08-17 Thread Brad Schoening (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755650#comment-17755650
 ] 

Brad Schoening commented on CASSANDRA-18762:


We're going to test this change in 4.0.7:

4.0.7
* Fix multiple BufferPool bugs (CASSANDRA-16681)


[jira] [Commented] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2023-08-15 Thread Brad Schoening (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754776#comment-17754776
 ] 

Brad Schoening commented on CASSANDRA-18762:


83 user tables on one cluster, just 9 on the other.  Each has 4 DCs and uses 
parallel repair.


[jira] [Commented] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2023-08-15 Thread Brandon Williams (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754774#comment-17754774
 ] 

Brandon Williams commented on CASSANDRA-18762:

If you have a lot of tables this may be CASSANDRA-17787.  What kind of repair 
are you performing?
