[jira] [Comment Edited] (CASSANDRA-20226) Reduce contention in NativeAllocator.allocate

Dmitry Konstantinov (Jira) Sat, 18 Oct 2025 13:01:01 -0700


    [ https://issues.apache.org/jira/browse/CASSANDRA-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18028335#comment-18028335 ]


Dmitry Konstantinov edited comment on CASSANDRA-20226 at 10/8/25 11:38 AM:
---------------------------------------------------------------------------

Draft MR: [https://github.com/apache/cassandra/pull/4415]
h2. Perf test results
h3. Load

1 partition text column, 1 clustering text column, 5 value text columns, 
inserts are done using 10-row batches.

cassandra-stress "user profile=./batch_profile.yaml no-warmup 
ops(insert=1,partition-select=0) n=10m" -rate threads=100 -node <IP>
h3. Test environment
 * 1 Cassandra server node = m8i.4xlarge (16 vCPU, 64 GiB RAM, EBS)
 * cassandra-stress = c5.9xlarge

h3. Test results

!test_results_m8i.4xlarge_offheap_objects.png|width=800!

[^test_results_m8i.4xlarge_offheap_objects.html]
h4. Before
{code:java}
Results:
Op rate                   :   87,199 op/s  [insert: 87,199 op/s]
Partition rate            :   87,199 pk/s  [insert: 87,199 pk/s]
Row rate                  :  871,988 row/s [insert: 871,988 row/s]
Latency mean              :    1.1 ms [insert: 1.1 ms]
Latency median            :    0.9 ms [insert: 0.9 ms]
Latency 95th percentile   :    2.1 ms [insert: 2.1 ms]
Latency 99th percentile   :    4.2 ms [insert: 4.2 ms]
Latency 99.9th percentile :   22.1 ms [insert: 22.1 ms]
Latency max               :  270.8 ms [insert: 270.8 ms]
Total partitions          : 10,000,000 [insert: 10,000,000]
Total errors              :          0 [insert: 0]
Total GC count            : 0
Total GC memory           : 0 B
Total GC time             :    0.0 seconds
Avg GC time               :    NaN ms
StdDev GC time            :    0.0 ms
Total operation time      : 00:01:54
{code}
h4. After
{code:java}
Results:
Op rate                   :  139,918 op/s  [insert: 139,918 op/s]
Partition rate            :  139,918 pk/s  [insert: 139,918 pk/s]
Row rate                  : 1,399,177 row/s [insert: 1,399,177 row/s]
Latency mean              :    0.7 ms [insert: 0.7 ms]
Latency median            :    0.5 ms [insert: 0.5 ms]
Latency 95th percentile   :    0.9 ms [insert: 0.9 ms]
Latency 99th percentile   :    2.2 ms [insert: 2.2 ms]
Latency 99.9th percentile :   22.2 ms [insert: 22.2 ms]
Latency max               :  290.5 ms [insert: 290.5 ms]
Total partitions          : 10,000,000 [insert: 10,000,000]
Total errors              :          0 [insert: 0]
Total GC count            : 0
Total GC memory           : 0 B
Total GC time             :    0.0 seconds
Avg GC time               :    NaN ms
StdDev GC time            :    0.0 ms
Total operation time      : 00:01:11
{code}
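
To put the two result blocks above side by side: the op rate goes from 87,199 op/s in the first run to 139,918 op/s in the second, and the total operation time from 00:01:54 to 00:01:11. That is roughly a 1.6x throughput ratio and about 38% less wall time. The small helper below is only an illustrative sketch of that arithmetic; the class and method names are hypothetical and are not part of cassandra-stress.

```java
import java.time.Duration;

/** Hypothetical helper for comparing two cassandra-stress runs; not part of any tool. */
final class StressRunComparison {
    /** Ratio of the second run's op rate to the first run's op rate. */
    static double throughputRatio(double firstOpsPerSec, double secondOpsPerSec) {
        return secondOpsPerSec / firstOpsPerSec;
    }

    /** Fraction by which the total operation time shrank from the first run to the second. */
    static double timeReduction(Duration first, Duration second) {
        return 1.0 - (double) second.getSeconds() / first.getSeconds();
    }
}
```

With the figures reported above, throughputRatio(87_199, 139_918) is about 1.60 and timeReduction(Duration.ofSeconds(114), Duration.ofSeconds(71)) is about 0.38.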
h3. Configuration changes

The changes are applied for both tests, before and after:
{code:java}
-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints # to improve accuracy of async profiler
-Dio.netty.eventLoopThreads=2 # we do not need as many as the default (2 * CPU cores)
memtable_allocation_type: offheap_objects
memtable:
  configurations:
    skiplist:
      class_name: SkipListMemtable
    trie:
      class_name: TrieMemtable
      parameters:
             shards: 32
    default:
      inherits: trie 

commitlog_disk_access_mode: direct

native_transport_max_request_data_in_flight: 1024MiB
native_transport_max_request_data_in_flight_per_ip: 1024MiB
{code}
The commit log is on a separate volume so that it is not a bottleneck.
Compaction is +disabled+.



> Reduce contention in NativeAllocator.allocate
> ---------------------------------------------
>
>                 Key: CASSANDRA-20226
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20226
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Local/Memtable
>            Reporter: Dmitry Konstantinov
>            Assignee: Dmitry Konstantinov
>            Priority: Normal
>             Fix For: 5.x
>
>         Attachments: 5.1_batch_LongAdder.html, 5.1_batch_addAndGet.html, 
> 5.1_batch_alloc_batching.html, 5.1_batch_baseline.html, 
> 5.1_batch_pad_allocated.html, cpu_profile_batch.html, 
> image-2025-01-20-23-38-58-896.png, profile.yaml, 
> test_results_m8i.4xlarge_offheap_objects.html, 
> test_results_m8i.4xlarge_offheap_objects.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> For a high insert batch rate it looks like we have a bottleneck in 
> NativeAllocator.allocate probably caused by contention within the logic.
> !image-2025-01-20-23-38-58-896.png|width=300!
> [^cpu_profile_batch.html]
> The logic has at least the following 2 potential places to assess:
>  # allocation cycle in MemtablePool.SubPool#tryAllocate. This logic has a 
> while loop with a CAS, which can be inefficient under high contention. 
> Similar to CASSANDRA-15922, we can try to replace it with addAndGet (need to 
> check that it does not break the allocator logic)
>  # swap region logic in NativeAllocator.trySwapRegion (under a high insert 
> rate 1MiB regions can be swapped quite frequently)
> Reproducing test details:
>  * test logic
> {code:java}
> ./tools/bin/cassandra-stress "user profile=./profile.yaml no-warmup 
> ops(insert=1) n=10m" -rate threads=100  -node somenode
> {code}
>  * Cassandra version: 5.0.3
>  * configuration changes compared to default:
> {code:java}
> memtable_allocation_type: offheap_objects
> memtable:
>   configurations:
>     skiplist:
>       class_name: SkipListMemtable
>     trie:
>       class_name: TrieMemtable
>       parameters:
>              shards: 32
>     default:
>       inherits: trie 
> {code}
>  * 1 node cluster
>  * OpenJDK jdk-17.0.12+7
>  * Linux kernel: 4.18.0-240.el8.x86_64
>  * CPU: 16 cores, Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
>  * RAM: 46GiB
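
The first item in the issue description contrasts a CAS retry loop with addAndGet. A minimal sketch of the two patterns is shown here, assuming a simple bounded counter. BudgetSketch and its methods are hypothetical names for illustration only; this is not the code of MemtablePool.SubPool#tryAllocate and does not cover whatever extra bookkeeping the real allocator does.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical, simplified bounded budget; a stand-in, not Cassandra's MemtablePool. */
final class BudgetSketch {
    private final long limit;
    private final AtomicLong allocated = new AtomicLong();

    BudgetSketch(long limit) {
        this.limit = limit;
    }

    /** CAS-loop variant: re-reads and retries whenever another thread wins the race. */
    boolean tryAllocateCas(long size) {
        while (true) {
            long cur = allocated.get();
            long next = cur + size;
            if (next > limit)
                return false;
            if (allocated.compareAndSet(cur, next))
                return true;
        }
    }

    /** addAndGet variant: one unconditional atomic add, undone if it overshoots the limit. */
    boolean tryAllocateAdd(long size) {
        long next = allocated.addAndGet(size);
        if (next <= limit)
            return true;
        // Roll back; other threads may briefly observe a value above the limit.
        allocated.addAndGet(-size);
        return false;
    }

    long allocated() {
        return allocated.get();
    }
}
```

The CAS variant never lets the counter exceed the limit but may spin under contention. The addAndGet variant always completes in a bounded number of atomic operations, at the cost of a transient overshoot that concurrent readers can see, which is the kind of behaviour change the issue says needs to be checked against the allocator logic.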


