[
https://issues.apache.org/jira/browse/CASSANDRA-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18036790#comment-18036790
]
Dmitry Konstantinov commented on CASSANDRA-20226:
-------------------------------------------------
Hi [~blambov], thank you for the idea, it is really interesting. I think it is
complementary to the current changes and can be applied on top of them. While
I have tuned the accounting logic a bit, the main idea of the current changes
is to predict and pre-allocate memory for the incoming mutation. This not only
reduces accounting overhead by invoking the accounting logic less frequently,
but also gives better locality for the allocated cells in slabs/off-heap memory
(so it is more CPU cache friendly). It is therefore useful even if we remove
the accounting overhead from the per-cell allocation logic.
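To illustrate the difference between per-cell accounting and a single predicted reservation per mutation, here is a minimal sketch. It is not the code from the patch: the class `BatchedAccountingSketch`, its methods, and the assumption that the size prediction is exact are all hypothetical and chosen only for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not Cassandra's MemtableAllocator.
final class BatchedAccountingSketch {
    // Shared counter that every writer thread would contend on.
    private final AtomicLong owned = new AtomicLong();
    // Instrumentation for this single-threaded demo: how often the shared counter was touched.
    private long sharedUpdates = 0;

    private void account(long bytes) {
        owned.addAndGet(bytes);
        sharedUpdates++;
    }

    // Per-cell accounting: one update of the shared counter for every cell.
    void writePerCell(int[] cellSizes) {
        for (int size : cellSizes) {
            account(size);
        }
    }

    // Batched accounting: predict the mutation size, reserve once, then place
    // the cells back to back inside the reservation with a local bump pointer.
    // Assumption: the prediction is exact; a real implementation would have to
    // handle under- and over-estimation.
    long[] writeBatched(int[] cellSizes) {
        long predicted = 0;
        for (int size : cellSizes) {
            predicted += size;
        }
        account(predicted);

        long[] offsets = new long[cellSizes.length];
        long next = 0; // local to this mutation, no contention
        for (int i = 0; i < cellSizes.length; i++) {
            offsets[i] = next; // cells are contiguous within the reservation
            next += cellSizes[i];
        }
        return offsets;
    }

    long owned() { return owned.get(); }

    long sharedUpdates() { return sharedUpdates; }

    public static void main(String[] args) {
        int[] cells = {16, 32, 64};

        BatchedAccountingSketch perCell = new BatchedAccountingSketch();
        perCell.writePerCell(cells);
        System.out.println("per-cell shared updates: " + perCell.sharedUpdates());

        BatchedAccountingSketch batched = new BatchedAccountingSketch();
        batched.writeBatched(cells);
        System.out.println("batched shared updates: " + batched.sharedUpdates());
    }
}
```

Both variants account for the same number of bytes, but the batched one touches the shared counter once per mutation instead of once per cell, and the cells of one mutation end up next to each other.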
{quote}
This means that the limit will be breached, but this also happens as it stands
now because we will permit operations to run to completion if the memtable they
have been marked for is scheduled for a flush
{quote}
I suppose the exception is the case where we have writes to several tables in
parallel and we are flushing one (the largest) memtable but blocking
allocation in the others.
But in general I agree: it looks beneficial to split the allocation itself
from limit checking/blocking.
Currently, memory allocation itself (in byte buffers or native slabs) and
accounting (to initiate a flush and to pause writes) are tightly coupled, so it
may take some time to restructure this logic and decouple them. My suggestion
is to extract this improvement into a separate story (which I can take), so as
not to over-complicate the current change or delay it too much. If that is OK,
I will create the story.
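As a rough picture of what the decoupling could look like, the sketch below separates a non-blocking hot path that only records allocated bytes from a limit check that can run at a coarser granularity. This is an assumption about the shape of the follow-up work, not an existing Cassandra API: `DecoupledLimitSketch` and all of its methods are hypothetical names.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch of separating allocation accounting from limit checks.
final class DecoupledLimitSketch {
    private final long limitBytes;
    // LongAdder spreads updates over several cells, so concurrent writers
    // contend less than on a single AtomicLong.
    private final LongAdder allocated = new LongAdder();

    DecoupledLimitSketch(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    // Hot path: never blocks and never checks the limit, it only records.
    void recordAllocation(long bytes) {
        allocated.add(bytes);
    }

    // Called when memory is reclaimed, for example after a flush.
    void release(long bytes) {
        allocated.add(-bytes);
    }

    // Coarse-grained check, for example once per mutation before it starts.
    // The limit can be temporarily exceeded by operations already in flight.
    boolean overLimit() {
        return allocated.sum() > limitBytes;
    }

    public static void main(String[] args) {
        DecoupledLimitSketch sketch = new DecoupledLimitSketch(100);
        sketch.recordAllocation(60);
        System.out.println("over limit after 60 bytes: " + sketch.overLimit());
        sketch.recordAllocation(60);
        System.out.println("over limit after 120 bytes: " + sketch.overLimit());
        sketch.release(60);
        System.out.println("over limit after release: " + sketch.overLimit());
    }
}
```

In this shape, the decision to flush or to pause new writes would be driven by `overLimit()`, while operations already running keep allocating without waiting.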
> Reduce contention in MemtableAllocator.allocate
> -----------------------------------------------
>
> Key: CASSANDRA-20226
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20226
> Project: Apache Cassandra
> Issue Type: Improvement
> Components: Local/Memtable
> Reporter: Dmitry Konstantinov
> Assignee: Dmitry Konstantinov
> Priority: Normal
> Fix For: 5.x
>
> Attachments: 5.1_batch_LongAdder.html, 5.1_batch_addAndGet.html,
> 5.1_batch_alloc_batching.html, 5.1_batch_baseline.html,
> 5.1_batch_pad_allocated.html, CASSANDRA-20226_ci_summary.htm,
> CASSANDRA-20226_results_details.tar.xz,
> ci_summary_netudima_CASSANDRA-20226-trunk_52.html, cpu_profile_batch.html,
> image-2025-01-20-23-38-58-896.png, image-2025-11-10-00-04-57-497.png,
> profile.yaml, results_details_netudima_CASSANDRA-20226-trunk_52.tar.xz,
> test_results_m8i.4xlarge_heap_buffers.html,
> test_results_m8i.4xlarge_heap_buffers.png,
> test_results_m8i.4xlarge_offheap_objects.html,
> test_results_m8i.4xlarge_offheap_objects.png
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> For a high insert batch rate it looks like we have a bottleneck in
> NativeAllocator.allocate probably caused by contention within the logic.
> !image-2025-01-20-23-38-58-896.png|width=300!
> [^cpu_profile_batch.html]
> The logic has at least the following 2 potential places to assess:
> # allocation cycle in MemtablePool.SubPool#tryAllocate. This logic has a
> while loop with a CAS, which can be inefficient under high contention;
> similar to CASSANDRA-15922, we can try to replace it with addAndGet (need to
> check that it does not break the allocator logic)
> # swap region logic in NativeAllocator.trySwapRegion (under a high insert
> rate 1MiB regions can be swapped quite frequently)
> Reproducing test details:
> * test logic
> {code:java}
> ./tools/bin/cassandra-stress "user profile=./profile.yaml no-warmup
> ops(insert=1) n=10m" -rate threads=100 -node somenode
> {code}
> * Cassandra version: 5.0.3
> * configuration changes compared to default:
> {code:java}
> memtable_allocation_type: offheap_objects
> memtable:
>   configurations:
>     skiplist:
>       class_name: SkipListMemtable
>     trie:
>       class_name: TrieMemtable
>       parameters:
>         shards: 32
>     default:
>       inherits: trie
> {code}
> * 1 node cluster
> * OpenJDK jdk-17.0.12+7
> * Linux kernel: 4.18.0-240.el8.x86_64
> * CPU: 16 cores, Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
> * RAM: 46GiB