Hi Vlastimil Babka,
On 10/16/2025 9:45 PM, Vlastimil Babka wrote:
On 10/16/25 17:16, D, Suneeth wrote:
Hi Vlastimil Babka,
On 9/10/2025 1:31 PM, Vlastimil Babka wrote:
Setup the maple_node_cache with percpu sheaves of size 32 to hopefully
improve its performance. Note this will not immediately take advantage
of sheaf batching of kfree_rcu() operations due to the maple tree using
call_rcu with custom callbacks. The followup changes to maple tree will
change that and also make use of the prefilled sheaves functionality.
We run the will-it-scale-process-mmap2 micro-benchmark as part of our weekly
CI for kernel performance regression testing between a stable and an rc
kernel. In this week's run we observed a severe regression on AMD platforms
(Turin and Bergamo) when running the micro-benchmark on kernels v6.17 and
v6.18-rc1, in the range of 12-13% (Turin) and 22-26% (Bergamo). Bisecting
further landed me on this commit
(59faa4da7cd4565cbce25358495556b75bb37022) as the first bad commit. The
machine configurations and test parameters used were as follows:
Model name: AMD EPYC 128-Core Processor [Bergamo]
Thread(s) per core: 2
Core(s) per socket: 128
Socket(s): 1
Total online memory: 258G
Model name: AMD EPYC 64-Core Processor [Turin]
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 1
Total online memory: 258G
Test params:
nr_task: [1 8 64 128 192 256]
mode: process
test: mmap2
kpi: per_process_ops
cpufreq_governor: performance
The following are the stats after bisection (the KPI used here is
per_process_ops):
kernel_versions                                          per_process_ops
---------------                                          ---------------
v6.17.0                                                  258291
v6.18.0-rc1                                              225839
v6.17.0-rc3-59faa4da7                                    212152
v6.17.0-rc3-3accabda4da1 (one commit before bad commit)  265054
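For reference, the relative changes implied by the bisection numbers above can be worked out with a short script. This is only a sketch of the arithmetic: the per_process_ops values are copied from the table, while the `results` dictionary and the `pct_change` helper are names made up for illustration.

```python
# per_process_ops values copied from the bisection table above
results = {
    "v6.17.0": 258291,
    "v6.18.0-rc1": 225839,
    "v6.17.0-rc3-59faa4da7": 212152,      # first bad commit
    "v6.17.0-rc3-3accabda4da1": 265054,   # one commit before the bad commit
}


def pct_change(new, old):
    """Relative change of `new` versus `old`, in percent (hypothetical helper)."""
    return (new - old) / old * 100.0


# First bad commit compared with its parent commit
bad_vs_parent = pct_change(results["v6.17.0-rc3-59faa4da7"],
                           results["v6.17.0-rc3-3accabda4da1"])

# Release candidate compared with the stable kernel
rc_vs_stable = pct_change(results["v6.18.0-rc1"], results["v6.17.0"])

print(f"59faa4da7 vs parent: {bad_vs_parent:+.1f}%")   # -20.0%
print(f"v6.18-rc1 vs v6.17:  {rc_vs_stable:+.1f}%")    # -12.6%
```

By this arithmetic the bad commit is about 20% below its parent, and v6.18-rc1 is about 12.6% below v6.17.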
Thanks for the info. Is there any difference if you increase the
sheaf_capacity in the commit from 32 to a higher value? For example 120 to
match what the automatically calculated cpu partial slabs target would be.
I think there's a lock contention on the barn lock causing the regression.
By matching the cpu partial slabs value we should have same batching factor
for the barn lock as there would be on the node list_lock before sheaves.
Thanks.
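The batching argument above can be illustrated with a deliberately simplified model. It assumes that every exchange of a full sheaf costs exactly one acquisition of the shared lock, which ignores the other paths in the real slab code, so it only shows how the batching factor scales with capacity. The function name and the object count `N` are hypothetical.

```python
import math


def shared_lock_acquisitions(n_objects, batch_size):
    """Acquisitions of a shared lock needed to move n_objects when each
    acquisition transfers one batch of batch_size objects (simplified model)."""
    return math.ceil(n_objects / batch_size)


N = 1_000_000  # hypothetical number of maple_node objects passing through one CPU

for capacity in (32, 120):
    print(f"capacity={capacity:3d}: {shared_lock_acquisitions(N, capacity)} acquisitions")

# Under this model, going from 32 to 120 cuts acquisitions by this factor
ratio = 120 / 32
print(f"reduction factor: {ratio}")  # 3.75
```

Under these assumptions a capacity of 120 takes the shared lock roughly 3.75 times less often than a capacity of 32 for the same number of objects.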
I changed sheaf_capacity from 32 to 120 and tested it. With the
will-it-scale-mmap2-process testcase, the numbers improve by around 28%
w.r.t. the baseline (v6.17).
v6.17.0 (w/o sheaf)  %diff  v6.18-rc1 (sheaf=32)  %diff  v6.18-rc1 (sheaf=120)
-------------------  -----  --------------------  -----  ---------------------
260222               -13    225839                +28    334079
Thanks.
Recreation steps:
1) git clone https://github.com/antonblanchard/will-it-scale.git
2) git clone https://github.com/intel/lkp-tests.git
3) cd will-it-scale && git apply
lkp-tests/programs/will-it-scale/pkg/will-it-scale.patch
4) make
5) python3 runtest.py mmap2 25 process 0 0 1 8 64 128 192 256
NOTE: Step 5 is specific to the machine's architecture. The list starting
from 1 is the array of task counts you wish to run the testcase with, which
here corresponds to the number of cores per CCX, per NUMA node / per socket,
and nr_threads.
I also ran the micro-benchmark under tools/testing/perf record, and the
following data was collected:
# perf diff perf.data.old perf.data
No kallsyms or vmlinux with build-id 0fc9c7b62ade1502af5d6a060914732523f367ef was found
Warning:
43 out of order events recorded.
Warning:
54 out of order events recorded.
# Event 'cycles:P'
#
# Baseline Delta Abs Shared Object Symbol
# ........ ......... ...................... ..............................................
#
+51.51% [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
+14.39% [kernel.kallsyms] [k] perf_iterate_ctx
+2.52% [kernel.kallsyms] [k] unmap_page_range
+1.75% [kernel.kallsyms] [k] mas_wr_node_store
+1.47% [kernel.kallsyms] [k] __pi_memset
+1.38% [kernel.kallsyms] [k] mt_free_rcu
+1.36% [kernel.kallsyms] [k] free_pgd_range
+1.10% [kernel.kallsyms] [k] __pi_memcpy
+0.96% [kernel.kallsyms] [k] __kmem_cache_alloc_bulk
+0.92% [kernel.kallsyms] [k] __mmap_region
+0.79% [kernel.kallsyms] [k] mas_empty_area_rev
+0.74% [kernel.kallsyms] [k] __cond_resched
+0.73% [kernel.kallsyms] [k] mas_walk
+0.59% [kernel.kallsyms] [k] mas_pop_node
+0.57% [kernel.kallsyms] [k] perf_event_mmap_output
+0.49% [kernel.kallsyms] [k] mas_find
+0.48% [kernel.kallsyms] [k] mas_next_slot
+0.46% [kernel.kallsyms] [k] kmem_cache_free
+0.42% [kernel.kallsyms] [k] mas_leaf_max_gap
+0.42% [kernel.kallsyms] [k] __call_rcu_common.constprop.0
+0.39% [kernel.kallsyms] [k] entry_SYSCALL_64
+0.38% [kernel.kallsyms] [k] mas_prev_slot
+0.38% [kernel.kallsyms] [k] kmem_cache_alloc_noprof
+0.37% [kernel.kallsyms] [k] mas_store_gfp
Reviewed-by: Sidhartha Kumar <[email protected]>
Reviewed-by: Suren Baghdasaryan <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
lib/maple_tree.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 4f0e30b57b0cef9e5cf791f3f64f5898752db402..d034f170ac897341b40cfd050b6aee86b6d2cf60 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -6040,9 +6040,14 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
 void __init maple_tree_init(void)
 {
+	struct kmem_cache_args args = {
+		.align = sizeof(struct maple_node),
+		.sheaf_capacity = 32,
+	};
+
 	maple_node_cache = kmem_cache_create("maple_node",
-			sizeof(struct maple_node), sizeof(struct maple_node),
-			SLAB_PANIC, NULL);
+			sizeof(struct maple_node), &args,
+			SLAB_PANIC);
 }
 /**
---
Thanks and Regards
Suneeth D