On 27/11/2025 11:38, Jon Hunter wrote:
On 31/10/2025 21:32, Daniel Gomez wrote:
On 10/09/2025 10.01, Vlastimil Babka wrote:
Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
For caches with sheaves, on each cpu maintain a rcu_free sheaf in
addition to main and spare sheaves.
kfree_rcu() operations will try to put objects on this sheaf. Once full,
the sheaf is detached and submitted to call_rcu() with a handler that
will try to put it in the barn, or flush to slab pages using bulk free,
when the barn is full. Then a new empty sheaf must be obtained to put
more objects there.
It's possible that no free sheaves are available to use for a new
rcu_free sheaf, and the allocation in kfree_rcu() context can only use
GFP_NOWAIT and thus may fail. In that case, fall back to the existing
kfree_rcu() implementation.
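The flow described above can be modeled in user space as a rough sketch. This is not the kernel code: `struct model`, `model_kfree_rcu()`, `model_grace_period()`, the fixed-size pool and the `SHEAF_CAPACITY` / `BARN_MAX_FULL` limits are all hypothetical names and values chosen for illustration. The two `stat_*` fields are only loosely modeled on the statistics counters this patch adds, and an array of pending sheaves stands in for call_rcu().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SHEAF_CAPACITY 4  /* hypothetical; chosen small for the example */
#define BARN_MAX_FULL  2  /* hypothetical limit on full sheaves in the barn */
#define MAX_SHEAVES    8  /* hypothetical size of the backing pool */

struct sheaf {
	size_t size;
	void *objects[SHEAF_CAPACITY];
};

struct model {
	struct sheaf *rcu_free;             /* current rcu_free sheaf, may be NULL */
	struct sheaf pool[MAX_SHEAVES];     /* stands in for the sheaf allocator */
	size_t pool_used;
	bool alloc_fails;                   /* simulates a failing GFP_NOWAIT allocation */
	struct sheaf *pending[MAX_SHEAVES]; /* sheaves waiting for a grace period */
	size_t nr_pending;
	size_t barn_full;                   /* full sheaves currently held by the barn */
	size_t flushed_objects;             /* objects bulk-freed back to slab pages */
	size_t stat_free_rcu_sheaf;         /* frees that went through the sheaf */
	size_t stat_free_rcu_sheaf_fail;    /* frees that had to fall back */
};

/* Hand out an empty sheaf, or NULL when the simulated allocation fails. */
static struct sheaf *get_empty_sheaf(struct model *m)
{
	if (m->alloc_fails || m->pool_used == MAX_SHEAVES)
		return NULL;

	struct sheaf *s = &m->pool[m->pool_used++];
	s->size = 0;
	return s;
}

/*
 * Try to queue obj on the rcu_free sheaf. Returns false when no sheaf is
 * available, in which case the caller would use the existing kfree_rcu() path.
 */
static bool model_kfree_rcu(struct model *m, void *obj)
{
	if (!m->rcu_free) {
		m->rcu_free = get_empty_sheaf(m);
		if (!m->rcu_free) {
			m->stat_free_rcu_sheaf_fail++;
			return false;
		}
	}

	struct sheaf *s = m->rcu_free;
	s->objects[s->size++] = obj;
	m->stat_free_rcu_sheaf++;

	if (s->size == SHEAF_CAPACITY) {
		/* Sheaf is full: detach it and hand it to the call_rcu() stand-in. */
		assert(m->nr_pending < MAX_SHEAVES);
		m->pending[m->nr_pending++] = s;
		m->rcu_free = NULL;
	}
	return true;
}

/* Simulate the end of a grace period: run the handler on each pending sheaf. */
static void model_grace_period(struct model *m)
{
	for (size_t i = 0; i < m->nr_pending; i++) {
		struct sheaf *s = m->pending[i];

		if (m->barn_full < BARN_MAX_FULL) {
			m->barn_full++;                 /* sheaf can be reused for allocations */
		} else {
			m->flushed_objects += s->size;  /* barn is full: bulk free to slabs */
			s->size = 0;
		}
	}
	m->nr_pending = 0;
}
```

Calling model_kfree_rcu() repeatedly on a zero-initialized `struct model` fills and detaches sheaves, and model_grace_period() then moves them to the barn or flushes them. Setting `alloc_fails` exercises the fallback path. The pool never recycles sheaves, which keeps the sketch short but is unlike a real allocator.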
Expected advantages:
- batching the kfree_rcu() operations, that could eventually replace the
  existing batching
- sheaves can be reused for allocations via barn instead of being
  flushed to slabs, which is more efficient
  - this includes cases where only some cpus are allowed to process rcu
    callbacks (Android)
Possible disadvantage:
- objects might be waiting for more than their grace period (it is
  determined by the last object freed into the sheaf), increasing memory
  usage - but the existing batching does that too.
Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
implementation favors smaller memory footprint over performance.
Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
contexts where kfree_rcu() is called might not be compatible with taking
a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
spinlock - the current kfree_rcu() implementation avoids doing that.
Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
that have them. This is not a cheap operation, but the barrier usage is
rare - currently kmem_cache_destroy() or on module unload.
Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
count how many kfree_rcu() used the rcu_free sheaf successfully and how
many had to fall back to the existing implementation.
Signed-off-by: Vlastimil Babka <[email protected]>
Hi Vlastimil,
This patch increases kmod selftest (stress module loader) runtime by
about 50-60%, from ~200s to ~300s total execution time. My tested kernel
has CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what
might be causing this, or how to address it?
I have been looking into a regression for Linux v6.18-rc where time
taken to run some internal graphics tests on our Tegra234 device has
increased from around 35% causing the tests to time out.
I meant 'increased by around 35%'.
Bisect is pointing to this commit and I also see we have
CONFIG_KVFREE_RCU_BATCHED=y.
I have not tried disabling CONFIG_KVFREE_RCU_BATCHED but I can. Are
there any downsides to disabling it?
Thanks
Jon