On Fri, Oct 31, 2025 at 10:32:54PM +0100, Daniel Gomez wrote:
> 
> 
> On 10/09/2025 10.01, Vlastimil Babka wrote:
> > Extend the sheaf infrastructure for more efficient kfree_rcu()
> > handling. For caches with sheaves, on each cpu maintain a rcu_free
> > sheaf in addition to main and spare sheaves.
> > 
> > kfree_rcu() operations will try to put objects on this sheaf. Once
> > full, the sheaf is detached and submitted to call_rcu() with a
> > handler that will try to put it in the barn, or flush to slab pages
> > using bulk free, when the barn is full. Then a new empty sheaf must
> > be obtained to put more objects there.
> > 
> > It's possible that no free sheaves are available to use for a new
> > rcu_free sheaf, and the allocation in kfree_rcu() context can only
> > use GFP_NOWAIT and thus may fail. In that case, fall back to the
> > existing kfree_rcu() implementation.
> > 
> > Expected advantages:
> > - batching the kfree_rcu() operations, that could eventually replace
> >   the existing batching
> > - sheaves can be reused for allocations via barn instead of being
> >   flushed to slabs, which is more efficient
> > - this includes cases where only some cpus are allowed to process
> >   rcu callbacks (Android)
> > 
> > Possible disadvantage:
> > - objects might be waiting for more than their grace period (it is
> >   determined by the last object freed into the sheaf), increasing
> >   memory usage - but the existing batching does that too.
> > 
> > Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> > implementation favors smaller memory footprint over performance.
> > 
> > Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as
> > the contexts where kfree_rcu() is called might not be compatible
> > with taking a barn spinlock or a GFP_NOWAIT allocation of a new
> > sheaf taking a spinlock - the current kfree_rcu() implementation
> > avoids doing that.
> > 
> > Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all
> > caches that have them. This is not a cheap operation, but the
> > barrier usage is rare - currently kmem_cache_destroy() or on module
> > unload.
> > 
> > Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail
> > to count how many kfree_rcu() used the rcu_free sheaf successfully
> > and how many had to fall back to the existing implementation.
> > 
> > Signed-off-by: Vlastimil Babka <[email protected]>
> 
> Hi Vlastimil,
> 
> This patch increases kmod selftest (stress module loader) runtime by
> about ~50-60%, from ~200s to ~300s total execution time. My tested
> kernel has CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions
> on what might be causing this, or how to address it?
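For anyone following along, the mechanism described above boils down
to roughly this fast path (untested, simplified sketch; the helper
names are made up and locking is omitted, so don't read it as the
actual patch code):

	static bool kfree_rcu_sheaf_sketch(struct kmem_cache *s, void *obj)
	{
		struct slub_percpu_sheaves *pcs;
		struct slab_sheaf *sheaf;

		/* only caches with sheaves opt in; otherwise old path */
		if (!cache_has_sheaves(s))		/* made-up helper */
			return false;

		pcs = this_cpu_sheaves(s);		/* made-up helper */
		sheaf = pcs->rcu_free;

		if (sheaf_is_full(sheaf)) {		/* made-up helper */
			/*
			 * Detach the full sheaf; the handler puts it in
			 * the barn or bulk-frees it to slab pages.
			 */
			call_rcu(&sheaf->rcu_head, rcu_free_sheaf_handler);

			/*
			 * kfree_rcu() context allows GFP_NOWAIT only, so
			 * this can fail -> fall back to the existing
			 * implementation.
			 */
			sheaf = alloc_empty_sheaf(s, GFP_NOWAIT);
			if (!sheaf)
				return false;
			pcs->rcu_free = sheaf;
		}

		sheaf->objects[sheaf->size++] = obj;
		return true;
	}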
This is likely due to kvfree_rcu_barrier() becoming more expensive
during module unload. It currently iterates over every (CPU x slab
cache) pair (only for caches that enabled sheaves, and there should be
only a few of those for now) to make sure each rcu_free sheaf is
flushed by the time kvfree_rcu_barrier() returns.
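Conceptually the flush is something like this (again an untested,
simplified sketch with made-up helper names, not the actual code):

	static void flush_all_rcu_sheaves_sketch(void)
	{
		struct kmem_cache *s;
		int cpu;

		mutex_lock(&slab_mutex);
		list_for_each_entry(s, &slab_caches, list) {
			if (!cache_has_sheaves(s))	/* made-up helper */
				continue;
			/* one flush per (cache, cpu) pair */
			for_each_possible_cpu(cpu)
				flush_rcu_sheaf(s, cpu); /* made-up helper */
		}
		mutex_unlock(&slab_mutex);

		/*
		 * Then wait for the call_rcu() handlers of the already
		 * detached sheaves to finish.
		 */
		rcu_barrier();
	}

So the cost is roughly num_possible_cpus() times the number of
sheaf-enabled caches, plus waiting for the RCU callbacks, on every
kvfree_rcu_barrier() call - which module unload hits.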
Just being curious, do you have any serious workload that depends on
the performance of module unload?

--
Cheers,
Harry / Hyeonggon
