On Wed, Sep 10, 2025 at 10:01:06AM +0200, Vlastimil Babka wrote:
> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> addition to main and spare sheaves.
>
> kfree_rcu() operations will try to put objects on this sheaf. Once full,
> the sheaf is detached and submitted to call_rcu() with a handler that
> will try to put it in the barn, or flush to slab pages using bulk free,
> when the barn is full. Then a new empty sheaf must be obtained to put
> more objects there.
>
> It's possible that no free sheaves are available to use for a new
> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> kfree_rcu() implementation.
>
> Expected advantages:
> - batching the kfree_rcu() operations, that could eventually replace the
>   existing batching
> - sheaves can be reused for allocations via barn instead of being
>   flushed to slabs, which is more efficient
>   - this includes cases where only some cpus are allowed to process rcu
>     callbacks (Android)
>
> Possible disadvantage:
> - objects might be waiting for more than their grace period (it is
>   determined by the last object freed into the sheaf), increasing memory
>   usage - but the existing batching does that too.
>
> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> implementation favors smaller memory footprint over performance.
>
> Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
> contexts where kfree_rcu() is called might not be compatible with taking
> a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
> spinlock - the current kfree_rcu() implementation avoids doing that.
>
> Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
> that have them. This is not a cheap operation, but the barrier usage is
> rare - currently kmem_cache_destroy() or on module unload.
>
> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
> count how many kfree_rcu() used the rcu_free sheaf successfully and how
> many had to fall back to the existing implementation.
>
> Signed-off-by: Vlastimil Babka <vba...@suse.cz>
> ---
>  mm/slab.h        |   3 +
>  mm/slab_common.c |  26 ++++++
>  mm/slub.c        | 266 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 293 insertions(+), 2 deletions(-)
>
> @@ -3840,6 +3895,80 @@ static void flush_all(struct kmem_cache *s)
>  	cpus_read_unlock();
>  }
>
> +/* needed for kvfree_rcu_barrier() */
> +void flush_all_rcu_sheaves()
> +{
> +	struct slub_percpu_sheaves *pcs;
> +	struct slub_flush_work *sfw;
> +	struct kmem_cache *s;
> +	bool flushed = false;
> +	unsigned int cpu;
> +
> +	cpus_read_lock();
> +	mutex_lock(&slab_mutex);
> +
> +	list_for_each_entry(s, &slab_caches, list) {
> +		if (!s->cpu_sheaves)
> +			continue;
> +
> +		mutex_lock(&flush_lock);
> +
> +		for_each_online_cpu(cpu) {
> +			sfw = &per_cpu(slub_flush, cpu);
> +			pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> +
> +			if (!pcs->rcu_free || !pcs->rcu_free->size) {

Is the compiler allowed to compile this to read pcs->rcu_free twice?
Something like:

flush_all_rcu_sheaves()                 __kfree_rcu_sheaf()

pcs->rcu_free != NULL
                                        pcs->rcu_free = NULL
pcs->rcu_free == NULL
/* NULL-pointer-deref */
pcs->rcu_free->size

> +				sfw->skip = true;
> +				continue;
> +			}
>
> +			INIT_WORK(&sfw->work, flush_rcu_sheaf);
> +			sfw->skip = false;
> +			sfw->s = s;
> +			queue_work_on(cpu, flushwq, &sfw->work);
> +			flushed = true;
> +		}
> +
> +		for_each_online_cpu(cpu) {
> +			sfw = &per_cpu(slub_flush, cpu);
> +			if (sfw->skip)
> +				continue;
> +			flush_work(&sfw->work);
> +		}
> +
> +		mutex_unlock(&flush_lock);
> +	}
> +
> +	mutex_unlock(&slab_mutex);
> +	cpus_read_unlock();
> +
> +	if (flushed)
> +		rcu_barrier();

I think we need to call rcu_barrier() even if flushed == false?

Maybe a kvfree_rcu()'d object was already waiting for the rcu callback to
be processed before flush_all_rcu_sheaves() is called, and in
flush_all_rcu_sheaves() we skipped all (cache, cpu) pairs, so
flushed == false but the rcu callback isn't processed yet by the end of
the function?

That sounds very unlikely to happen in a realistic scenario, but it is
still possible...

-- 
Cheers,
Harry / Hyeonggon
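
To illustrate the first question above (the possible double read of pcs->rcu_free), here is a minimal userspace sketch of taking one snapshot of the pointer and checking only that snapshot. It is not the patch's code: the struct definitions are trimmed-down hypothetical stand-ins, the helper name rcu_sheaf_is_empty() is made up for illustration, and READ_ONCE() is re-created here with a volatile cast (GNU C __typeof__) rather than taken from the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace stand-in for the kernel's READ_ONCE(): the volatile access
 * makes the compiler emit exactly one load. Relies on GNU C __typeof__.
 */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical, trimmed-down stand-ins for the structures in the patch. */
struct slab_sheaf {
	unsigned int size;
};

struct slub_percpu_sheaves {
	struct slab_sheaf *rcu_free;
};

/* Hypothetical helper: true when this CPU has nothing to flush. */
static bool rcu_sheaf_is_empty(struct slub_percpu_sheaves *pcs)
{
	struct slab_sheaf *sheaf;

	assert(pcs != NULL);

	/* Load the pointer once; both checks below use the same snapshot. */
	sheaf = READ_ONCE(pcs->rcu_free);

	return !sheaf || !sheaf->size;
}
```

This only removes the NULL dereference caused by a second load of the pointer. Whether the sheaf behind the snapshot can still be detached or reused concurrently by the other CPU is a separate question that the sketch does not answer.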