On Tue, Nov 12, 2024 at 05:38:46PM +0100, Vlastimil Babka wrote:
> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> For caches where sheafs are initialized, on each cpu maintain a rcu_free
> sheaf in addition to main and spare sheaves.
>
> kfree_rcu() operations will try to put objects on this sheaf. Once full,
> the sheaf is detached and submitted to call_rcu() with a handler that
> will try to put in on the barn, or flush to slab pages using bulk free,
> when the barn is full. Then a new empty sheaf must be obtained to put
> more objects there.
>
> It's possible that no free sheafs are available to use for a new
> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> kfree_rcu() machinery.
>
> Because some intended users will need to perform additonal cleanups
> after the grace period and thus have custom rcu_call() callbacks today,
> add the possibility to specify a kfree_rcu() specific destructor.
> Because of the fall back possibility, the destructor now needs be
> invoked also from within RCU, so add __kvfree_rcu() that RCU can use
> instead of kvfree().
>
> Expected advantages:
> - batching the kfree_rcu() operations, that could eventually replace the
>   batching done in RCU itself
> - sheafs can be reused via barn instead of being flushed to slabs, which
>   is more effective
> - this includes cases where only some cpus are allowed to process rcu
>   callbacks (Android)
>
> Possible disadvantage:
> - objects might be waiting for more than their grace period (it is
>   determined by the last object freed into the sheaf), increasing memory
>   usage - but that might be true for the batching done by RCU as well?
>
> RFC LIMITATIONS:
> - only tree rcu is converted, not tiny
> - the rcu fallback might resort to kfree_bulk(), not kvfree(). Instead
>   of adding a variant of kfree_bulk() with destructors, is there an easy
>   way to disable the kfree_bulk() path in the fallback case?
>
> Signed-off-by: Vlastimil Babka <vba...@suse.cz>
> ---
>  include/linux/slab.h |  15 +++++
>  kernel/rcu/tree.c    |   8 ++-
>  mm/slab.h            |  25 +++++++
>  mm/slab_common.c     |   3 +
>  mm/slub.c            | 182 +++++++++++++++++++++++++++++++++++++++++++++++++--
>  5 files changed, 227 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index b13fb1c1f03c14a5b45bc6a64a2096883aef9f83..23904321992ad2eeb9389d0883cf4d5d5d71d896 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -343,6 +343,21 @@ struct kmem_cache_args {
>  	 * %0 means no sheaves will be created
>  	 */
>  	unsigned int sheaf_capacity;
> +	/**
> +	 * @sheaf_rcu_dtor: A destructor for objects freed by kfree_rcu()
> +	 *
> +	 * Only valid when non-zero @sheaf_capacity is specified. When freeing
> +	 * objects by kfree_rcu() in a cache with sheaves, the objects are put
> +	 * to a special percpu sheaf. When that sheaf is full, it's passed to
> +	 * call_rcu() and after a grace period the sheaf can be reused for new
> +	 * allocations. In case a cleanup is necessary after the grace period
> +	 * and before reusal, a pointer to such function can be given as
> +	 * @sheaf_rcu_dtor and will be called on each object in the rcu sheaf
> +	 * after the grace period passes and before the sheaf's reuse.
> +	 *
> +	 * %NULL means no destructor is called.
> +	 */
> +	void (*sheaf_rcu_dtor)(void *obj);
>  };
>
>  struct kmem_cache *__kmem_cache_create_args(const char *name,
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index b1f883fcd9185a5e22c10102d1024c40688f57fb..42c994fdf9960bfed8d8bd697de90af72c1f4f58 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -65,6 +65,7 @@
>  #include <linux/kasan.h>
>  #include <linux/context_tracking.h>
>  #include "../time/tick-internal.h"
> +#include "../../mm/slab.h"
>
>  #include "tree.h"
>  #include "rcu.h"
> @@ -3420,7 +3421,7 @@ kvfree_rcu_list(struct rcu_head *head)
>  		trace_rcu_invoke_kvfree_callback(rcu_state.name, head, offset);
>
>  		if (!WARN_ON_ONCE(!__is_kvfree_rcu_offset(offset)))
> -			kvfree(ptr);
> +			__kvfree_rcu(ptr);
>
>  		rcu_lock_release(&rcu_callback_map);
>  		cond_resched_tasks_rcu_qs();
> @@ -3797,6 +3798,9 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
>  	if (!head)
>  		might_sleep();
>
> +	if (kfree_rcu_sheaf(ptr))
> +		return;
> +

This change cuts across all the effort that has gone into improving kvfree_rcu() :)
For example: performance, app launch improvements for Android devices;
memory consumption optimizations to minimize LMK triggering; batching to
speed up offloading; etc. So we have done a lot of work there.

We were thinking about moving all of that functionality from "kernel/rcu"
to "mm/". As a first step I can do that, i.e. move kvfree_rcu() as is.
After that we can move on to the second step.

Does that sound good to you?

--
Uladzislau Rezki