On Wed, Jul 23, 2025 at 03:34:34PM +0200, Vlastimil Babka wrote:
> Specifying a non-zero value for a new struct kmem_cache_args field
> sheaf_capacity will set up a caching layer of percpu arrays called
> sheaves of given capacity for the created cache.
> 
> Allocations from the cache will allocate via the percpu sheaves (main or
> spare) as long as they have no NUMA node preference. Frees will also
> put the object back into one of the sheaves.
> 
> When both percpu sheaves are found empty during an allocation, an empty
> sheaf may be replaced with a full one from the per-node barn. If none
> are available and the allocation is allowed to block, an empty sheaf is
> refilled from slab(s) by an internal bulk alloc operation. When both
> percpu sheaves are full during freeing, the barn can replace a full one
> with an empty one, unless over a full sheaves limit. In that case a
> sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
> sheaves and barns is also wired to the existing cpu flushing and cache
> shrinking operations.
> 
> The sheaves do not distinguish NUMA locality of the cached objects. If
> an allocation is requested with kmem_cache_alloc_node() (or a mempolicy
> with strict_numa mode enabled) with a specific node (not NUMA_NO_NODE),
> the sheaves are bypassed.
> 
> The bulk operations exposed to slab users also try to utilize the
> sheaves as long as the necessary (full or empty) sheaves are available
> on the cpu or in the barn. Once depleted, they will fall back to bulk
> alloc/free to slabs directly to avoid double copying.
> 
> The sheaf_capacity value is exported in sysfs for observability.
> 
> Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf
> count objects allocated or freed using the sheaves (and thus not
> counting towards the other alloc/free path counters). Counters
> sheaf_refill and sheaf_flush count objects filled or flushed from or to
> slab pages, and can be used to assess how effective the caching is. The
> refill and flush operations will also count towards the usual
> alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for
> the backing slabs. For barn operations, barn_get and barn_put count how
> many full sheaves were taken from or put to the barn; the _fail variants
> count how many such requests could not be satisfied, mainly because the
> barn was either empty or full. While the barn also holds empty sheaves
> to make some operations easier, these are not as critical to mandate
> their own counters. Finally, there are sheaf_alloc/sheaf_free counters.
> 
> Access to the percpu sheaves is protected by local_trylock() when
> potential callers include irq context, and local_lock() otherwise (such
> as when we already know the gfp flags allow blocking). The trylock
> failures should be rare and we can easily fall back. Each per-NUMA-node
> barn has a spin_lock.
> 
> When slub_debug is enabled for a cache with sheaf_capacity also
> specified, the latter is ignored so that allocations and frees reach the
> slow path where debugging hooks are processed. Similarly, we ignore it
> with CONFIG_SLUB_TINY, which prefers low memory usage to performance.
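
To restate my understanding of the design before getting to the code: each
cpu caches objects in two fixed-size percpu arrays ("main" and "spare"), and
the per-node barns hold detached full/empty sheaves that cpus can swap with.
Very roughly, I read the intended alloc fast path as the sketch below
(simplified pseudo-C; the struct layout and helper name are my own
approximation, not the actual patch code):

/*
 * Rough sketch only -- field names and layout approximate the patch,
 * they are not the real definitions.
 */
struct slab_sheaf {
	unsigned int size;		/* number of cached object pointers */
	void *objects[];		/* up to s->sheaf_capacity entries */
};

struct slub_percpu_sheaves {
	local_trylock_t lock;
	struct slab_sheaf *main;	/* alloc/free hits this one first */
	struct slab_sheaf *spare;	/* swapped with main when needed */
	struct node_barn *barn;		/* this node's store of extra sheaves */
};

/* Hypothetical fast path: pop from the main sheaf if it has objects. */
static inline void *sheaf_alloc_fastpath(struct kmem_cache *s)
{
	struct slub_percpu_sheaves *pcs;
	void *object = NULL;

	if (!local_trylock(&s->cpu_sheaves->lock))
		return NULL;		/* caller falls back to the slab paths */

	pcs = this_cpu_ptr(s->cpu_sheaves);
	if (likely(pcs->main->size))
		object = pcs->main->objects[--pcs->main->size];

	local_unlock(&s->cpu_sheaves->lock);

	return object;			/* NULL means take the slower paths */
}

If that matches your intent, the rest of my comments are inline below.
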
> 
> Signed-off-by: Vlastimil Babka <vba...@suse.cz>
> ---
>  include/linux/slab.h |   31 ++
>  mm/slab.h            |    2 +
>  mm/slab_common.c     |    5 +-
>  mm/slub.c            | 1101 +++++++++++++++++++++++++++++++++++++++++++++++---
>  4 files changed, 1092 insertions(+), 47 deletions(-)
> 
> @@ -4554,6 +5164,274 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>  		discard_slab(s, slab);
>  }
>  
> +/*
> + * pcs is locked. We should have get rid of the spare sheaf and obtained an
> + * empty sheaf, while the main sheaf is full. We want to install the empty sheaf
> + * as a main sheaf, and make the current main sheaf a spare sheaf.
> + *
> + * However due to having relinquished the cpu_sheaves lock when obtaining
> + * the empty sheaf, we need to handle some unlikely but possible cases.
> + *
> + * If we put any sheaf to barn here, it's because we were interrupted or have
> + * been migrated to a different cpu, which should be rare enough so just ignore
> + * the barn's limits to simplify the handling.
> + *
> + * An alternative scenario that gets us here is when we fail
> + * barn_replace_full_sheaf(), because there's no empty sheaf available in the
> + * barn, so we had to allocate it by alloc_empty_sheaf(). But because we saw the
> + * limit on full sheaves was not exceeded, we assume it didn't change and just
> + * put the full sheaf there.
> + */
> +static void __pcs_install_empty_sheaf(struct kmem_cache *s,
> +		struct slub_percpu_sheaves *pcs, struct slab_sheaf *empty)
> +{
> +	/* This is what we expect to find if nobody interrupted us. */
> +	if (likely(!pcs->spare)) {
> +		pcs->spare = pcs->main;
> +		pcs->main = empty;
> +		return;
> +	}
> +
> +	/*
> +	 * Unlikely because if the main sheaf had space, we would have just
> +	 * freed to it. Get rid of our empty sheaf.
> +	 */
> +	if (pcs->main->size < s->sheaf_capacity) {
> +		barn_put_empty_sheaf(pcs->barn, empty);
> +		return;
> +	}
> +
> +	/* Also unlikely for the same reason/ */
nit: unnecessary '/'

> +	if (pcs->spare->size < s->sheaf_capacity) {
> +		swap(pcs->main, pcs->spare);
> +		barn_put_empty_sheaf(pcs->barn, empty);
> +		return;
> +	}
> +
> +	/*
> +	 * We probably failed barn_replace_full_sheaf() due to no empty sheaf
> +	 * available there, but we allocated one, so finish the job.
> +	 */
> +	barn_put_full_sheaf(pcs->barn, pcs->main);
> +	stat(s, BARN_PUT);
> +	pcs->main = empty;
> +}
> +static struct slub_percpu_sheaves *
> +__pcs_handle_full(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
> +{
> +	struct slab_sheaf *empty;
> +	bool put_fail;
> +
> +restart:
> +	put_fail = false;
> +
> +	if (!pcs->spare) {
> +		empty = barn_get_empty_sheaf(pcs->barn);
> +		if (empty) {
> +			pcs->spare = pcs->main;
> +			pcs->main = empty;
> +			return pcs;
> +		}
> +		goto alloc_empty;
> +	}
> +
> +	if (pcs->spare->size < s->sheaf_capacity) {
> +		swap(pcs->main, pcs->spare);
> +		return pcs;
> +	}
> +
> +	empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> +
> +	if (!IS_ERR(empty)) {
> +		stat(s, BARN_PUT);
> +		pcs->main = empty;
> +		return pcs;
> +	}
> +
> +	if (PTR_ERR(empty) == -E2BIG) {
> +		/* Since we got here, spare exists and is full */
> +		struct slab_sheaf *to_flush = pcs->spare;
> +
> +		stat(s, BARN_PUT_FAIL);
> +
> +		pcs->spare = NULL;
> +		local_unlock(&s->cpu_sheaves->lock);
> +
> +		sheaf_flush_unused(s, to_flush);
> +		empty = to_flush;
> +		goto got_empty;
> +	}
> +
> +	/*
> +	 * We could not replace full sheaf because barn had no empty
> +	 * sheaves. We can still allocate it and put the full sheaf in
> +	 * __pcs_install_empty_sheaf(), but if we fail to allocate it,
> +	 * make sure to count the fail.
> +	 */
> +	put_fail = true;
> +
> +alloc_empty:
> +	local_unlock(&s->cpu_sheaves->lock);
> +
> +	empty = alloc_empty_sheaf(s, GFP_NOWAIT);
> +	if (empty)
> +		goto got_empty;
> +
> +	if (put_fail)
> +		stat(s, BARN_PUT_FAIL);
> +
> +	if (!sheaf_flush_main(s))
> +		return NULL;
> +
> +	if (!local_trylock(&s->cpu_sheaves->lock))
> +		return NULL;
> +
> +	/*
> +	 * we flushed the main sheaf so it should be empty now,
> +	 * but in case we got preempted or migrated, we need to
> +	 * check again
> +	 */
> +	if (pcs->main->size == s->sheaf_capacity)
> +		goto restart;

I think it's missing:

	pcs = this_cpu_ptr(s->cpu_sheaves);

between the local_trylock() and the read of pcs->main->size, since we may
have been migrated to another cpu while the lock was not held (see the
sketch at the end of this mail).

> +
> +	return pcs;
> +
> +got_empty:
> +	if (!local_trylock(&s->cpu_sheaves->lock)) {
> +		barn_put_empty_sheaf(pcs->barn, empty);
> +		return NULL;
> +	}
> +
> +	pcs = this_cpu_ptr(s->cpu_sheaves);
> +	__pcs_install_empty_sheaf(s, pcs, empty);
> +
> +	return pcs;
> +}
> +
>  #ifndef CONFIG_SLUB_TINY
>  /*
>   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
> @@ -6481,7 +7464,6 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
>  		__kmem_cache_release(s);
>  		return err;
>  	}
> -

nit: unnecessary removal of a newline?

Otherwise looks good to me.

>  #ifdef SLAB_SUPPORTS_SYSFS
>  static int count_inuse(struct slab *slab)
>  {

-- 
Cheers,
Harry / Hyeonggon
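
P.S. To make the this_cpu_ptr() suggestion above concrete, this is roughly
the ordering I had in mind for the block after sheaf_flush_main() -- an
untested sketch to illustrate the point, not a ready-to-apply diff:

	if (!local_trylock(&s->cpu_sheaves->lock))
		return NULL;

	/*
	 * We dropped the lock above, so we may have been migrated to
	 * another cpu in the meantime; re-read the percpu pointer before
	 * looking at the main sheaf.
	 */
	pcs = this_cpu_ptr(s->cpu_sheaves);

	/*
	 * we flushed the main sheaf so it should be empty now,
	 * but in case we got preempted or migrated, we need to
	 * check again
	 */
	if (pcs->main->size == s->sheaf_capacity)
		goto restart;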