On Wed, Jul 23, 2025 at 03:34:34PM +0200, Vlastimil Babka wrote:
> Specifying a non-zero value for a new struct kmem_cache_args field
> sheaf_capacity will set up a caching layer of percpu arrays called
> sheaves of the given capacity for the created cache.
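
For readers not yet familiar with the args API, usage boils down to
something like this (cache name, object type and capacity value are
made up for illustration):

    struct kmem_cache_args args = {
            /* each percpu sheaf caches up to 32 objects */
            .sheaf_capacity = 32,
    };

    cache = kmem_cache_create("my_cache", sizeof(struct my_obj),
                              &args, SLAB_HWCACHE_ALIGN);
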
> 
> Allocations from the cache will allocate via the percpu sheaves (main or
> spare) as long as they have no NUMA node preference. Frees will also
> put the object back into one of the sheaves.
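
So the fast path is essentially a pop from the main sheaf's object
array, roughly like below (my simplified sketch ignoring locking,
stats and the refill path; the helper name is invented):

    static inline void *pcs_alloc_sketch(struct kmem_cache *s)
    {
            struct slub_percpu_sheaves *pcs = this_cpu_ptr(s->cpu_sheaves);

            if (likely(pcs->main->size > 0))
                    return pcs->main->objects[--pcs->main->size];

            /* main is empty: try the spare, the barn, or a bulk refill */
            return NULL;
    }
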
> 
> When both percpu sheaves are found empty during an allocation, an empty
> sheaf may be replaced with a full one from the per-node barn. If none
> are available and the allocation is allowed to block, an empty sheaf is
> refilled from slab(s) by an internal bulk alloc operation. When both
> percpu sheaves are full during freeing, the barn can replace a full one
> with an empty one, unless it is over its limit of full sheaves. In that
> case a sheaf is flushed to slab(s) by an internal bulk free operation.
> Flushing sheaves and barns is also wired to the existing cpu flushing
> and cache shrinking operations.
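
If I read this correctly, the empty-on-alloc case works roughly like
this (helper names guessed from the description; locking and error
handling omitted):

    full = barn_replace_empty_sheaf(pcs->barn, pcs->main);
    if (full) {
            /* swap our empty main sheaf for a full one from the barn */
            pcs->main = full;
    } else if (gfpflags_allow_blocking(gfp)) {
            /* refill the empty main sheaf by an internal bulk alloc */
            refill_sheaf(s, pcs->main, gfp);
    }
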
> 
> The sheaves do not distinguish NUMA locality of the cached objects. If
> an allocation requests a specific node (not NUMA_NO_NODE), whether via
> kmem_cache_alloc_node() or via a mempolicy with strict_numa mode
> enabled, the sheaves are bypassed.
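
i.e. the sheaf paths start with a check along these lines (sketch):

    if (unlikely(node != NUMA_NO_NODE))
            return NULL;    /* bypass sheaves, take the regular slab path */
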
> 
> The bulk operations exposed to slab users also try to utilize the
> sheaves as long as the necessary (full or empty) sheaves are available
> on the cpu or in the barn. Once depleted, they fall back to bulk
> alloc/free to slabs directly to avoid double copying.
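
So a bulk alloc drains the main sheaf straight into the caller's array
before touching the slabs, along these lines (sketch; I'm assuming the
existing internal __kmem_cache_alloc_bulk() as the fallback):

    batch = min(size, pcs->main->size);
    pcs->main->size -= batch;
    memcpy(p, pcs->main->objects + pcs->main->size,
           batch * sizeof(void *));

    if (batch < size)       /* sheaves depleted: go to slabs directly */
            __kmem_cache_alloc_bulk(s, gfp, size - batch, p + batch);
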
> 
> The sheaf_capacity value is exported in sysfs for observability.
> 
> Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf
> count objects allocated or freed using the sheaves (and thus not
> counting towards the other alloc/free path counters). The counters
> sheaf_refill and sheaf_flush count objects refilled from or flushed to
> slab pages, and can be used to assess how effective the caching is. The
> refill and flush operations will also count towards the usual
> alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for
> the backing slabs.  For barn operations, barn_get and barn_put count
> how many full sheaves were taken from or put to the barn, and the _fail
> variants count how many such requests could not be satisfied, mainly
> because the barn was either empty or full. While the barn also holds
> empty sheaves to make some operations easier, these are not critical
> enough to warrant their own counters. Finally, sheaf_alloc/sheaf_free
> count allocations and frees of the sheaves themselves.
> 
> Access to the percpu sheaves is protected by local_trylock() when
> potential callers include irq context, and local_lock() otherwise (such
> as when we already know the gfp flags allow blocking). The trylock
> failures should be rare and we can easily fall back. Each per-NUMA-node
> barn has a spin_lock.
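
So the callers follow the pattern that is also visible in the hunks
below:

    if (!local_trylock(&s->cpu_sheaves->lock))
            return NULL;    /* rare; fall back to the regular path */

    pcs = this_cpu_ptr(s->cpu_sheaves);

    /* ... operate on pcs->main / pcs->spare ... */

    local_unlock(&s->cpu_sheaves->lock);
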
> 
> When slub_debug is enabled for a cache with sheaf_capacity also
> specified, the latter is ignored so that allocations and frees reach the
> slow path where debugging hooks are processed. Similarly, we ignore it
> with CONFIG_SLUB_TINY, which prefers low memory usage to performance.
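
I imagine the gating at cache creation looks something like this (my
sketch, not necessarily the patch's exact code):

    if (args->sheaf_capacity && !IS_ENABLED(CONFIG_SLUB_TINY) &&
        !kmem_cache_debug(s))
            s->sheaf_capacity = args->sheaf_capacity;
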
> 
> Signed-off-by: Vlastimil Babka <vba...@suse.cz>
> ---
>  include/linux/slab.h |   31 ++
>  mm/slab.h            |    2 +
>  mm/slab_common.c     |    5 +-
>  mm/slub.c            | 1101 +++++++++++++++++++++++++++++++++++++++++++++++---
>  4 files changed, 1092 insertions(+), 47 deletions(-)
> 
> @@ -4554,6 +5164,274 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>       discard_slab(s, slab);
>  }
>  
> +/*
> + * pcs is locked. We should have gotten rid of the spare sheaf and obtained
> + * an empty sheaf, while the main sheaf is full. We want to install the
> + * empty sheaf as the main sheaf, and make the current main sheaf the spare.
> + *
> + * However, due to having relinquished the cpu_sheaves lock when obtaining
> + * the empty sheaf, we need to handle some unlikely but possible cases.
> + *
> + * If we put any sheaf to the barn here, it's because we were interrupted
> + * or have been migrated to a different cpu, which should be rare enough,
> + * so just ignore the barn's limits to simplify the handling.
> + *
> + * An alternative scenario that gets us here is when we fail
> + * barn_replace_full_sheaf(), because there's no empty sheaf available in
> + * the barn, so we had to allocate one with alloc_empty_sheaf(). But because
> + * we saw the limit on full sheaves was not exceeded, we assume it didn't
> + * change and just put the full sheaf there.
> + */
> +static void __pcs_install_empty_sheaf(struct kmem_cache *s,
> +             struct slub_percpu_sheaves *pcs, struct slab_sheaf *empty)
> +{
> +     /* This is what we expect to find if nobody interrupted us. */
> +     if (likely(!pcs->spare)) {
> +             pcs->spare = pcs->main;
> +             pcs->main = empty;
> +             return;
> +     }
> +
> +     /*
> +      * Unlikely because if the main sheaf had space, we would have just
> +      * freed to it. Get rid of our empty sheaf.
> +      */
> +     if (pcs->main->size < s->sheaf_capacity) {
> +             barn_put_empty_sheaf(pcs->barn, empty);
> +             return;
> +     }
> +
> +     /* Also unlikely for the same reason/ */

nit: unnecessary '/'

> +     if (pcs->spare->size < s->sheaf_capacity) {
> +             swap(pcs->main, pcs->spare);
> +             barn_put_empty_sheaf(pcs->barn, empty);
> +             return;
> +     }
> +
> +     /*
> +      * We probably failed barn_replace_full_sheaf() due to no empty sheaf
> +      * available there, but we allocated one, so finish the job.
> +      */
> +     barn_put_full_sheaf(pcs->barn, pcs->main);
> +     stat(s, BARN_PUT);
> +     pcs->main = empty;
> +}

> +static struct slub_percpu_sheaves *
> +__pcs_handle_full(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
> +{
> +     struct slab_sheaf *empty;
> +     bool put_fail;
> +
> +restart:
> +     put_fail = false;
> +
> +     if (!pcs->spare) {
> +             empty = barn_get_empty_sheaf(pcs->barn);
> +             if (empty) {
> +                     pcs->spare = pcs->main;
> +                     pcs->main = empty;
> +                     return pcs;
> +             }
> +             goto alloc_empty;
> +     }
> +
> +     if (pcs->spare->size < s->sheaf_capacity) {
> +             swap(pcs->main, pcs->spare);
> +             return pcs;
> +     }
> +
> +     empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> +
> +     if (!IS_ERR(empty)) {
> +             stat(s, BARN_PUT);
> +             pcs->main = empty;
> +             return pcs;
> +     }
> +
> +     if (PTR_ERR(empty) == -E2BIG) {
> +             /* Since we got here, spare exists and is full */
> +             struct slab_sheaf *to_flush = pcs->spare;
> +
> +             stat(s, BARN_PUT_FAIL);
> +
> +             pcs->spare = NULL;
> +             local_unlock(&s->cpu_sheaves->lock);
> +
> +             sheaf_flush_unused(s, to_flush);
> +             empty = to_flush;
> +             goto got_empty;
> +     }
> +
> +     /*
> +      * We could not replace the full sheaf because the barn had no empty
> +      * sheaves. We can still allocate one and put the full sheaf in
> +      * __pcs_install_empty_sheaf(), but if we fail to allocate it,
> +      * make sure to count the fail.
> +      */
> +     put_fail = true;
> +
> +alloc_empty:
> +     local_unlock(&s->cpu_sheaves->lock);
> +
> +     empty = alloc_empty_sheaf(s, GFP_NOWAIT);
> +     if (empty)
> +             goto got_empty;
> +
> +     if (put_fail)
> +             stat(s, BARN_PUT_FAIL);
> +
> +     if (!sheaf_flush_main(s))
> +             return NULL;
> +
> +     if (!local_trylock(&s->cpu_sheaves->lock))
> +             return NULL;
> +
> +     /*
> +      * we flushed the main sheaf so it should be empty now,
> +      * but in case we got preempted or migrated, we need to
> +      * check again
> +      */
> +     if (pcs->main->size == s->sheaf_capacity)
> +             goto restart;

I think it's missing:

pcs = this_cpu_ptr(s->cpu_sheaves);

between local_trylock() and reading pcs->main->size.

> +
> +     return pcs;
> +
> +got_empty:
> +     if (!local_trylock(&s->cpu_sheaves->lock)) {
> +             barn_put_empty_sheaf(pcs->barn, empty);
> +             return NULL;
> +     }
> +
> +     pcs = this_cpu_ptr(s->cpu_sheaves);
> +     __pcs_install_empty_sheaf(s, pcs, empty);
> +
> +     return pcs;
> +}
> +
>  #ifndef CONFIG_SLUB_TINY
>  /*
>   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
> @@ -6481,7 +7464,6 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
>               __kmem_cache_release(s);
>       return err;
>  }
> -

nit: unnecessary removal of a newline?

Otherwise looks good to me.

>  #ifdef SLAB_SUPPORTS_SYSFS
>  static int count_inuse(struct slab *slab)
>  {

-- 
Cheers,
Harry / Hyeonggon
