On Thu, Mar 26, 2026 at 10:58 AM Will Deacon <[email protected]> wrote:
>
> On Thu, Mar 26, 2026 at 03:26:07AM -0700, Puranjay Mohan wrote:
> > On architectures like arm64, this_cpu_inc() wraps the underlying atomic
> > instruction (ldadd) with preempt_disable/enable to prevent migration
> > between the per-CPU address calculation and the atomic operation.
> > However, SRCU does not need this protection because it sums counters
> > across all CPUs for grace-period detection, so operating on a "stale"
> > CPU's counter after migration is harmless.
> >
> > This commit therefore introduces srcu_percpu_counter_inc(), which
> > consolidates the SRCU-fast reader counter updates into a single helper,
> > replacing the if/else dispatch between this_cpu_inc() and
> > atomic_long_inc(raw_cpu_ptr(...)) that was previously open-coded at
> > each call site.
> >
> > On arm64, this helper uses atomic_long_fetch_add_relaxed(), which
> > compiles to the value-returning ldadd instruction. This is preferred
> > over atomic_long_inc()'s non-value-returning stadd because ldadd is
> > resolved in L1 cache whereas stadd may be resolved further out in the
> > memory hierarchy [1].
> >
> > On x86, where this_cpu_inc() compiles to a single "incl %gs:offset"
> > instruction with no preempt wrappers, the helper falls through to
> > this_cpu_inc(), so there is no change. Architectures with
> > NEED_SRCU_NMI_SAFE continue to use atomic_long_inc(raw_cpu_ptr(...)),
> > again with no change. All remaining architectures also use the
> > this_cpu_inc() path, again with no change.
> >
> > refscale measurements on a 72-CPU arm64 Neoverse-V2 system show ~11%
> > improvement in SRCU-fast reader duration:
> >
> > Unpatched: median 9.273 ns, avg 9.319 ns (min 9.219, max 9.853)
> > Patched:   median 8.275 ns, avg 8.411 ns (min 8.186, max 9.183)
> >
> > Command: kvm.sh --torture refscale --duration 1 --cpus 72 \
> >          --configs NOPREEMPT --trust-make --bootargs \
> >          "refscale.scale_type=srcu-fast refscale.nreaders=72 \
> >          refscale.nruns=100"
> >
> > [1] https://lore.kernel.org/r/e7d539ed-ced0-4b96-8ecd-048a5b803b85@paulmck-laptop
> >
> > Signed-off-by: Puranjay Mohan <[email protected]>
> > ---
> >  include/linux/srcutree.h | 51 +++++++++++++++++++++++++++----------------
> >  1 file changed, 35 insertions(+), 16 deletions(-)
> >
> > diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
> > index fd1a9270cb9a..4ff18de3edfd 100644
> > --- a/include/linux/srcutree.h
> > +++ b/include/linux/srcutree.h
> > @@ -286,15 +286,43 @@ static inline struct srcu_ctr __percpu *__srcu_ctr_to_ptr(struct srcu_struct *ss
> >   * on architectures that support NMIs but do not supply NMI-safe
> >   * implementations of this_cpu_inc().
> >   */
> > +
> > +/*
> > + * Atomically increment a per-CPU SRCU counter.
> > + *
> > + * On most architectures, this_cpu_inc() is optimal (e.g., on x86 it is
> > + * a single "incl %gs:offset" instruction). However, on architectures
> > + * like arm64, s390, and loongarch, this_cpu_inc() wraps the underlying
> > + * atomic instruction with preempt_disable/enable to prevent migration
> > + * between the per-CPU address calculation and the atomic operation.
> > + * SRCU does not need this protection because it sums counters across
> > + * all CPUs for grace-period detection, so operating on a "stale" CPU's
> > + * counter after migration is harmless.
> > + *
> > + * On arm64, use atomic_long_fetch_add_relaxed() which compiles to the
> > + * value-returning ldadd instruction instead of atomic_long_inc()'s
> > + * non-value-returning stadd, because ldadd is resolved in L1 cache
> > + * whereas stadd may be resolved further out in the memory hierarchy.
> > + * https://lore.kernel.org/r/e7d539ed-ced0-4b96-8ecd-048a5b803b85@paulmck-laptop
> > + */
> > +static __always_inline void
> > +srcu_percpu_counter_inc(atomic_long_t __percpu *v)
> > +{
> > +#ifdef CONFIG_ARM64
> > +	(void)atomic_long_fetch_add_relaxed(1, raw_cpu_ptr(v));
> > +#elif IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE)
> > +	atomic_long_inc(raw_cpu_ptr(v));
> > +#else
> > +	this_cpu_inc(v->counter);
> > +#endif
> > +}
>
> No, this is a hack. arm64 shouldn't be treated specially here.
>
> The ldadd issue was already fixed properly in
> git.kernel.org/linus/535fdfc5a2285. If you want to improve our preempt
> disable/enable code or add helpers that don't require that, then patches
> are welcome, but bodging random callers with arch-specific code for a
> micro-benchmark is completely the wrong approach.
Thanks for the feedback.

I basically want to remove the overhead of preempt disable/enable that
comes with this_cpu_*(), because in SRCU (and maybe in other places too)
we don't need that safety.

One way would be to define raw_cpu_add_* helpers in
arch/arm64/include/asm/percpu.h, but that wouldn't be good for existing
callers of raw_cpu_add(), as raw_cpu_add() currently resolves to
raw_cpu_generic_to_op(pcp, val, +=), which is not atomic.

Another way would be to add new helpers that do per-CPU atomics without
preempt disable/enable.

Do you think this optimization is worth doing, or should I just drop it?

Thanks,
Puranjay

