On Tue, Jan 06, 2026 at 03:40:04PM -0500, Joel Fernandes wrote:
> On 1/6/2026 2:17 PM, Paul E. McKenney wrote:
> > On Mon, Jan 05, 2026 at 07:55:18PM -0500, Joel Fernandes wrote:
> [..]
> >>>> The optimized version maintains stable performance with essentially
> >>>> close to zero rnp->lock overhead.
> >>>>
> >>>> rcutorture Testing
> >>>> ------------------
> >>>> TREE03 testing with rcutorture completes without RCU or hotplug
> >>>> errors. More testing is in progress.
> >>>>
> >>>> Note: I have added a CONFIG_RCU_PER_CPU_BLOCKED_LISTS to guard the
> >>>> feature, but the plan is to eventually turn this on all the time.
> >>>
> >>> Yes, Aravinda, Gopinath, and I did publish that paper back in the day
> >>> (with Aravinda having done almost all the work), but it was an artificial
> >>> workload. Which is OK given that it was an academic effort. It has also
> >>> provided some entertainment, for example, an audience member asking me
> >>> if I was aware of this work in a linguistic-kill-shot manner. ;-)
> >>>
> >>> So are we finally seeing this effect in the wild?
> >>
> >> This patch set is also targeting a synthetic test I wrote to see if I
> >> could reproduce a preemption problem. I know of several instances over
> >> the years where my teams (mainly at Google) were trying to resolve
> >> spinlock preemption inside virtual machines by boosting vCPU threads.
> >> In the spirit of RCU performance and VMs, we should probably optimize
> >> node locking IMO, but I do see your point of view about optimizing for
> >> real-world use cases as well.
> >
> > Also taking care of all spinlocks instead of doing large numbers of
> > per-spinlock workarounds would be good. There are a *lot* of spinlocks
> > in the Linux kernel!
>
> I wouldn't call it a workaround yet. Avoiding lock contention by using
> per-CPU lists is an optimization we have done before, right? (For
> example, the synthetic RCU callback-flooding use case, where we used a
> per-CPU list.) We can call it defensive programming, if you will. ;-)
> Especially in the scheduler hot path where we are blocking/preempting.
> Again, I'm not saying we should do it for this case, since we are still
> studying the issue, but the mere fact that we are optimizing a spinlock
> we acquire *a lot* shouldn't be categorized as a workaround, in my
> opinion.
>
> This blocking is even more likely on preempt RT in read-side critical
> sections. Again, I'm not saying that we should do this optimization, but
> I don't think we can ignore it. At least not based on the data I have so
> far.
If the main motivation is vCPU preemption, I consider this to be a
workaround for the lack of awareness of guest-OS locks by the host OS.
If there is some other reasonable way of generating contention on
this lock, then per-CPU locking is one specific way of addressing that
contention. As is reducing the value of CONFIG_RCU_FANOUT_LEAF and who
knows what all else.
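(For concreteness, a minimal sketch of that knob, assuming a kernel built
with RCU_EXPERT enabled; the value 1 additionally needs the range change
in the diff quoted further down:

	CONFIG_RCU_EXPERT=y
	CONFIG_RCU_FANOUT_LEAF=1

There is also the rcutree.rcu_fanout_leaf= boot parameter for trying
values within the accepted range without rebuilding.)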
> >> What bothers me about the current state of affairs is that even
> >> without any grace period in progress, any task blocking in an RCU
> >> read-side critical section will take an (almost-)global lock that is
> >> shared with other CPUs that might also be preempting/blocking RCU
> >> readers. Further, if this happens on a vCPU that was preempted while
> >> holding the node lock, then every other vCPU thread that blocks in an
> >> RCU critical section will also block on that lock, amplifying the
> >> slowdown from the vCPU preemption. My preference would be to keep the
> >> readers fast while moving the overhead to the slow path (the overhead
> >> being promoting the blocked tasks at the right time). In fact, in
> >> these patches, I go directly to the node list if there is a grace
> >> period in progress.
> >
> > Not "(almost-)global"!
> >
> > That lock replicates itself automatically with increasing numbers of CPUs.
> > That value used to cover the full (at the time, 32-bit) cpumask, but we
> > decreased it to 16 based on performance feedback from Andi Kleen back in
> > the day.
> > If we are seeing real-world contention on that lock in real-world
> > workloads on real-world systems, further adjustments could be made,
> > either reducing CONFIG_RCU_FANOUT_LEAF further or offloading the lock,
> > where your series is one example of the latter.
>
> I meant it is global or almost global depending on the number of CPUs. So for
> example on an 8 CPU system with the default fanout, it is a global lock,
> correct?
Yes, but only assuming the default CONFIG_RCU_FANOUT_LEAF value of
16, or some other value of 8 or larger. But in that case, there are
only 8 CPUs contending for that lock, so is there really a problem?
(In the absence of vCPU contention, that is.) And the default value
can be changed if needed.
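(Back-of-the-envelope, purely for illustration: the number of leaf
rcu_node structures is roughly nr_cpu_ids divided by the leaf fanout,
rounded up, so:

	64 CPUs, CONFIG_RCU_FANOUT_LEAF=16  ->  4 leaf locks, <= 16 CPUs each
	 8 CPUs, CONFIG_RCU_FANOUT_LEAF=16  ->  1 leaf lock shared by all 8
	 8 CPUs, CONFIG_RCU_FANOUT_LEAF=1   ->  8 leaf locks, one per CPU

These numbers are mine, for illustration, not measurements from this
thread.)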
> > I could easily believe that the vCPU preemption problem needs to be
> > addressed, but doing so on a per-spinlock basis would lead to greatly
> > increased complexity throughout the kernel, not just RCU.
>
> I agree with this. I was not intending to solve this for the entire kernel, at
> first at least.
If addressing vCPU contention is the goal, how many locks must be
individually adjusted before solving it for the whole kernel becomes the
easier and less complex option?
> >> About deferred preemption, I believe Steven Rostedt was at one point
> >> looking at that for VMs, but that effort stalled because Peter was
> >> concerned that it would mess up the scheduler. The idea (AFAIU) is to
> >> use the rseq page to communicate locking information between vCPU
> >> threads and the host, and then let the host avoid vCPU preemption -
> >> but the scheduler needs to do something with that information.
> >> Otherwise, it's no use.
> >
> > Has deferred preemption for userspace locking also stalled? If not,
> > then the scheduler's support for userspace should apply directly to
> > guest OSes, right?
>
> I don't think any userspace locking optimizations for preemption have
> made it upstream (AFAIK). I know there were efforts, but I could be out
> of date there. I think the devil is in the details as well, because
> userspace optimizations cannot always be applied to guests in my
> experience. The VM-exit path and the syscall entry/exit paths are quite
> different, including the API boundary.
Thomas and Steve are having another go at the userspace portion of
this problem. Should that make it in, the guest-OS portion might not
be that big an ask.
> >>> Also if so, would the following rather simpler patch do the same trick,
> >>> if accompanied by CONFIG_RCU_FANOUT_LEAF=1?
> >>>
> >>> ------------------------------------------------------------------------
> >>>
> >>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> >>> index 6a319e2926589..04dbee983b37d 100644
> >>> --- a/kernel/rcu/Kconfig
> >>> +++ b/kernel/rcu/Kconfig
> >>> @@ -198,9 +198,9 @@ config RCU_FANOUT
> >>>
> >>> config RCU_FANOUT_LEAF
> >>> int "Tree-based hierarchical RCU leaf-level fanout value"
> >>> - range 2 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
> >>> - range 2 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
> >>> - range 2 3 if RCU_STRICT_GRACE_PERIOD
> >>> + range 1 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
> >>> + range 1 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
> >>> + range 1 3 if RCU_STRICT_GRACE_PERIOD
> >>> depends on TREE_RCU && RCU_EXPERT
> >>> default 16 if !RCU_STRICT_GRACE_PERIOD
> >>> default 2 if RCU_STRICT_GRACE_PERIOD
> >>>
> >>> ------------------------------------------------------------------------
> >>>
> >>> This passes a quick 20-minute rcutorture smoke test. Does it provide
> >>> similar performance benefits?
> >>
> >> I tried this out, and it also brings down the contention and solves
> >> the problem I saw (in testing so far).
> >>
> >> Would this also work if the test had grace-period init/cleanup racing
> >> with preempted RCU read-side critical sections? I'm doing longer tests
> >> now to see how this performs under GP stress, versus my solution. I am
> >> also seeing that with just the node lists, not the per-CPU lists, there
> >> is a dramatic throughput drop after some amount of time, but I can't
> >> explain it. And I do not see this with the per-CPU list solution (I'm
> >> currently testing whether I see the same throughput drop with the
> >> fanout solution you proposed).
> >
> > Might the throughput drop be due to increased load on the host?
>
> The load is constant with the benchmark, and the data is repeatable and
> consistent, so random load on the host is unlikely to be the cause.
So you have a system with the various background threads corralled or
disabled?
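(For reference, one common way to corral things for this kind of
measurement is something along these lines, with the CPU numbers being
purely illustrative and dependent on the machine under test:

	isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7

on the host kernel command line, plus pinning the benchmark threads onto
the isolated CPUs with taskset or a cpuset.)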
> > Another possibility is that tasks/vCPUs got shuffled so as to increase
> > the probability of preemption.
> >
> > Also, doesn't your patch also cause the grace-period kthread to acquire
> > that per-CPU lock, thus also possibly resulting in contention, vCPU
> > preemption, and so on?
>
> Yes, I'm tracing it more. Even with baseline (without these patches), I
> see this throughput drop, so it is worth investigating. I think it is
> possibly something like a lock convoy forming: if I don't use RNP
> locking, the convoy disappears and the throughput is completely stable,
> which tells me the RNP lock has something to do with it. I also measured
> the exact RNP lock times and counted the number of contentions, so I am
> not really guessing here. The RNP lock is contended consistently. I
> think it's a great idea to extend this lock-contention measurement to the
> run-queue locks as well, to see how they are doing (or even to all locks,
> as you mentioned) - at least to confirm the theory that the same test
> severely contends other locks as well.
Is the RNP lock contended under non-overload conditions? If I remember
correctly, you were running 2x CPU overload. Is the RNP lock contended
in bare-metal kernels?
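(One way to get a kernel-wide view of contention, rather than
hand-instrumenting each lock, might be a kernel built with
CONFIG_LOCK_STAT=y, roughly:

	echo 1 > /proc/sys/kernel/lock_stat	# enable collection
	echo 0 > /proc/lock_stat		# clear old statistics
	# ... run the benchmark ...
	grep rcu_node /proc/lock_stat		# rq locks show up here too

The exact lock-class names may differ, so treat the grep pattern as a
starting point.)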
Thanx, Paul
> >> I'm also wondering whether relying on the user to set FANOUT_LEAF to 1
> >> is reasonable, considering this is not a default. Are you suggesting
> >> defaulting to this for small systems? If not, then I guess the
> >> optimization will not be enabled by default. Eventually, if we are
> >> moving forward with this patch set's approach, I will remove the config
> >> option for the per-CPU blocked lists altogether so that it is enabled
> >> by default. That's kind of my plan if we agree on this, but it is just
> >> at the RFC stage :).
> >
> > Right now, we are experimenting, so the usability issue is less pressing.
> > Once we find out what is really going on for real-world systems, we
> > can make adjustments if and as appropriate, said adjustments including
> > usability.
>
> Sure, thanks.
>
> - Joel
>