On Tue, Jan 06, 2026 at 03:19:24PM -0500, Steven Rostedt wrote:
> On Tue, 6 Jan 2026 11:17:19 -0800
> "Paul E. McKenney" <[email protected]> wrote:
>
> > > Interesting. I might look into the boosting logic to see whether we
> > > can avoid boosting certain tasks depending on whether they help the
> > > grace period complete or not. Thank you for the suggestion.
> >
> > Just so you know, all of my simplification efforts thus far have instead
> > made it more complex, but who knows what I might have been missing?
>
> Maybe you are too smart to make it simple? ;-)
There is the old adage that the complexity of any software artifact
grows to just barely exceed the capabilities of those working on it. ;-)
But all that aside, getting fresh eyes on it would be a good thing.
> > I could easily believe that the vCPU preemption problem needs to be
> > addressed, but doing so on a per-spinlock basis would lead to greatly
> > increased complexity throughout the kernel, not just RCU.
>
> Agreed.
>
> > > > The main point of this patch series is to avoid lock contention due to
> > > > vCPU preemption, correct? If so, will we need similar work on the other
> > > > locks in the Linux kernel, both within RCU and elsewhere? I vaguely
> > > > recall your doing some work along those lines a few years back, and
> > > > maybe Thomas Gleixner's deferred-preemption work could help with this.
> > > > Or not, who knows? Keeping the hypervisor informed of lock state is
> > > > not necessarily free.
> > >
> > > Yes, I did some work on this at Google, but it turned out to be a very
> > > fragmented effort in terms of where (which subsystem - KVM, scheduler,
> > > etc.) we should do the priority boosting of vCPU threads. In the end,
> > > we just ended up with an internal prototype that was not upstreamable
> > > but worked pretty well, and we only had time for production (a lesson
> > > I learned there is that we should probably work on upstream solutions
> > > first, but life is not that easy sometimes).
> >
> > Which is one reason deferred preemption would be attractive.
>
> Yes. That's why I've been pushing it.
Very good to hear!
> > > About deferred preemption, I believe Steven Rostedt at one point was
> > > looking at that for VMs, but that effort stalled, as Peter was
> > > concerned that doing so would mess up the scheduler. The idea (AFAIU)
> > > is to use the rseq page to communicate locking information between
> > > vCPU threads and the host, and then let the host avoid vCPU
> > > preemption - but the scheduler needs to do something with that
> > > information. Otherwise, it's no use.
> >
> > Has deferred preemption for userspace locking also stalled? If not,
> > then the scheduler's support for userspace should apply directly to
> > guest OSes, right?
>
> No, the user space deferred preemption is still moving along nicely (I
> believe Thomas has completed most of it). The issue here is that the
> deferral happens on the path back to user space. That is a different
> location than the path back to the guest, so the logic needs to be in
> that path too.
OK, got it, thank you!
> One thing that Peter Zijlstra pushed was limiting the amount of time that
> the deferred wait may last. He says user space spinlocks are a bad design,
> but it has been proven that they are currently the most efficient approach
> for very short critical sections, that is, where the critical section is
> shorter than the cost of a system call. Thus, he forces the deferred
> scheduling to be at most 50us (he has also suggested less than that).
>
> But when it comes to the guest, where kernel spinlocks are effectively
> user space spinlocks from the host's perspective, and can be held for
> more than 50us, I would like a way to have the guest defer the
> scheduling for even longer than user space spinlocks do.
I would *hope* that the rcu_node ->lock instances are held for less
than 50us! At least in the absence of SMIs, NMIs, or vCPU preemption on
systems with at least 100MHz core CPU clock frequency. Besides, SMIs,
NMIs, and vCPU preemption affect userspace locks, as do IRQs and
softirqs. Of course, hope springs eternal...
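For concreteness, the handshake Steven describes might look something
like the sketch below: the lock holder sets a flag in memory shared
with the scheduler, which may then grant a bounded (say 50us) reprieve
before preempting. All names here are made up for illustration; this is
not the actual rseq ABI or any proposed interface.

```c
/*
 * Hypothetical sketch of a deferred-preemption handshake.  The lock
 * holder advertises its critical section via shared memory; if the
 * scheduler deferred a preemption on its behalf, it yields promptly
 * on exit.  All structure and function names are invented.
 */
#include <sched.h>
#include <stdatomic.h>

struct defer_area {
	/* Nonzero while a critical section is in progress. */
	atomic_int in_critical_section;
	/* Set by the scheduler if it wanted to preempt us meanwhile. */
	atomic_int preempt_was_deferred;
};

/* Imagined to have been registered with the scheduler/host at startup. */
static struct defer_area area;

static inline void critical_section_enter(void)
{
	atomic_store_explicit(&area.in_critical_section, 1,
			      memory_order_relaxed);
}

static inline void critical_section_exit(void)
{
	atomic_store_explicit(&area.in_critical_section, 0,
			      memory_order_release);
	/* If preemption was deferred on our behalf, give back the CPU. */
	if (atomic_exchange(&area.preempt_was_deferred, 0))
		sched_yield();	/* hypothetical "thank you" to the scheduler */
}
```

The 50us cap would then be enforced on the scheduler side: it tolerates
a set in_critical_section flag only up to the deadline, after which it
preempts anyway, which is what makes long guest-kernel critical sections
the harder case.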
Thanx, Paul