Hi Paul,

On 1/5/2026 11:46 AM, Paul E. McKenney wrote:
> On Fri, Jan 02, 2026 at 07:23:29PM -0500, Joel Fernandes wrote:
>> When a task is preempted while holding an RCU read-side lock, the kernel
>> must track it on the rcu_node's blocked task list. This requires acquiring
>> rnp->lock, which is shared by all CPUs in that node's subtree.
>>
>> Posting this as RFC for early feedback. There could be bugs lurking,
>> especially related to expedited GPs which I have not yet taken a close
>> look at. Several TODOs are added. It passed light TREE03 rcutorture
>> testing.
>>
>> On systems with 16 or fewer CPUs, the RCU hierarchy often has just a single
>> rcu_node, making rnp->lock effectively a global lock for all blocked task
>> operations. Every context switch where a task holds an RCU read-side lock
>> contends on this single lock.
>>
>> Enter Virtualization
>> --------------------
>> In virtualized environments, the problem becomes dramatically worse due to
>> vCPU preemption. Research from USENIX ATC'17 ("The RCU-Reader Preemption
>> Problem in VMs" by Gopinath and Paul McKenney) [1] explores the issue that
>> RCU reader preemption in VMs causes multi-second latency spikes and huge
>> increases in grace period duration.
>>
>> When a vCPU is preempted by the hypervisor while holding rnp->lock, other
>> vCPUs spin waiting for a lock holder that isn't even running. In testing
>> with host RT preemptors to inject vCPU preemption, lock hold times extended
>> from ~4us to over 4000us - a 1000x increase.
>>
>> The Solution
>> ------------
>> This series introduces per-CPU lists for tracking blocked RCU readers. The
>> key insight is that when no grace period is active, blocked tasks complete
>> their critical sections before any rnp locking is really required.
>>
>> 1. Fast path: At context switch, add the task only to the
>>    per-CPU list - no rnp->lock needed.
>>
>> 2. Promotion on demand: When a grace period starts, promote tasks from
>>    per-CPU lists to the rcu_node list.
>>
>> 3. Normal path: If a grace period is already waiting, tasks go directly
>>    to the rcu_node list as before.
>>
>> Results
>> -------
>> Testing with 64 reader threads under vCPU preemption (from 32 host SCHED_FIFO
>> preemptors), 100 runs each. Throughput is measured in read lock/unlock
>> iterations per second.
>>
>>                         Baseline        Optimized
>> Mean throughput         66,980 iter/s   97,719 iter/s   (+46%)
>> Lock hold time (mean)   1,069 us        ~0 us
> 
> Excellent performance improvement!

Thanks. :)
> It would be good to simplify the management of the blocked-tasks lists,
> and to make it more exact, as in never unnecessarily priority-boost
> a task.  But it is not like people have been complaining, at least not
> to me.  And earlier attempts in that direction added more mess than
> simplification.  :-(

Interesting. I might look into the boosting logic to see whether we can avoid
boosting certain tasks depending on whether they help the grace period complete
or not. Thank you for the suggestion.

>> The optimized version maintains stable performance with essentially zero
>> rnp->lock overhead.
>>
>> rcutorture Testing
>> ------------------
>> TREE03 testing with rcutorture passed without RCU or hotplug errors. More
>> testing is in progress.
>>
>> Note: I have added a CONFIG_RCU_PER_CPU_BLOCKED_LISTS option to guard the
>> feature, but the plan is to eventually turn it on all the time.
> 
> Yes, Aravinda, Gopinath, and I did publish that paper back in the day
> (with Aravinda having done almost all the work), but it was an artificial
> workload.  Which is OK given that it was an academic effort.  It has also
> provided some entertainment, for example, an audience member asking me
> if I was aware of this work in a linguistic-kill-shot manner.  ;-)
> 
> So are we finally seeing this effect in the wild?

This patch set is also targeting a synthetic test I wrote to see if I could
reproduce a preemption problem. I know of several instances over the years where
my teams (mainly at Google) were trying to resolve spinlock preemption inside
virtual machines by boosting vCPU threads. In the spirit of RCU performance in
VMs, we should probably optimize node locking IMO, but I do see your point of
view about optimizing for real-world use cases as well.

What bothers me about the current state of affairs is that even without any
grace period in progress, any task blocking in an RCU read-side critical section
will take an (almost-)global lock that is shared with other CPUs that might also
be preempting/blocking RCU readers. Further, if this happens on a vCPU that gets
preempted while holding the node lock, then every other vCPU thread that blocks
in an RCU critical section will also block on that lock, slowing down context
switching inside the VM. My preference would be to keep the readers fast while
moving the overhead to the slow path (the overhead being promoting the blocked
tasks at the right time). In fact, in these patches, I'm going directly to the
node list if there is a grace period in progress.
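
To make the shape of that concrete, here is a rough sketch of what I mean (this
is not the actual patch: names like blkd_cpu_list, rcu_note_preempted_reader()
and rcu_promote_blkd_tasks() are placeholders I made up for this email, and the
races, IRQ masking, boosting and expedited-GP handling that the real series has
to deal with are glossed over):

------------------------------------------------------------------------

/*
 * Hypothetical per-CPU list of readers that blocked with no GP pending.
 * Assume each CPU's list head is INIT_LIST_HEAD()ed at boot.
 */
static DEFINE_PER_CPU(struct list_head, blkd_cpu_list);

/* Called at context switch when a task blocks within an RCU reader. */
static void rcu_note_preempted_reader(struct task_struct *t,
				      struct rcu_node *rnp)
{
	if (!rcu_gp_in_progress()) {
		/* Fast path: no GP is waiting on us, so defer rnp->lock. */
		list_add(&t->rcu_node_entry, this_cpu_ptr(&blkd_cpu_list));
		return;
	}

	/* Normal path: a GP may be waiting, queue on the rcu_node list. */
	raw_spin_lock_rcu_node(rnp);
	list_add(&t->rcu_node_entry, &rnp->blkd_tasks);
	raw_spin_unlock_rcu_node(rnp);
}

/* At GP start: promote deferred readers so the new GP waits on them. */
static void rcu_promote_blkd_tasks(struct rcu_node *rnp)
{
	int cpu;

	raw_spin_lock_rcu_node(rnp);
	for_each_leaf_node_possible_cpu(rnp, cpu)
		list_splice_init(per_cpu_ptr(&blkd_cpu_list, cpu),
				 &rnp->blkd_tasks);
	raw_spin_unlock_rcu_node(rnp);
}

------------------------------------------------------------------------

The point is that the rnp->lock acquisitions move out of the context-switch
path and into grace-period start, which is already a slow path.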

> The main point of this patch series is to avoid lock contention due to
> vCPU preemption, correct?  If so, will we need similar work on the other
> locks in the Linux kernel, both within RCU and elsewhere?  I vaguely
> recall your doing some work along those lines a few years back, and
> maybe Thomas Gleixner's deferred-preemption work could help with this.
> Or not, who knows?  Keeping the hypervisor informed of lock state is
> not necessarily free.

Yes, I did some work on this at Google, but it turned out to be a very
fragmented effort in terms of where (which subsystem - KVM, the scheduler, etc.)
we should do the priority boosting of vCPU threads. In the end, we ended up with
an internal prototype that worked pretty well but was not upstreamable, and we
only had time to take it to production (a lesson I learned there is that we
should probably work on upstream solutions first, but life is not always that
easy).

About deferred preemption, I believe Steven Rostedt was at one point looking at
that for VMs, but that effort stalled as Peter was concerned that it would mess
up the scheduler. The idea (AFAIU) is to use the rseq page to communicate
locking information between vCPU threads and the host and then let the host
avoid vCPU preemption - but the scheduler needs to do something with that
information, otherwise it's no use.
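
The guest/host contract, as I understand it, would be roughly along the lines of
the sketch below (an entirely hypothetical layout with made-up names -
vcpu_preempt_hint, hypothetical_yield_to_host() - just to illustrate the shape
of the idea, not the actual rseq-based proposal):

------------------------------------------------------------------------

/*
 * Hypothetical per-vCPU page shared between guest and host.  The guest
 * publishes a "don't preempt me right now" hint; the host scheduler may
 * honor it by briefly deferring preemption of the vCPU thread.
 */
struct vcpu_preempt_hint {
	unsigned int	in_critical_section;	/* written by the guest */
	unsigned int	preempt_was_deferred;	/* written by the host  */
};

static struct vcpu_preempt_hint *vcpu_hint;	/* mapped shared page */

static inline void vcpu_lock_hint_enter(void)
{
	WRITE_ONCE(vcpu_hint->in_critical_section, 1);
}

static inline void vcpu_lock_hint_exit(void)
{
	WRITE_ONCE(vcpu_hint->in_critical_section, 0);
	/* If the host held off a preemption for us, yield back promptly. */
	if (READ_ONCE(vcpu_hint->preempt_was_deferred))
		hypothetical_yield_to_host();	/* placeholder for a yield hypercall */
}

------------------------------------------------------------------------

And as noted above, the host scheduler still has to actually consume
in_critical_section at preemption time for any of this to help.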

> Also if so, would the following rather simpler patch do the same trick,
> if accompanied by CONFIG_RCU_FANOUT_LEAF=1?
> 
> ------------------------------------------------------------------------
> 
> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> index 6a319e2926589..04dbee983b37d 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -198,9 +198,9 @@ config RCU_FANOUT
>  
>  config RCU_FANOUT_LEAF
>       int "Tree-based hierarchical RCU leaf-level fanout value"
> -     range 2 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
> -     range 2 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
> -     range 2 3 if RCU_STRICT_GRACE_PERIOD
> +     range 1 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
> +     range 1 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
> +     range 1 3 if RCU_STRICT_GRACE_PERIOD
>       depends on TREE_RCU && RCU_EXPERT
>       default 16 if !RCU_STRICT_GRACE_PERIOD
>       default 2 if RCU_STRICT_GRACE_PERIOD
> 
> ------------------------------------------------------------------------
> 
> This passes a quick 20-minute rcutorture smoke test.  Does it provide
> similar performance benefits?

I tried this out, and it also brings down the contention and solves the problem
I saw (in testing so far).

Would this also work if the test had grace-period init/cleanup racing with
preempted RCU read-side critical sections? I'm doing longer tests now to see how
this performs under GP stress, versus my solution. I am also seeing a dramatic
throughput drop after some amount of time with just the node lists (no per-CPU
lists), which I can't explain yet. I do not see this with the per-CPU list
solution (I'm currently testing whether the same throughput drop shows up with
the fanout solution you proposed).

I'm also wondering whether relying on the user to set FANOUT_LEAF to 1 is
reasonable, considering it is not the default. Are you suggesting defaulting to
this for small systems? If not, then I guess the optimization will not be
enabled by default. With my patch set, if we end up moving forward with that
approach, I will eventually remove the config option for the per-CPU blocked
lists altogether so that it is enabled by default. That's kind of my plan if we
agree on this, but it is still just at the RFC stage :).
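
To make the "defaulting to this for small systems" question concrete, something
along these lines (untested, purely illustrative on top of your range change,
with an arbitrary NR_CPUS threshold of 16) is what I had in mind:

------------------------------------------------------------------------

config RCU_FANOUT_LEAF
	int "Tree-based hierarchical RCU leaf-level fanout value"
	range 1 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
	range 1 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
	range 1 3 if RCU_STRICT_GRACE_PERIOD
	depends on TREE_RCU && RCU_EXPERT
	default 1 if NR_CPUS <= 16 && !RCU_STRICT_GRACE_PERIOD
	default 16 if !RCU_STRICT_GRACE_PERIOD
	default 2 if RCU_STRICT_GRACE_PERIOD

------------------------------------------------------------------------

Since the first matching "default" wins in Kconfig, small systems would then
pick up a leaf fanout of 1 without the user having to know about it.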

thanks,

 - Joel


