On Fri, Jan 02, 2026 at 07:23:29PM -0500, Joel Fernandes wrote:
> When a task is preempted while holding an RCU read-side lock, the kernel
> must track it on the rcu_node's blocked task list. This requires acquiring
> rnp->lock shared by all CPUs in that node's subtree.
> 
> Posting this as RFC for early feedback. There could be bugs lurking,
> especially related to expedited GPs which I have not yet taken a close
> look at. Several TODOs are added. It passed light TREE03 rcutorture
> testing.
> 
> On systems with 16 or fewer CPUs, the RCU hierarchy often has just a single
> rcu_node, making rnp->lock effectively a global lock for all blocked task
> operations. Every context switch where a task holds an RCU read-side lock
> contends on this single lock.
> 
> Enter Virtualization
> --------------------
> In virtualized environments, the problem becomes dramatically worse due to
> vCPU preemption. Research from USENIX ATC'17 ("The RCU-Reader Preemption
> Problem in VMs" by Aravinda Prasad, K. Gopinath, and Paul E. McKenney) [1]
> shows that RCU reader preemption in VMs causes multi-second latency spikes
> and large increases in grace period duration.
> 
> When a vCPU is preempted by the hypervisor while holding rnp->lock, other
> vCPUs spin waiting for a lock holder that isn't even running. In testing
> that used host RT preemptors to inject vCPU preemption, lock hold times
> grew from ~4us to over 4000us - a 1000x increase.
> 
> The Solution
> ------------
> This series introduces per-CPU lists for tracking blocked RCU readers. The
> key insight is that when no grace period is active, blocked tasks complete
> their critical sections before any rnp->lock operation is actually needed.
> 
> 1. Fast path: At context switch, add the task only to the
>    per-CPU list - no rnp->lock needed.
> 
> 2. Promotion on demand: When a grace period starts, promote tasks from
>    per-CPU lists to the rcu_node list.
> 
> 3. Normal path: If a grace period is already waiting, tasks go directly
>    to the rcu_node list as before.
> 
> Results
> -------
> Testing with 64 reader threads under vCPU preemption (from 32 host
> SCHED_FIFO preemptors), 100 runs each. Throughput is measured in read
> lock/unlock iterations per second.
> 
>                         Baseline        Optimized
> Mean throughput         66,980 iter/s   97,719 iter/s   (+46%)
> Lock hold time (mean)   1,069 us        ~0 us

Excellent performance improvement!

It would be good to simplify the management of the blocked-tasks lists,
and to make it more exact, as in never unnecessarily priority-boost
a task.  But it is not like people have been complaining, at least not
to me.  And earlier attempts in that direction added more mess than
simplification.  :-(

> The optimized version maintains stable performance with essentially zero
> rnp->lock overhead.
> 
> rcutorture Testing
> ------------------
> TREE03 rcutorture testing passed without RCU or hotplug errors. More
> testing is in progress.
> 
> Note: I have added a CONFIG_RCU_PER_CPU_BLOCKED_LISTS Kconfig option to
> guard the feature, but the plan is to eventually enable it unconditionally.

Yes, Aravinda, Gopinath, and I did publish that paper back in the day
(with Aravinda having done almost all the work), but it was an artificial
workload.  Which is OK given that it was an academic effort.  It has also
provided some entertainment, for example, an audience member asking me
if I was aware of this work in a linguistic-kill-shot manner.  ;-)

So are we finally seeing this effect in the wild?

The main point of this patch series is to avoid lock contention due to
vCPU preemption, correct?  If so, will we need similar work on the other
locks in the Linux kernel, both within RCU and elsewhere?  I vaguely
recall your doing some work along those lines a few years back, and
maybe Thomas Gleixner's deferred-preemption work could help with this.
Or not, who knows?  Keeping the hypervisor informed of lock state is
not necessarily free.

Also if so, would the following rather simpler patch do the same trick,
if accompanied by CONFIG_RCU_FANOUT_LEAF=1?

------------------------------------------------------------------------

diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 6a319e2926589..04dbee983b37d 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -198,9 +198,9 @@ config RCU_FANOUT
 
 config RCU_FANOUT_LEAF
        int "Tree-based hierarchical RCU leaf-level fanout value"
-       range 2 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
-       range 2 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
-       range 2 3 if RCU_STRICT_GRACE_PERIOD
+       range 1 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
+       range 1 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
+       range 1 3 if RCU_STRICT_GRACE_PERIOD
        depends on TREE_RCU && RCU_EXPERT
        default 16 if !RCU_STRICT_GRACE_PERIOD
        default 2 if RCU_STRICT_GRACE_PERIOD

------------------------------------------------------------------------

This passes a quick 20-minute rcutorture smoke test.  Does it provide
similar performance benefits?
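For reference, trying Paul's suggestion would amount to something like the
following guest .config fragment. This assumes the range patch above is
applied (the current Kconfig floor for RCU_FANOUT_LEAF is 2), and
RCU_FANOUT_LEAF is only visible under RCU_EXPERT:

```
CONFIG_RCU_EXPERT=y
CONFIG_RCU_FANOUT_LEAF=1
```

With a leaf fanout of 1, each CPU gets its own leaf rcu_node, so rnp->lock
is effectively per-CPU at the leaf level, at the cost of a deeper tree and
more combining overhead at the upper levels.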

                                                        Thanx, Paul

> [1] https://www.usenix.org/conference/atc17/technical-sessions/presentation/prasad
> 
> Joel Fernandes (14):
>   rcu: Add WARN_ON_ONCE for blocked flag invariant in exit_rcu()
>   rcu: Add per-CPU blocked task lists for PREEMPT_RCU
>   rcu: Early return during unlock for tasks only on per-CPU blocked list
>   rcu: Promote blocked tasks from per-CPU to rnp lists
>   rcu: Promote blocked tasks for expedited GPs
>   rcu: Promote per-CPU blocked tasks before checking for blocked readers
>   rcu: Promote late-arriving blocked tasks before reporting QS
>   rcu: Promote blocked tasks before QS report in force_qs_rnp()
>   rcu: Promote blocked tasks before QS report in
>     rcutree_report_cpu_dead()
>   rcu: Promote blocked tasks before QS report in rcu_gp_init()
>   rcu: Add per-CPU blocked list check in exit_rcu()
>   rcu: Skip per-CPU list addition when GP already started
>   rcu: Skip rnp addition when no grace period waiting
>   rcu: Remove checking of per-cpu blocked list against the node list
> 
>  include/linux/sched.h    |   4 +
>  kernel/fork.c            |   4 +
>  kernel/rcu/Kconfig       |  12 +++
>  kernel/rcu/tree.c        |  60 +++++++++--
>  kernel/rcu/tree.h        |  11 +-
>  kernel/rcu/tree_exp.h    |   5 +
>  kernel/rcu/tree_plugin.h | 211 +++++++++++++++++++++++++++++++++++----
>  kernel/rcu/tree_stall.h  |   4 +-
>  8 files changed, 279 insertions(+), 32 deletions(-)
> 
> 
> base-commit: f8f9c1f4d0c7a64600e2ca312dec824a0bc2f1da
> --
> 2.34.1
> 