When a task is preempted while in an RCU read-side critical section, the
kernel must track it on the rcu_node's blocked-task list. This requires
acquiring rnp->lock, which is shared by all CPUs in that node's subtree.
Posting this as an RFC for early feedback. There could be bugs lurking,
especially around expedited grace periods, which I have not yet looked at
closely. Several TODOs are added in the code. The series passed light
TREE03 rcutorture testing.
On systems with 16 or fewer CPUs, the RCU hierarchy often has just a single
rcu_node, making rnp->lock effectively a global lock for all blocked-task
operations. Every context switch in which the outgoing task is in an RCU
read-side critical section contends on this single lock.
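For reference, today's queuing path looks roughly like the following. This
is a heavily simplified paraphrase of rcu_note_context_switch() and
rcu_preempt_ctxt_queue() in kernel/rcu/tree_plugin.h, not the literal code;
deferred-QS handling and the gp_tasks/exp_tasks ordering are omitted:

/* Heavily simplified sketch of the current (baseline) behavior. */
void rcu_note_context_switch(bool preempt)
{
        struct task_struct *t = current;
        struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
        struct rcu_node *rnp = rdp->mynode;

        if (rcu_preempt_depth() > 0 &&
            !t->rcu_read_unlock_special.b.blocked) {
                /*
                 * Every preempted reader takes rnp->lock, which on small
                 * systems is the root (and only) rcu_node's lock.
                 */
                raw_spin_lock_rcu_node(rnp);
                t->rcu_read_unlock_special.b.blocked = true;
                t->rcu_blocked_node = rnp;
                /* Queue t on rnp->blkd_tasks; releases rnp->lock. */
                rcu_preempt_ctxt_queue(rnp, rdp);
        }
        /* ... report a quiescent state for this CPU, etc. ... */
}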
Enter Virtualization
--------------------
In virtualized environments, the problem becomes dramatically worse due to
vCPU preemption. The USENIX ATC'17 paper "The RCU-Reader Preemption Problem
in VMs" by Aravinda Prasad, K. Gopinath, and Paul E. McKenney [1] shows that
RCU reader preemption in VMs causes multi-second latency spikes and large
increases in grace-period duration.
When a vCPU is preempted by the hypervisor while holding rnp->lock, other
vCPUs spin waiting for a lock holder that is not even running. In testing
that used host RT preemptor tasks to inject vCPU preemption, lock hold times
grew from ~4us to over 4000us, a 1000x increase.
The Solution
------------
This series introduces per-CPU lists for tracking blocked RCU readers. The
key insight is that when no grace period is active, blocked tasks usually
complete their critical sections before any rnp locking is really required.
The scheme (see the sketch after the list below) is:
1. Fast path: At context switch, add the task only to the
   per-CPU list - no rnp->lock needed.
2. Promotion on demand: When a grace period starts, promote tasks from
per-CPU lists to the rcu_node list.
3. Normal path: If a grace period is already waiting, tasks go directly
to the rcu_node list as before.
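To make this concrete, here is a rough sketch of the intended logic. Names
such as blkd_tasks_pcpu, rcu_blocked_on_pcpu and rcu_promote_blocked_tasks()
are placeholders invented for this cover letter, not necessarily the
identifiers used in the patches, and the synchronization between the
promoting CPU and the list-owning CPU is glossed over here:

/* 1. + 3.: decide at context switch where to queue the blocked reader. */
static void rcu_preempt_queue_blocked_task(struct task_struct *t,
                                           struct rcu_data *rdp,
                                           struct rcu_node *rnp)
{
        if (!rcu_gp_in_progress()) {
                /*
                 * Fast path: no grace period is waiting on this reader,
                 * so park it on the per-CPU list. Interrupts are off and
                 * only this CPU's list is touched, so no rnp->lock needed.
                 */
                list_add(&t->rcu_node_entry, &rdp->blkd_tasks_pcpu);
                t->rcu_blocked_on_pcpu = true;  /* placeholder flag */
                return;
        }

        /* Normal path: a GP is already waiting, queue on the rnp list. */
        raw_spin_lock_rcu_node(rnp);
        rcu_preempt_ctxt_queue(rnp, rdp);       /* releases rnp->lock */
}

/* 2.: promotion on demand, when a new grace period starts. */
static void rcu_promote_blocked_tasks(struct rcu_node *rnp)
{
        int cpu;

        raw_spin_lock_rcu_node(rnp);
        for_each_leaf_node_possible_cpu(rnp, cpu) {
                struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);

                /*
                 * Move readers parked on the per-CPU list onto the
                 * rcu_node list so the new GP waits for them. The real
                 * code must synchronize with the owning CPU here.
                 */
                list_splice_init(&rdp->blkd_tasks_pcpu, &rnp->blkd_tasks);
        }
        raw_spin_unlock_rcu_node(rnp);
}

The point of the sketch is only the shape of the change: the common case
touches nothing but CPU-local state, and promotion takes rnp->lock once per
grace-period start rather than once per blocked reader.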
Results
-------
Testing used 64 reader threads under vCPU preemption from 32 host SCHED_FIFO
preemptors, 100 runs each. Throughput is measured in read lock/unlock
iterations per second.
                           Baseline           Optimized
  Mean throughput          66,980 iter/s      97,719 iter/s (+46%)
  Lock hold time (mean)    1,069 us           ~0 us
The optimized version maintains stable performance with essentially zero
rnp->lock overhead.
rcutorture Testing
------------------
The TREE03 rcutorture scenario passes with no RCU or hotplug errors. More
testing is in progress.
Note: I have added a CONFIG_RCU_PER_CPU_BLOCKED_LISTS Kconfig option to
guard the feature, but the plan is to eventually enable it unconditionally.
[1] https://www.usenix.org/conference/atc17/technical-sessions/presentation/prasad
Joel Fernandes (14):
rcu: Add WARN_ON_ONCE for blocked flag invariant in exit_rcu()
rcu: Add per-CPU blocked task lists for PREEMPT_RCU
rcu: Early return during unlock for tasks only on per-CPU blocked list
rcu: Promote blocked tasks from per-CPU to rnp lists
rcu: Promote blocked tasks for expedited GPs
rcu: Promote per-CPU blocked tasks before checking for blocked readers
rcu: Promote late-arriving blocked tasks before reporting QS
rcu: Promote blocked tasks before QS report in force_qs_rnp()
rcu: Promote blocked tasks before QS report in
rcutree_report_cpu_dead()
rcu: Promote blocked tasks before QS report in rcu_gp_init()
rcu: Add per-CPU blocked list check in exit_rcu()
rcu: Skip per-CPU list addition when GP already started
rcu: Skip rnp addition when no grace period waiting
rcu: Remove checking of per-cpu blocked list against the node list
include/linux/sched.h | 4 +
kernel/fork.c | 4 +
kernel/rcu/Kconfig | 12 +++
kernel/rcu/tree.c | 60 +++++++++--
kernel/rcu/tree.h | 11 +-
kernel/rcu/tree_exp.h | 5 +
kernel/rcu/tree_plugin.h | 211 +++++++++++++++++++++++++++++++++++----
kernel/rcu/tree_stall.h | 4 +-
8 files changed, 279 insertions(+), 32 deletions(-)
base-commit: f8f9c1f4d0c7a64600e2ca312dec824a0bc2f1da
--
2.34.1