On Tue, Jan 06, 2026 at 10:08:51AM -0500, Joel Fernandes wrote:
> 
> 
> On 1/5/2026 7:55 PM, Joel Fernandes wrote:
> >> Also if so, would the following rather simpler patch do the same trick,
> >> if accompanied by CONFIG_RCU_FANOUT_LEAF=1?
> >>
> >> ------------------------------------------------------------------------
> >>
> >> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> >> index 6a319e2926589..04dbee983b37d 100644
> >> --- a/kernel/rcu/Kconfig
> >> +++ b/kernel/rcu/Kconfig
> >> @@ -198,9 +198,9 @@ config RCU_FANOUT
> >>  
> >>  config RCU_FANOUT_LEAF
> >>    int "Tree-based hierarchical RCU leaf-level fanout value"
> >> -  range 2 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
> >> -  range 2 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
> >> -  range 2 3 if RCU_STRICT_GRACE_PERIOD
> >> +  range 1 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
> >> +  range 1 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
> >> +  range 1 3 if RCU_STRICT_GRACE_PERIOD
> >>    depends on TREE_RCU && RCU_EXPERT
> >>    default 16 if !RCU_STRICT_GRACE_PERIOD
> >>    default 2 if RCU_STRICT_GRACE_PERIOD
> >>
> >> ------------------------------------------------------------------------
> >>
> >> This passes a quick 20-minute rcutorture smoke test.  Does it provide
> >> similar performance benefits?
> >
> > I tried this out, and it also brings down the contention and solves the
> > problem I saw (in testing so far).
> > 
> > Would this also work if the test had grace-period init/cleanup racing with
> > preempted RCU read-side critical sections? I'm running longer tests now to
> > see how this performs under GP stress versus my solution. I am also seeing
> > that with just the node lists (not the per-CPU list) there is a dramatic
> > throughput drop after some amount of time, but I can't explain it. I do
> > not see this with the per-CPU list solution (I'm currently testing whether
> > the same throughput drop shows up with the fanout solution you proposed).
> > 
> > I'm also wondering whether relying on the user to set FANOUT_LEAF to 1 is
> > reasonable, considering that this is not the default. Are you suggesting
> > defaulting to it for small systems? If not, then I guess the optimization
> > will not be enabled by default. Eventually, if we move forward with this
> > approach, I will remove the config option for the per-CPU blocked list
> > altogether so that it is enabled by default. That is roughly my plan if we
> > agree on it, but this is still at the RFC stage 🙂.
> 
> So the fanout solution works great when there are grace periods in progress. I
> see no throughput drop, and performance with read-side critical sections is
> consistent. However, if we switch to not having grace periods continuously in
> progress, the throughput drops quite a bit here (-30%). I can't explain that,
> but I do not see the issue with per-CPU lists.

Might this be due to the change in number of tasks?  Not having the
thread that continuously runs grace periods might be affecting scheduling
decisions, and with CPU overcommit, those scheduling decisions can cause
large changes in throughput.  Plus there are other spinlocks that might
be subject to vCPU preemption, including the various scheduler spinlocks.
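
One way to see which locks are actually hot in each configuration is lock
statistics.  A rough recipe, assuming a debug kernel built with
CONFIG_LOCK_STAT=y (which adds overhead of its own, so treat the absolute
throughput numbers from such a kernel with suspicion):

------------------------------------------------------------------------

	echo 0 > /proc/lock_stat		# clear old statistics
	echo 1 > /proc/sys/kernel/lock_stat	# start collecting
	# ... run the benchmark ...
	echo 0 > /proc/sys/kernel/lock_stat	# stop collecting
	cat /proc/lock_stat			# per-lock-class contention stats

------------------------------------------------------------------------

If the rcu_node ->lock (or your per-CPU list locks) shows large contention
and wait-time numbers in one configuration but not the other, that would
narrow things down; if the scheduler's locks dominate in both, that would
point back at the overcommit theory.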

> With the per-CPU list scheme, blocking does not involve the node at all, as
> long as there is no grace period in progress. So, in that sense, the per-CPU
> blocked list is completely detached from RCU - it is a bit like lazy RCU, in
> the sense that instead of a callback it is the blocking task sitting on a
> per-CPU list, relieving RCU of the burden.

Unless I am seriously misreading your patch, the grace-period kthread still
acquires your per-CPU locks.  Also, reducing the number of grace periods
should *reduce* contention on the rcu_node ->lock.

> Maybe the extra layer of the node tree (with fanout == 1) somehow adds
> unnecessary overhead that does not exist with per-CPU lists? Even though there
> is this throughput drop, it still does better than the baseline, which has a
> single shared rcu_node.
> 
> Based on this, I would say the per-CPU blocked list is still worth doing.
> Thoughts?

I think that we need to understand the differences before jumping
to conclusions.  There are a lot of possible reasons for changes in
throughput, especially given the CPU overload.  After all, queuing
theory suggests high variance in that case, possibly even on exactly
the same setup.
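
To make the geometry concrete, here is a quick userspace sketch of the
arithmetic (plain ceiling division, *not* the kernel's rcu_init_geometry(),
which has additional constraints, so take it only as an approximation):

------------------------------------------------------------------------

#include <stdio.h>

/*
 * Rough rcu_node geometry: leaves, total nodes, and levels for a given
 * CPU count and fanout settings.  Each leaf has its own ->lock and
 * ->blkd_tasks list, so the leaf count is what matters for blocked-task
 * queuing.
 */
static void geometry(int nr_cpus, int fanout, int fanout_leaf)
{
	int leaves = (nr_cpus + fanout_leaf - 1) / fanout_leaf;
	int nodes = leaves;
	int levels = 1;
	int width = leaves;

	while (width > 1) {	/* add interior levels until a single root */
		width = (width + fanout - 1) / fanout;
		nodes += width;
		levels++;
	}
	printf("%3d CPUs, FANOUT=%2d, FANOUT_LEAF=%2d: %3d leaves, %3d nodes, %d levels\n",
	       nr_cpus, fanout, fanout_leaf, leaves, nodes, levels);
}

int main(void)
{
	geometry(16, 64, 16);	/* 64-bit defaults on a 16-CPU guest: single node */
	geometry(16, 64, 1);	/* same guest with RCU_FANOUT_LEAF=1 */
	geometry(256, 64, 1);	/* larger system for comparison */
	return 0;
}

------------------------------------------------------------------------

With the default leaf fanout, a 16-CPU guest gets a single rcu_node; with
leaf fanout of 1 it gets one leaf (and thus one ->blkd_tasks list and ->lock)
per CPU plus one extra interior level.  Blocked-task queuing stays on the
leaf either way, so my guess is that any extra cost from that added level
would show up in grace-period initialization/cleanup and in propagating
quiescent states up the tree, which we could measure directly.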

                                                        Thanx, Paul
