On Tue, Jan 06, 2026 at 04:24:07PM -0500, Joel Fernandes wrote:
> 
> 
> On 1/6/2026 2:24 PM, Paul E. McKenney wrote:
> > On Tue, Jan 06, 2026 at 10:08:51AM -0500, Joel Fernandes wrote:
> >>
> >>
> >> On 1/5/2026 7:55 PM, Joel Fernandes wrote:
> >>>> Also if so, would the following rather simpler patch do the same trick,
> >>>> if accompanied by CONFIG_RCU_FANOUT_LEAF=1?
> >>>>
> >>>> ------------------------------------------------------------------------
> >>>>
> >>>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> >>>> index 6a319e2926589..04dbee983b37d 100644
> >>>> --- a/kernel/rcu/Kconfig
> >>>> +++ b/kernel/rcu/Kconfig
> >>>> @@ -198,9 +198,9 @@ config RCU_FANOUT
> >>>>  
> >>>>  config RCU_FANOUT_LEAF
> >>>>          int "Tree-based hierarchical RCU leaf-level fanout value"
> >>>> -        range 2 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
> >>>> -        range 2 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
> >>>> -        range 2 3 if RCU_STRICT_GRACE_PERIOD
> >>>> +        range 1 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
> >>>> +        range 1 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
> >>>> +        range 1 3 if RCU_STRICT_GRACE_PERIOD
> >>>>          depends on TREE_RCU && RCU_EXPERT
> >>>>          default 16 if !RCU_STRICT_GRACE_PERIOD
> >>>>          default 2 if RCU_STRICT_GRACE_PERIOD
> >>>>
> >>>> ------------------------------------------------------------------------
> >>>>
> >>>> This passes a quick 20-minute rcutorture smoke test.  Does it provide
> >>>> similar performance benefits?
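For illustration, and assuming the patch above is applied: per the
"depends on TREE_RCU && RCU_EXPERT" line shown in the quoted Kconfig,
getting a leaf fanout of one would require something along the lines of
this hypothetical .config fragment:

------------------------------------------------------------------------

CONFIG_RCU_EXPERT=y
CONFIG_RCU_FANOUT_LEAF=1

------------------------------------------------------------------------

The idea is that with a leaf fanout of one, each leaf rcu_node covers a
single CPU, so blocked readers on different CPUs queue under different
leaf ->lock instances instead of contending on a shared one.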
> >>>
> >>> I tried this out, and it also brings down the contention and solves
> >>> the problem I saw (in testing so far).
> >>>
> >>> Would this also work if the test had grace-period init/cleanup racing
> >>> with preempted RCU read-side critical sections? I'm doing longer tests
> >>> now to see how this performs under GP stress, versus my solution. I am
> >>> also seeing that with just the node lists, not the per-CPU lists, there
> >>> is a dramatic throughput drop after some amount of time, but I can't
> >>> explain it. And I do not see this with the per-CPU list solution (I'm
> >>> currently testing whether I see the same throughput drop with the
> >>> fan-out solution you proposed).
> >>>
> >>> I'm also wondering whether relying on the user to set FANOUT_LEAF to 1
> >>> is reasonable, considering this is not a default. Are you suggesting
> >>> defaulting to this for small systems? If not, then I guess the
> >>> optimization will not be enabled by default. Eventually, with this
> >>> patch set, if we are moving forward with this approach, I will remove
> >>> the config option for the per-CPU blocked list altogether so that it is
> >>> enabled by default. That's kind of my plan if we agree on this, but it
> >>> is just at the RFC stage 🙂.
> >>
> >> So the fanout solution works great when there are grace periods in
> >> progress. I see no throughput drop, and consistent performance with
> >> read-side critical sections. However, if we switch to having no grace
> >> periods continuously in progress, I can see the throughput dropping
> >> quite a bit here (-30%). I can't explain that, but I do not see that
> >> issue with per-CPU lists.
> > 
> > Might this be due to the change in number of tasks?  Not having the
> > thread that continuously runs grace periods might be affecting scheduling
> > decisions, and with CPU overcommit, those scheduling decisions can cause
> > large changes in throughput.  Plus there are other spinlocks that might
> > be subject to vCPU preemption, including the various scheduler spinlocks.
> 
> Yeah, these are all possible; currently studying it more :)

Looking forward to seeing what you find!

> >> With the per-CPU list scheme, blocking does not involve the node at
> >> all, as long as there is no grace period in progress. So, in that
> >> sense, the per-CPU blocked list is completely detached from RCU - it is
> >> a bit like lazy RCU, in the sense that instead of a callback, it is the
> >> blocking task that sits on a per-CPU list, relieving RCU of the burden.
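For concreteness, here is a minimal kernel-style sketch of the general
idea as described above -- not the RFC patch itself, and all names are
hypothetical: each CPU keeps a private lock and list, so a reader that
blocks while no grace period is in progress touches only its own CPU's
state, never a shared rcu_node ->lock.

------------------------------------------------------------------------

/*
 * Hypothetical sketch only -- not the RFC patch itself.  Each CPU keeps a
 * private lock and list of tasks that were preempted while inside an RCU
 * read-side critical section.
 */
#include <linux/cpumask.h>
#include <linux/init.h>
#include <linux/list.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>

struct blocked_reader_list {
	raw_spinlock_t lock;
	struct list_head head;		/* Readers preempted on this CPU. */
};

static DEFINE_PER_CPU(struct blocked_reader_list, blocked_readers);

static void __init blocked_readers_init(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		struct blocked_reader_list *brl = per_cpu_ptr(&blocked_readers, cpu);

		raw_spin_lock_init(&brl->lock);
		INIT_LIST_HEAD(&brl->head);
	}
}

/* Hypothetically called from the context-switch path for a preempted reader. */
static void note_blocked_reader(struct list_head *entry)
{
	struct blocked_reader_list *brl = this_cpu_ptr(&blocked_readers);
	unsigned long flags;

	/* Only this CPU's lock is taken; the rcu_node ->lock is untouched. */
	raw_spin_lock_irqsave(&brl->lock, flags);
	list_add(entry, &brl->head);
	raw_spin_unlock_irqrestore(&brl->lock, flags);
}

------------------------------------------------------------------------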
> > 
> > Unless I am seriously misreading your patch, the grace-period kthread still
> > acquires your per-CPU locks.
> 
> Yes, but I am not triggering grace periods (in the tests where I am
> expecting an improvement). It is in those tests that I am seeing the
> throughput drop with FANOUT, but let me confirm that again. I did run it
> 200 times and noticed this. I'm not sure what else a fanout of one for
> leaves does, but this is my chance to learn about it :).

Well, TREE09 has tested at least one aspect of this configuration quite
thoroughly over the years.  ;-)

> I am saying when there are no GPs active (that is when the optimization
> in these patches is active). In one of the patches, if a grace period is
> in progress or has already started, I do not trigger the optimization.
> The optimization applies only when grace periods are not active. This is
> similar to lazy RCU, where, if we have active grace periods in progress,
> we don't really make new RCU callbacks lazy since it is pointless.
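As a hedged sketch of that gating, reusing the hypothetical
note_blocked_reader() from the earlier sketch and the existing
rcu_gp_in_progress() helper within kernel/rcu/tree.c (the slow-path name
below is hypothetical), and ignoring the race with a grace period that
starts right after the check -- the very race the earlier question about
GP init/cleanup racing with preempted readers is getting at:

------------------------------------------------------------------------

/*
 * Hypothetical sketch of the gating described above: take the per-CPU
 * fast path only when no grace period is in progress; otherwise queue on
 * the leaf rcu_node as mainline does today.  A grace period starting
 * concurrently with the check is deliberately not handled here.
 */
static void rcu_note_preempted_reader(struct list_head *entry)
{
	if (!rcu_gp_in_progress()) {
		/* No GP can be waiting on this reader: record it locally. */
		note_blocked_reader(entry);
		return;
	}

	/* A GP may need to wait on this reader: use the rcu_node path. */
	queue_on_leaf_rcu_node(entry);		/* Hypothetical slow path. */
}

------------------------------------------------------------------------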

Interesting.  When there is no grace period is also when it is least
harmful to acquire the rcu_node structure's ->lock.

> > Also, reducing the number of grace periods should *reduce* contention
> > on the rcu_node ->lock.
> > 
> >> Maybe the extra layer of the node tree (with fanout == 1) somehow adds
> >> unnecessary overhead that does not exist with per-CPU lists? Even
> >> though there is this throughput drop, it still does better than the
> >> baseline with a common RCU node.
> >>
> >> Based on this, I would say the per-CPU blocked list is still worth
> >> doing. Thoughts?
> > 
> > I think that we need to understand the differences before jumping
> > to conclusions.  There are a lot of possible reasons for changes in
> > throughput, especially given the CPU overload.  After all, queuing
> > theory suggests high variance in that case, possibly even on exactly
> > the same setup.
> 
> Sure, that's why I'm doing hundreds of runs to get repeatable results and
> cut back on the outliers. But it is quite challenging to study all
> possibilities given the time constraints. I'm trying to collect as many
> traces as I can and study them. The synchronize_rcu() latency that I just
> improved, for instance, came from one such exercise.

There is absolutely nothing wrong with experiments!

                                                        Thanx, Paul
