On 1/6/2026 2:24 PM, Paul E. McKenney wrote:
> On Tue, Jan 06, 2026 at 10:08:51AM -0500, Joel Fernandes wrote:
>>
>>
>> On 1/5/2026 7:55 PM, Joel Fernandes wrote:
>>>> Also if so, would the following rather simpler patch do the same trick,
>>>> if accompanied by CONFIG_RCU_FANOUT_LEAF=1?
>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
>>>> index 6a319e2926589..04dbee983b37d 100644
>>>> --- a/kernel/rcu/Kconfig
>>>> +++ b/kernel/rcu/Kconfig
>>>> @@ -198,9 +198,9 @@ config RCU_FANOUT
>>>>  
>>>>  config RCU_FANOUT_LEAF
>>>>    int "Tree-based hierarchical RCU leaf-level fanout value"
>>>> -  range 2 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
>>>> -  range 2 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
>>>> -  range 2 3 if RCU_STRICT_GRACE_PERIOD
>>>> +  range 1 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
>>>> +  range 1 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
>>>> +  range 1 3 if RCU_STRICT_GRACE_PERIOD
>>>>    depends on TREE_RCU && RCU_EXPERT
>>>>    default 16 if !RCU_STRICT_GRACE_PERIOD
>>>>    default 2 if RCU_STRICT_GRACE_PERIOD
>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>> This passes a quick 20-minute rcutorture smoke test.  Does it provide
>>>> similar performance benefits?
>>>
>>> I tried this out, and it also brings down the contention and solves the
>>> problem I saw (in testing so far).
>>>
>>> Would this also work if the test had grace-period init/cleanup racing with
>>> preempted RCU read-side critical sections? I'm doing longer tests now to see
>>> how this performs under GP stress, versus my solution. I am also seeing that
>>> with just the node lists, not the per-CPU list, there is a dramatic
>>> throughput drop after some amount of time, but I can't explain it. And I do
>>> not see this with the per-CPU list solution (I'm currently testing whether I
>>> see the same throughput drop with the fanout solution you proposed).
>>>
>>> I'm also wondering whether relying on the user to set FANOUT_LEAF to 1 is
>>> reasonable, considering it is not the default. Are you suggesting defaulting
>>> to this for small systems? If not, then I guess the optimization will not be
>>> enabled by default. Eventually, with this patch set, if we move forward with
>>> this approach, I will remove the config option for the per-CPU blocked list
>>> altogether so that it is enabled by default. That's roughly my plan if we
>>> agree on this, but it is still at the RFC stage 🙂.
>>
>> So the fanout solution works great when there are grace periods in progress.
>> I see no throughput drop, and consistent performance with read-side critical
>> sections. However, if we switch to having no grace periods continuously in
>> progress, I see the throughput dropping quite a bit here (-30%). I can't
>> explain that, but I do not see that issue with per-CPU lists.
> 
> Might this be due to the change in number of tasks?  Not having the
> thread that continuously runs grace periods might be affecting scheduling
> decisions, and with CPU overcommit, those scheduling decisions can cause
> large changes in throughput.  Plus there are other spinlocks that might
> be subject to vCPU preemption, including the various scheduler spinlocks.

Yeah, these are all possible; I am currently studying it more :)

>> With the per-CPU list scheme, blocking does not involve the node at all, as
>> long as there is no grace period in progress. So, in that sense, the per-CPU
>> blocked list is completely detached from RCU - it is a bit like lazy RCU in
>> the sense that, instead of a callback, it is the blocking task that sits on a
>> per-CPU list, relieving RCU of the burden.
> 
> Unless I am seriously misreading your patch, the grace-period kthread still
> acquires your per-CPU locks.

Yes, but I am not triggering grace periods (in the tests where I am expecting an
improvement). It is in those tests that I am seeing the throughput drop with
FANOUT, but let me confirm that again; I did run it 200 times and noticed this.
I'm not sure what else a leaf fanout of one changes, but this is my chance to
learn about it :).

I am saying that when there are no GPs active, that is when the optimization in
these patches kicks in. In one of the patches, if a grace period is in progress
or has already started, I do not trigger the optimization; it applies only when
grace periods are not active. This is similar to lazy RCU, where, if there are
grace periods already in progress, we don't make new RCU callbacks lazy since
it would be pointless.
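
To make the gating concrete, here is a rough sketch of the idea -- emphatically
not the actual patch. The per-CPU structure and the function below are made up
for illustration; only rcu_gp_in_progress(), the rcu_node ->blkd_tasks list,
the task_struct ->rcu_node_entry field, and the *_rcu_node() lock wrappers are
existing kernel pieces, and the irq/initialization details as well as the race
with a GP starting right after the check are all elided:

/* Illustrative only: per-CPU list of tasks preempted in RCU readers. */
struct rcu_cpu_blkd {
	raw_spinlock_t lock;
	struct list_head blkd_tasks;
};
static DEFINE_PER_CPU(struct rcu_cpu_blkd, rcu_cpu_blkd);

/* Sketch of queueing a task that was preempted in an RCU read-side section. */
static void rcu_preempt_queue_blkd_sketch(struct task_struct *t,
					  struct rcu_node *rnp)
{
	struct rcu_cpu_blkd *pcb = this_cpu_ptr(&rcu_cpu_blkd);

	if (!rcu_gp_in_progress()) {
		/* No GP active: keep the blocked task local to this CPU, */
		/* without ever touching the contended rcu_node ->lock.   */
		raw_spin_lock(&pcb->lock);
		list_add(&t->rcu_node_entry, &pcb->blkd_tasks);
		raw_spin_unlock(&pcb->lock);
		return;
	}

	/* A GP is active: fall back to the usual rcu_node blocked list. */
	raw_spin_lock_rcu_node(rnp);
	list_add(&t->rcu_node_entry, &rnp->blkd_tasks);
	raw_spin_unlock_rcu_node(rnp);
}

The window where a GP begins right after the rcu_gp_in_progress() check is of
course exactly why the GP kthread ends up acquiring the per-CPU locks, as you
noted above.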

> Also, reducing the number of grace periods should *reduce* contention on the
> rcu_node ->lock.
> 
>> Maybe the extra layer of the node tree (with fanout == 1) somehow adds
>> unnecessary overhead that does not exist with per-CPU lists? Even though
>> there is this throughput drop, it still does better than the baseline with a
>> common RCU node.
>>
>> Based on this, I would say the per-CPU blocked list is still worth doing.
>> Thoughts?
> 
> I think that we need to understand the differences before jumping
> to conclusions.  There are a lot of possible reasons for changes in
> throughput, especially given the CPU overload.  After all, queuing
> theory suggests high variance in that case, possibly even on exactly
> the same setup.

Sure, that's why I'm doing hundreds of runs, to get repeatable results and cut
back on the outliers. But it is quite challenging to study all the possibilities
given the time constraints. I'm trying to collect as many traces as I can and
study them. The synchronize_rcu() latency that I recently improved, for
instance, came out of one such exercise.

Thanks.

