Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT

Joel Fernandes Wed, 18 Mar 2026 13:25:58 -0700

On 3/18/2026 4:11 PM, Kumar Kartikeya Dwivedi wrote:
> On Wed, 18 Mar 2026 at 21:04, Joel Fernandes <[email protected]> wrote:
>>
>> On 3/18/2026 2:42 PM, Paul E. McKenney wrote:
>>> On Wed, Mar 18, 2026 at 08:51:16AM -0700, Boqun Feng wrote:
>>>> On Wed, Mar 18, 2026 at 03:43:05PM +0100, Sebastian Andrzej Siewior wrote:
>>>> [..]
>>>>>>>> way that vanilla RCU's call_rcu_core() function takes an early exit if
>>>>>>>> interrupts are disabled.  Of course, vanilla RCU can rely on things like
>>>>>>>> the scheduling-clock interrupt to start any needed grace periods [1],
>>>>>>>> but SRCU will instead need to manually defer this work, perhaps using
>>>>>>>> workqueues or IRQ work.
>>>>>>>>
>>>>>>>> In addition, rcutorture needs to be upgraded to sometimes invoke
>>>>>>>> ->call() with the scheduler pi lock held, but this change is not fixing
>>>>>>>> a regression, so could be deferred.  (There is already code in rcutorture
>>>>>>>> that invokes the readers while holding a scheduler pi lock.)
>>>>>>>>
>>>>>>>> Given that RCU for this week through the end of March belongs to you guys,
>>>>>>>> if one of you can get this done by end of day Thursday, London time,
>>>>>>>> very good!  Otherwise, I can put something together.
>>>>>>>>
>>>>>>>> Please let me know!
>>>>>>>
>>>>>>> Given that the current locking does allow it and lockdep should have
>>>>>>> complained, I am curious if we could rule that out ;)
>>>>>
>>>>> Your patch just s/spinlock_t/raw_spinlock_t so we get the locking/
>>>>> nesting right. The wakeup problem remains, right?
>>>>> But looking at the code, there is just srcu_funnel_gp_start(). If its
>>>>> srcu_schedule_cbs_sdp() / queue_delayed_work() usage is always delayed
>>>>> then there will be always a timer and never a direct wake up of the
>>>>> worker. Wouldn't that work?
>>>>
>>>> Late to the party, so just make sure I understand the problem. The
>>>> problem is the wakeup in call_srcu() when it's called with scheduler
>>>> lock held, right? If so I think the current code works as you
>>>> already explained, we defer the wakeup into a workqueue.
>>>
>>> The issue is that call_rcu_tasks() (which is call_srcu() now) is
>>> also invoked with a scheduler pi/rq lock held, which results in a
>>> deadlock cycle.  So the srcu_gp_start_if_needed() function's call to
>>> raw_spin_lock_irqsave_sdp_contention() must be deferred to the workqueue
>>> handler, not just the wake-up.  And that in turn means that the callback
>>> point also needs to be passed to this handler.
>>>
>>> See this email thread:
>>>
>>> https://lore.kernel.org/all/cap01t75ekpvw+95nqnwg9p-1+kzvzojpn0nlat+28sf1b9w...@mail.gmail.com/
>>>
>>>> (but Paul, we are not talking about calling call_srcu(), that requires
>>>> some more work to get it to work)
>>>
>>> Agreed, splitting srcu_gp_start_if_needed() and using a workqueue if
>>> interrupts were already disabled on entry.  Otherwise, directly invoking
>>> the split-out portion of srcu_gp_start_if_needed().
>>>
>>> But we might be talking past each other.
>>>
>>
>> Ah so it is an ABBA deadlock, not an ABA self-deadlock. I guess this is a
>> different issue from the NMI issue? It is more of an issue of calling
>> the call_srcu() API with scheduler locks held.
>>
>> Something like below I think:
>>
>>   CPU A (BPF tracepoint)                CPU B (concurrent call_srcu)
>>   ----------------------------         ------------------------------------
>>   [1] holds  &rq->__lock
>>                                         [2]
>>                                         -> call_srcu
>>                                         -> srcu_gp_start_if_needed
>>                                         -> srcu_funnel_gp_start
>>                                         -> spin_lock_irqsave_ssp_content...
>>                                           -> holds srcu locks
>>
>>   [4] calls  call_rcu_tasks_trace()      [5] srcu_funnel_gp_start (cont..)
>>                                                  -> queue_delayed_work
>>           -> call_srcu()                         -> __queue_work()
>>           -> srcu_gp_start_if_needed()           -> wake_up_worker()
>>           -> srcu_funnel_gp_start()              -> try_to_wake_up()
>>           -> spin_lock_irqsave_ssp_contention()  [6] WANTS  rq->__lock
>>           -> WANTS srcu locks
>>
>> If I understand this, this looks like an issue that can happen independent
>> of the conversion of the spin locks.
>>
> 
> Yes, this is a separate issue, we should make the conversion to raw
> spin locks anyway, but lockdep found this once we applied that fix
> from Paul.
> In sched-ext, we can end up calling call_srcu() while rq->lock is
> held, e.g. from exit_task() -> some bpf map that deletes an element ->
> call_srcu().
> There are other callbacks of course where it can be held, and other
> programs that can run tracing the kernel while it is held.
> 
Thanks. I guess I am also wondering, why didn't lockdep find it without the
conversion to raw spin locks though? An ABBA deadlock should have been
detected either way. Is there some difference in lockdep's ability to find
deadlocks depending on whether a spinlock is raw?
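
For reference, the ABBA ordering in the diagram above can be modeled with a
toy lock-order graph. This is only a userspace sketch and not how lockdep is
implemented; the names lock_graph, graph_init(), record_order(), RQ_LOCK and
SRCU_LOCK are made up for illustration. It records "inner acquired while outer
held" edges and refuses an edge that would close a cycle:

```c
/* Toy lock-order graph. Illustration only, NOT lockdep's implementation. */
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_LOCKS 8

/* Hypothetical lock-class ids standing in for rq->__lock and the SRCU locks. */
enum { RQ_LOCK, SRCU_LOCK };

struct lock_graph {
	/* edge[a][b] is true if lock b was acquired while lock a was held. */
	bool edge[MAX_LOCKS][MAX_LOCKS];
};

static void graph_init(struct lock_graph *g)
{
	memset(g, 0, sizeof(*g));
}

/* Depth-first search: can 'to' be reached from 'from' via recorded edges? */
static bool reachable(const struct lock_graph *g, int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < MAX_LOCKS; next++)
		if (g->edge[from][next] && !seen[next] &&
		    reachable(g, next, to, seen))
			return true;
	return false;
}

/*
 * Record "acquired 'inner' while holding 'outer'".
 * Returns false, without recording, if a path inner -> ... -> outer already
 * exists, because adding outer -> inner would then close a cycle.
 */
static bool record_order(struct lock_graph *g, int outer, int inner)
{
	bool seen[MAX_LOCKS] = { false };

	if (reachable(g, inner, outer, seen))
		return false;
	g->edge[outer][inner] = true;
	return true;
}
```

In this model, CPU B's path (SRCU locks held, then wanting rq->__lock) records
SRCU_LOCK -> RQ_LOCK, and CPU A's path (rq->__lock held, then wanting the SRCU
locks) is then rejected by record_order(&g, RQ_LOCK, SRCU_LOCK). It says
nothing about how lock wait types (raw versus non-raw) affect what gets
reported.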

Anyway, I am applying the raw lock conversion fix and running some more tests.

thanks,

--
Joel Fernandes