On Wed, Mar 18, 2026 at 02:55:48PM -0700, Boqun Feng wrote:
> On Wed, Mar 18, 2026 at 02:52:48PM -0700, Boqun Feng wrote:
> [...]
> > > Ah so it is an ABBA deadlock, not a ABA self-deadlock. I guess this is a
> > > different issue, from the NMI issue? It is more of an issue of calling
> > > call_srcu  API with scheduler locks held.
> > > 
> > > Something like below I think:
> > > 
> > >   CPU A (BPF tracepoint)                   CPU B (concurrent call_srcu)
> > >   ----------------------                   ----------------------------
> > >   [1] holds &rq->__lock
> > >                                            [2] call_srcu()
> > >                                                -> srcu_gp_start_if_needed()
> > >                                                -> srcu_funnel_gp_start()
> > >                                                -> spin_lock_irqsave_ssp_contention()
> > >                                            [3] holds srcu locks
> > > 
> > >   [4] calls call_rcu_tasks_trace()         [5] srcu_funnel_gp_start() (cont.)
> > >       -> call_srcu()                           -> queue_delayed_work()
> > >       -> srcu_gp_start_if_needed()             -> __queue_work()
> > >       -> srcu_funnel_gp_start()                -> wake_up_worker()
> > >       -> spin_lock_irqsave_ssp_contention()    -> try_to_wake_up()
> > >       -> WANTS srcu locks                  [6] WANTS rq->__lock
> > 
> > I see, we can also have a self-deadlock even without CPU B, when CPU A
> > goes to try_to_wake_up() a worker on the same CPU.
> > 
> > An interesting observation is that the deadlock can be avoided if
> > queue_delayed_work() uses a non-zero delay, as that means a timer will
> > be armed instead of acquiring the rq lock.
> > 

If my observation is correct, then the patch below can probably fix the
deadlock issue with the runqueue lock (untested though), but it won't work
if the BPF tracepoint can fire with the timer base lock held.

Regards,
Boqun

------>
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 2328827f8775..a5d67264acb5 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1061,6 +1061,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
        struct srcu_node *snp_leaf;
        unsigned long snp_seq;
        struct srcu_usage *sup = ssp->srcu_sup;
+       bool irqs_were_disabled;

        /* Ensure that snp node tree is fully initialized before traversing it */
        if (smp_load_acquire(&sup->srcu_size_state) < SRCU_SIZE_WAIT_BARRIER)
@@ -1098,6 +1099,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,

        /* Top of tree, must ensure the grace period will be started. */
        raw_spin_lock_irqsave_ssp_contention(ssp, &flags);
+       irqs_were_disabled = irqs_disabled_flags(flags);
        if (ULONG_CMP_LT(sup->srcu_gp_seq_needed, s)) {
                /*
                 * Record need for grace period s.  Pair with load
@@ -1118,9 +1120,16 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
                // it isn't.  And it does not have to be.  After all, it
                // can only be executed during early boot when there is only
                // the one boot CPU running with interrupts still disabled.
+               //
+               // If irqs were disabled when call_srcu() was called, we
+               // could be in the scheduler path with a runqueue lock held.
+               // Delay the process_srcu() work by one more jiffy so we
+               // don't go through the kick_pool() -> wake_up_process()
+               // path below, thus avoiding deadlock on the runqueue lock.
                if (likely(srcu_init_done))
                        queue_delayed_work(rcu_gp_wq, &sup->work,
-                                          !!srcu_get_delay(ssp));
+                                          !!srcu_get_delay(ssp) +
+                                          !!irqs_were_disabled);
                else if (list_empty(&sup->work.work.entry))
                        list_add(&sup->work.work.entry, &srcu_boot_list);
        }
