Paul McKenney noted that a softirq (or irq_work) handler arming for a
deferred QS can fire and find rcu_preempt_depth() > 0 -- the task is
still inside its outer reader, so rcu_preempt_need_deferred_qs() bails
without reporting the QS.  At that point the queued mechanism has been
consumed but ->defer_qs_pending stays in DEFER_QS_PENDING.

In the meantime, the only remaining path back to a quiescent state on
this CPU may be a local_irq_disable()/_enable() pair that does not
call preempt_check_resched() (it is just `sti`/`cli`).  patch 6's
unconditional set_need_resched_current() makes need_resched true, but
without an irq_work being raised the next outer rcu_read_unlock_special()
hits the P-gate at the arming code:

    if (rdp->defer_qs_pending != DEFER_QS_PENDING) {
        rdp->defer_qs_pending = DEFER_QS_PENDING;
        irq_work_queue_on(...);                 // <-- skipped
    }

so no irq_work is queued for the hardirq-exit preempt_schedule_irq()
path either.  The deferred QS now waits until the next timer tick (or
similar preempt-safe boundary), needlessly extending expedited grace
period latency.

Clear ->defer_qs_pending in the bail-out path of rcu_preempt_deferred_qs()
when rcu_preempt_depth() > 0.  The recursion guard semantics introduced
by commit b41642c87716 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ
work").

The clear is also safe against fresh recursion at this exact program
point: rcu_preempt_depth() > 0 guarantees we are still inside an outer
reader, so any inner rcu_read_unlock() from tracing infrastructure
brings nesting back to outer (>0), never to 0.  The slow path of
rcu_read_unlock_special() is structurally unreachable under that
condition, so no recursive raise_softirq_irqoff()/irq_work_queue_on()
can be triggered by the clear. Essentially, the mechanism will work to
prevent the following recursion which Xiongfeng had previously reported:

irq_exit() -> __irq_exit_rcu()
  -> tick_irq_exit() -> tick_nohz_irq_exit() -> tick_nohz_stop_sched_tick()
    -> trace_tick_stop()                    // BPF prog hooked here
      -> rcu_read_unlock_special()
        -> irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu)   // self-IPI 
re-enters irq_exit

Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Joel Fernandes <[email protected]>
---
 kernel/rcu/tree_plugin.h | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 7045f0deee17..960a45631098 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -612,9 +612,35 @@ static notrace bool rcu_preempt_need_deferred_qs(struct 
task_struct *t)
 notrace void rcu_preempt_deferred_qs(struct task_struct *t)
 {
        unsigned long flags;
+       struct rcu_data *rdp;
 
-       if (!rcu_preempt_need_deferred_qs(t))
+       if (!rcu_preempt_need_deferred_qs(t)) {
+               /*
+                * If we got here from a softirq/irq_work that fired while
+                * rcu_preempt_depth() > 0, the deferred-QS mechanism has been
+                * consumed without doing any work: 
rcu_preempt_need_deferred_qs()
+                * just returned false because the task is still in a reader, so
+                * the actual QS report has to wait for the next
+                * rcu_read_unlock().
+                *
+                * Clear ->defer_qs_pending here so the next outer
+                * rcu_read_unlock_special() can re-arm a fresh mechanism (in
+                * particular the irq_work path, which the local_irq_enable()
+                * recovery boundary cannot itself reschedule from).
+                *
+                * Recursion safety: rcu_preempt_depth() > 0 means we are inside
+                * an outer reader, so any inner rcu_read_unlock() reached via
+                * tracing (bpf programs attached to trace points) brings
+                * nesting to outer (> 0), never to 0, so no recursive
+                * raise_softirq_irqoff()/irq_work_queue_on() can be triggered
+                * by this clear.
+                */
+               if (rcu_preempt_depth() > 0) {
+                       rdp = this_cpu_ptr(&rcu_data);
+                       rcu_defer_qs_clear(rdp);
+               }
                return;
+       }
        local_irq_save(flags);
        rcu_preempt_deferred_qs_irqrestore(t, flags);
 }
-- 
2.34.1


Reply via email to