Paul McKenney noted that a softirq (or irq_work) handler arming for a
deferred QS can fire and find rcu_preempt_depth() > 0 -- the task is
still inside its outer reader, so rcu_preempt_need_deferred_qs() bails
without reporting the QS. At that point the queued mechanism has been
consumed but ->defer_qs_pending stays in DEFER_QS_PENDING.
In the meantime, the only remaining path back to a quiescent state on
this CPU may be a local_irq_disable()/_enable() pair that does not
call preempt_check_resched() (it is just `sti`/`cli`). patch 6's
unconditional set_need_resched_current() makes need_resched true, but
without an irq_work being raised the next outer rcu_read_unlock_special()
hits the P-gate at the arming code:
if (rdp->defer_qs_pending != DEFER_QS_PENDING) {
rdp->defer_qs_pending = DEFER_QS_PENDING;
irq_work_queue_on(...); // <-- skipped
}
so no irq_work is queued for the hardirq-exit preempt_schedule_irq()
path either. The deferred QS now waits until the next timer tick (or
similar preempt-safe boundary), needlessly extending expedited grace
period latency.
Clear ->defer_qs_pending in the bail-out path of rcu_preempt_deferred_qs()
when rcu_preempt_depth() > 0. The recursion guard semantics introduced
by commit b41642c87716 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ
work").
The clear is also safe against fresh recursion at this exact program
point: rcu_preempt_depth() > 0 guarantees we are still inside an outer
reader, so any inner rcu_read_unlock() from tracing infrastructure
brings nesting back to outer (>0), never to 0. The slow path of
rcu_read_unlock_special() is structurally unreachable under that
condition, so no recursive raise_softirq_irqoff()/irq_work_queue_on()
can be triggered by the clear. Essentially, the mechanism will work to
prevent the following recursion which Xiongfeng had previously reported:
irq_exit() -> __irq_exit_rcu()
-> tick_irq_exit() -> tick_nohz_irq_exit() -> tick_nohz_stop_sched_tick()
-> trace_tick_stop() // BPF prog hooked here
-> rcu_read_unlock_special()
-> irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu) // self-IPI
re-enters irq_exit
Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Joel Fernandes <[email protected]>
---
kernel/rcu/tree_plugin.h | 28 +++++++++++++++++++++++++++-
1 file changed, 27 insertions(+), 1 deletion(-)
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 7045f0deee17..960a45631098 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -612,9 +612,35 @@ static notrace bool rcu_preempt_need_deferred_qs(struct
task_struct *t)
notrace void rcu_preempt_deferred_qs(struct task_struct *t)
{
unsigned long flags;
+ struct rcu_data *rdp;
- if (!rcu_preempt_need_deferred_qs(t))
+ if (!rcu_preempt_need_deferred_qs(t)) {
+ /*
+ * If we got here from a softirq/irq_work that fired while
+ * rcu_preempt_depth() > 0, the deferred-QS mechanism has been
+ * consumed without doing any work:
rcu_preempt_need_deferred_qs()
+ * just returned false because the task is still in a reader, so
+ * the actual QS report has to wait for the next
+ * rcu_read_unlock().
+ *
+ * Clear ->defer_qs_pending here so the next outer
+ * rcu_read_unlock_special() can re-arm a fresh mechanism (in
+ * particular the irq_work path, which the local_irq_enable()
+ * recovery boundary cannot itself reschedule from).
+ *
+ * Recursion safety: rcu_preempt_depth() > 0 means we are inside
+ * an outer reader, so any inner rcu_read_unlock() reached via
+ * tracing (bpf programs attached to trace points) brings
+ * nesting to outer (> 0), never to 0, so no recursive
+ * raise_softirq_irqoff()/irq_work_queue_on() can be triggered
+ * by this clear.
+ */
+ if (rcu_preempt_depth() > 0) {
+ rdp = this_cpu_ptr(&rcu_data);
+ rcu_defer_qs_clear(rdp);
+ }
return;
+ }
local_irq_save(flags);
rcu_preempt_deferred_qs_irqrestore(t, flags);
}
--
2.34.1