The deferred-QS irq-work handler previously cleared defer_qs_pending
only when the handler ran inside an active rcu_read_lock() critical
section (rcu_preempt_depth() > 0). Paul McKenney pointed out a common
multi-segment compound pattern where the handler fires between
segments and segment N+1's arming attempt is silently suppressed by
the rcu_read_unlock_special() pending-gate:
    rcu_read_lock();       // segment 1 starts
    // may be preempted/boosted here
    local_irq_disable();
    rcu_read_unlock();     // segment 1 ends; arms defer_qs_pending
    preempt_disable();
    local_irq_enable();    // handler MAY fire here: depth==0, but
                           // preempt is disabled, so it can't
                           // nudge.
    rcu_read_lock();       // segment 2 starts
    preempt_enable();
    local_irq_disable();
    rcu_read_unlock();     // arming attempt suppressed incorrectly -- (1)
    local_irq_enable();
Waiting for the next __note_gp_changes() clear is too slow for the
compound case; we need the deferred QS report sooner.
Therefore, make the irq_work handler clear defer_qs_pending whenever
rcu_in_compounded_section() is true so that (1) can do the arming.
In addition, introduce rcu_preempt_deferred_qs_try_report(), a small
helper that reports the deferred QS (and releases any RCU priority
boost) directly, but only from a clean context: outside any RCU reader
or compound section.
When the handler lands in such a clean context it now reports the QS
directly instead of merely nudging the scheduler: this makes the
irq_work robust under preempt=none / voluntary, where a
set_need_resched() nudge would not enter __schedule() at IRQ exit and
the QS would otherwise wait for the next tick. When still compounded,
the handler falls back to clearing defer_qs_pending as before. The
bounded-delay rescue hrtimer added in a later patch reuses this same
helper.
Signed-off-by: Joel Fernandes <[email protected]>
---
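Note for reviewers (below the "---" marker, so not part of the commit
log): the context check in rcu_preempt_deferred_qs_try_report() can be
sketched in userspace as follows. This is only an illustrative model,
not kernel code. The MODEL_* masks are hypothetical stand-ins that
assume the usual preempt_count() layout (8 preempt bits, 8 softirq
bits, then the hardirq bits), and model_context_is_clean() is a made-up
name that takes the RCU nesting depth and the preempt count as plain
arguments.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical masks; the bit widths are an assumption about the layout. */
#define MODEL_PREEMPT_MASK   0x000000ffu
#define MODEL_SOFTIRQ_MASK   0x0000ff00u
#define MODEL_SOFTIRQ_OFFSET 0x00000100u
#define MODEL_HARDIRQ_OFFSET 0x00010000u

/*
 * Model of the "clean context" test: the interrupted code must be
 * outside any RCU reader and must not have preemption or softirqs
 * disabled. The hardirq bits are ignored on purpose because the
 * caller itself runs in hardirq context.
 */
static bool model_context_is_clean(int rcu_depth, unsigned int pc)
{
	if (rcu_depth > 0)
		return false;
	if (pc & (MODEL_PREEMPT_MASK | MODEL_SOFTIRQ_MASK))
		return false;
	return true;
}

/* Walk through the four cases the handler can land in. */
static void model_self_check(void)
{
	unsigned int hardirq = MODEL_HARDIRQ_OFFSET;

	/* Interrupted plain task code: the QS may be reported directly. */
	assert(model_context_is_clean(0, hardirq));
	/* Interrupted an RCU reader: fall back to clearing the pending flag. */
	assert(!model_context_is_clean(1, hardirq));
	/* Interrupted a preempt_disable() region (segment gap in the log above). */
	assert(!model_context_is_clean(0, hardirq | 1u));
	/* Interrupted a softirq-disabled region. */
	assert(!model_context_is_clean(0, hardirq | MODEL_SOFTIRQ_OFFSET));
}
```

In the third case the handler fires between segments with preemption
disabled, which is the situation where the patch clears
defer_qs_pending so that the next rcu_read_unlock() can rearm.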
kernel/rcu/tree_plugin.h | 46 ++++++++++++++++++++++++++++------------
1 file changed, 33 insertions(+), 13 deletions(-)
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 8637f405cb47..9b167eaf8e0d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -622,7 +622,32 @@ notrace void rcu_preempt_deferred_qs(struct task_struct *t)
}
/*
- * Minimal handler to give the scheduler a chance to re-evaluate.
+ * Report a deferred quiescent state but only from a safe context.
+ *
+ * Both callers (the irq_work handler and the bounded-delay rescue hrtimer)
+ * run in hardirq context, so preempt_count() always has the HARDIRQ bit set;
+ * the compound-section check below deliberately inspects only the
+ * PREEMPT_MASK | SOFTIRQ_MASK bits, which reflect the INTERRUPTED caller's
+ * state, not ours.
+ */
+static bool rcu_preempt_deferred_qs_try_report(struct task_struct *t)
+{
+ unsigned long flags;
+
+ if (rcu_preempt_depth() > 0 ||
+ (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)))
+ return false;
+
+ if (rcu_preempt_need_deferred_qs(t)) {
+ local_irq_save(flags);
+ rcu_preempt_deferred_qs_irqrestore(t, flags);
+ }
+ return true;
+}
+
+/*
+ * Minimal handler to give the scheduler a chance to re-evaluate, and to
+ * report the deferred QS directly when the handler lands in a clean context.
*/
static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
{
@@ -632,19 +657,14 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
rdp = container_of(iwp, struct rcu_data, defer_qs_iw);
/*
- * If the IRQ work handler happens to run in the middle of RCU read-side
- * critical section, it could be ineffective in getting the scheduler's
- * attention to report a deferred quiescent state (the whole point of the
- * IRQ work). For this reason, requeue the IRQ work.
- *
- * Basically, we want to avoid following situation:
- * 1. rcu_read_unlock() queues IRQ work (state -> DEFER_QS_PENDING)
- * 2. CPU enters new rcu_read_lock()
- * 3. IRQ work runs but cannot report QS due to rcu_preempt_depth() > 0
- * 4. rcu_read_unlock() does not re-queue work (state still PENDING)
- * 5. Deferred QS reporting does not happen.
+ * If the handler fired in a clean context, report the deferred QS
+ * directly. This makes the irq_work robust under preempt=none /
+ * voluntary, where the set_need_resched() nudge would not enter
+ * __schedule() at IRQ exit. Otherwise we are still inside a reader /
+ * compound section: just clear defer_qs_pending so the next
+ * rcu_read_unlock() can rearm.
*/
- if (rcu_preempt_depth() > 0)
+ if (!rcu_preempt_deferred_qs_try_report(current))
rcu_defer_qs_clear(rdp);
}
--
2.34.1