Hi Joel,

On Mon, Jun 09, 2025 at 02:01:24PM -0400, Joel Fernandes wrote:
> During rcu_read_unlock_special(), if this happens during irq_exit(), we
> can lock up if an IPI is issued. This is because the IPI itself triggers
> the irq_exit() path, causing a recursive lockup.
>
> This is precisely what Xiongfeng found when invoking a BPF program on
> the trace_tick_stop() tracepoint, as shown in the trace below. Fix by
> using context-tracking to tell us if we're still in an IRQ.
> Context-tracking keeps track of the IRQ until after the tracepoint, so
> it cures the issue.
>
This does fix the issue, but do we know when the CPU will eventually
report a QS after this fix? I believe we still want to report a QS as
early as possible in this case?

Regards,
Boqun

> irq_exit()
>   __irq_exit_rcu()
>     /* in_hardirq() returns false after this */
>     preempt_count_sub(HARDIRQ_OFFSET)
>     tick_irq_exit()
>       tick_nohz_irq_exit()
>         tick_nohz_stop_sched_tick()
>           trace_tick_stop() /* a bpf prog is hooked on this trace point */
>             __bpf_trace_tick_stop()
>               bpf_trace_run2()
>                 rcu_read_unlock_special()
>                   /* will send an IPI to itself */
>                   irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
>
> A simple reproducer can also be obtained by doing the following in
> tick_irq_exit(). It will hang on boot without the patch:
>
>  static inline void tick_irq_exit(void)
>  {
> +	rcu_read_lock();
> +	WRITE_ONCE(current->rcu_read_unlock_special.b.need_qs, true);
> +	rcu_read_unlock();
> +
>
> While at it, add some comments to this code.
>
> Reported-by: Xiongfeng Wang <wangxiongfe...@huawei.com>
> Closes: https://lore.kernel.org/all/9acd5f9f-6732-7701-6880-4b51190aa...@huawei.com/
> Tested-by: Xiongfeng Wang <wangxiongfe...@huawei.com>
> Signed-off-by: Joel Fernandes <joelagn...@nvidia.com>
> ---
>  kernel/rcu/tree_plugin.h | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 3c0bbbbb686f..53d8b3415776 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -653,6 +653,9 @@ static void rcu_read_unlock_special(struct task_struct *t)
>  		struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>  		struct rcu_node *rnp = rdp->mynode;
>
> +		// In cases where the RCU-reader is boosted, we'd attempt deboost sooner than
> +		// later to prevent inducing latency to other RT tasks. Also, expedited GPs
> +		// should not be delayed too much. Track both these needs in expboost.
>  		expboost = (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
>  			   (rdp->grpmask & READ_ONCE(rnp->expmask)) ||
>  			   (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) &&
> @@ -670,10 +673,15 @@ static void rcu_read_unlock_special(struct task_struct *t)
>  			// Also if no expediting and no possible deboosting,
>  			// slow is OK.  Plus nohz_full CPUs eventually get
>  			// tick enabled.
> +			//
> +			// Also prevent doing this if context-tracking thinks
> +			// we're handling an IRQ (including when we're exiting
> +			// one -- required to prevent self-IPI deadloops).
>  			set_tsk_need_resched(current);
>  			set_preempt_need_resched();
>  			if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
> -			    expboost && !rdp->defer_qs_iw_pending && cpu_online(rdp->cpu)) {
> +			    expboost && !rdp->defer_qs_iw_pending && cpu_online(rdp->cpu) &&
> +			    !ct_in_irq()) {
>  				// Get scheduler to re-evaluate and call hooks.
>  				// If !IRQ_WORK, FQS scan will eventually IPI.
>  				if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) &&
> --
> 2.34.1
>
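
P.S. For illustration only, here is a tiny userspace model of the ordering
the changelog relies on: the HARDIRQ part of preempt_count is dropped in
__irq_exit_rcu() before tick_irq_exit() runs the tracepoint, while the
context-tracking view of the IRQ is only cleared after __irq_exit_rcu()
returns. None of the symbols below are the real kernel helpers; they are
stand-ins to show why a check like ct_in_irq() still sees the IRQ at the
point where the self-IPI used to be queued:

/* Toy model, not kernel code. All names below are illustrative. */
#include <stdbool.h>
#include <stdio.h>

static int hardirq_count;   /* models the HARDIRQ bits of preempt_count */
static int ct_irq_nesting;  /* models the context-tracking IRQ nesting */

static bool in_hardirq(void) { return hardirq_count > 0; }
static bool ct_in_irq(void)  { return ct_irq_nesting > 0; }

static void trace_tick_stop(void)
{
	/* A BPF prog hooked here may end up in rcu_read_unlock_special(). */
	printf("trace_tick_stop: in_hardirq()=%d ct_in_irq()=%d\n",
	       in_hardirq(), ct_in_irq());
	/*
	 * A preempt-count based check already says "not in IRQ" here,
	 * so nothing stops queueing the deferred-QS irq_work and
	 * re-entering irq_exit(); the context-tracking style check
	 * still says "in IRQ" and lets us skip the self-IPI.
	 */
}

static void __irq_exit_rcu(void)
{
	hardirq_count--;   /* preempt_count_sub(HARDIRQ_OFFSET) */
	trace_tick_stop(); /* via tick_irq_exit() -> tick_nohz_* */
}

static void irq_exit(void)
{
	__irq_exit_rcu();
	ct_irq_nesting--;  /* context tracking leaves the IRQ only here */
}

int main(void)
{
	/* IRQ entry: both views of "in IRQ" go up. */
	hardirq_count++;
	ct_irq_nesting++;

	irq_exit();
	return 0;
}

At the tracepoint this prints in_hardirq()=0 ct_in_irq()=1, which is
exactly the window in which the recursive self-IPI was being queued.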