> On Jan 7, 2026, at 8:35 PM, Joel Fernandes <[email protected]> wrote:
> 
> 
> 
>> On Jan 7, 2026, at 8:02 PM, Joel Fernandes <[email protected]> wrote:
>> 
>> 
>> 
>>>> On Jan 7, 2026, at 6:15 PM, Frederic Weisbecker <[email protected]> wrote:
>>> 
>>> Le Thu, Jan 01, 2026 at 11:34:10AM -0500, Joel Fernandes a écrit :
>>>> From: Yao Kai <[email protected]>
>>>> 
>>>> Commit 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in
>>>> __rcu_read_unlock()") removed the recursion-protection code from
>>>> __rcu_read_unlock(). As a result, with ftrace enabled, we can trigger an
>>>> endless loop through raise_softirq_irqoff() as follows:
>>>> 
>>>> WARNING: CPU: 0 PID: 0 at kernel/trace/trace.c:3021 __ftrace_trace_stack.constprop.0+0x172/0x180
>>>> Modules linked in: my_irq_work(O)
>>>> CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G O 6.18.0-rc7-dirty #23 PREEMPT(full)
>>>> Tainted: [O]=OOT_MODULE
>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
>>>> RIP: 0010:__ftrace_trace_stack.constprop.0+0x172/0x180
>>>> RSP: 0018:ffffc900000034a8 EFLAGS: 00010002
>>>> RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000
>>>> RDX: 0000000000000003 RSI: ffffffff826d7b87 RDI: ffffffff826e9329
>>>> RBP: 0000000000090009 R08: 0000000000000005 R09: ffffffff82afbc4c
>>>> R10: 0000000000000008 R11: 0000000000011d7a R12: 0000000000000000
>>>> R13: ffff888003874100 R14: 0000000000000003 R15: ffff8880038c1054
>>>> FS:  0000000000000000(0000) GS:ffff8880fa8ea000(0000) knlGS:0000000000000000
>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> CR2: 000055b31fa7f540 CR3: 00000000078f4005 CR4: 0000000000770ef0
>>>> PKRU: 55555554
>>>> Call Trace:
>>>> <IRQ>
>>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>>> trace_event_buffer_commit+0x5c/0x260
>>>> trace_event_raw_event_softirq+0x47/0x80
>>>> raise_softirq_irqoff+0x6e/0xa0
>>>> rcu_read_unlock_special+0xb1/0x160
>>>> unwind_next_frame+0x203/0x9b0
>>>> __unwind_start+0x15d/0x1c0
>>>> arch_stack_walk+0x62/0xf0
>>>> stack_trace_save+0x48/0x70
>>>> __ftrace_trace_stack.constprop.0+0x144/0x180
>>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>>> trace_event_buffer_commit+0x5c/0x260
>>>> trace_event_raw_event_softirq+0x47/0x80
>>>> raise_softirq_irqoff+0x6e/0xa0
>>>> rcu_read_unlock_special+0xb1/0x160
>>>> unwind_next_frame+0x203/0x9b0
>>>> __unwind_start+0x15d/0x1c0
>>>> arch_stack_walk+0x62/0xf0
>>>> stack_trace_save+0x48/0x70
>>>> __ftrace_trace_stack.constprop.0+0x144/0x180
>>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>>> trace_event_buffer_commit+0x5c/0x260
>>>> trace_event_raw_event_softirq+0x47/0x80
>>>> raise_softirq_irqoff+0x6e/0xa0
>>>> rcu_read_unlock_special+0xb1/0x160
>>>> unwind_next_frame+0x203/0x9b0
>>>> __unwind_start+0x15d/0x1c0
>>>> arch_stack_walk+0x62/0xf0
>>>> stack_trace_save+0x48/0x70
>>>> __ftrace_trace_stack.constprop.0+0x144/0x180
>>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>>> trace_event_buffer_commit+0x5c/0x260
>>>> trace_event_raw_event_softirq+0x47/0x80
>>>> raise_softirq_irqoff+0x6e/0xa0
>>>> rcu_read_unlock_special+0xb1/0x160
>>>> __is_insn_slot_addr+0x54/0x70
>>>> kernel_text_address+0x48/0xc0
>>>> __kernel_text_address+0xd/0x40
>>>> unwind_get_return_address+0x1e/0x40
>>>> arch_stack_walk+0x9c/0xf0
>>>> stack_trace_save+0x48/0x70
>>>> __ftrace_trace_stack.constprop.0+0x144/0x180
>>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>>> trace_event_buffer_commit+0x5c/0x260
>>>> trace_event_raw_event_softirq+0x47/0x80
>>>> __raise_softirq_irqoff+0x61/0x80
>>>> __flush_smp_call_function_queue+0x115/0x420
>>>> __sysvec_call_function_single+0x17/0xb0
>>>> sysvec_call_function_single+0x8c/0xc0
>>>> </IRQ>
>>>> 
>>>> Commit b41642c87716 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
>>>> fixed the infinite loop in rcu_read_unlock_special() for IRQ work by
>>>> setting a flag before calling irq_work_queue_on(). Fix this issue in the
>>>> same way by setting that flag before calling raise_softirq_irqoff(), and
>>>> rename the flag to defer_qs_pending to make it more general.
>>>> 
>>>> Fixes: 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()")
>>>> Reported-by: Tengda Wu <[email protected]>
>>>> Signed-off-by: Yao Kai <[email protected]>
>>>> Reviewed-by: Joel Fernandes <[email protected]>
>>>> Signed-off-by: Joel Fernandes <[email protected]>
>>> 
>>> Looks good but, BTW, what happens if rcu_qs() is called
>>> before rcu_preempt_deferred_qs() had a chance to be called?
>> 
>> Could you provide an example of when that can happen?
>> 
>> If rcu_qs() results in reporting of a quiescent state up the node tree 
>> before the deferred reporting work had a chance to act, then indeed we 
>> should be clearing the flag (after canceling the pending 
>> raise_softirq_irqoff()).
>> 
>> That said, even if it were to happen (and I think you are right that it
>> can), we would still go through the path that reports deferred quiescent
>> states and cancels the pending work (resets the flag).
>> 
>>> current->rcu_read_unlock_special.b.need_qs is reset by rcu_qs(), so
>>> subsequent calls to rcu_read_unlock() won't issue rcu_read_unlock_special()
>>> (unless the task is blocked). And further calls to rcu_preempt_deferred_qs()
>>> through rcu_core() will be ignored as well.
>> 
>> I am not sure if this implies that deferred quiescent state gets cancelled 
>> because we have already called unlock once. We have to go through the 
>> deferred quiescent state path on all subsequent quiescent state reporting, 
>> even if need_qs reset. How else will the GP complete.
>>> 
>>> But rdp->defer_qs_pending will remain in the DEFER_QS_PENDING state until
>>> the next grace period. And if rcu_read_unlock_special() is called again
>>> during the next GP in an unfortunate place needing a deferred qs, the state
>>> machine will spuriously assume that either rcu_core or the irq_work are
>>> pending, when none are anymore.
>>> 
>>> The state should be reset by rcu_qs().
>> 
>> In fact, I would say that if a deferred QS is pending, we should absolutely
>> not reset its state from rcu_qs().
>> 
>> Maybe we should reset it from rcu_report_qs_rdp/rnp?
>> 
>> Unfortunately, all of this is coming from me on a phone rather than at a
>> computer, so I will revise my response, probably tomorrow, because today the
>> human body is not cooperating.
>> 
>> thanks,
>> 
>> - Joel
> 
> 
> By the way, when I last tried to do the reset from rcu_qs(), it did not fix
> the original bug with the IRQ work recursion.
> 
> I found that it was always resetting the flag. But rcu_qs() is probably not
> even the right place to do it in the first place.

I think we also need to reset the flag in rcu_report_exp_rdp() if exp_hint is 
set and we have reported the expedited QS.
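
Roughly along these lines (untested sketch only, not a real patch; 
DEFER_QS_IDLE is just a placeholder name here for whatever the "nothing 
pending" value ends up being, and the existing body of rcu_report_exp_rdp() 
stays as it is today):

	static void rcu_report_exp_rdp(struct rcu_data *rdp)
	{
		/* ... existing expedited-QS reporting for this CPU ... */

		/*
		 * The expedited QS has now been reported, so any deferred-QS
		 * request made earlier from rcu_read_unlock_special() is moot.
		 * Drop the marker so a later rcu_read_unlock_special() during
		 * the next GP does not wrongly assume that rcu_core or the
		 * irq_work is still queued.
		 */
		if (READ_ONCE(rdp->defer_qs_pending) == DEFER_QS_PENDING)
			WRITE_ONCE(rdp->defer_qs_pending, DEFER_QS_IDLE);
	}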

I am working on a series to cover all of these cases and will send an RFC 
soon. However, the patch we are reviewing here can go in for this merge 
window, and the rest (further improvements) I am preparing for the next merge 
window, if that sounds good.
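
For anyone skimming the thread, the recursion-breaking idea in the patch under 
review is roughly the following (simplified, untested sketch, not the actual 
hunk; deferred_qs_needed() stands in for the real set of checks):

	static void rcu_read_unlock_special(struct task_struct *t)
	{
		struct rcu_data *rdp = this_cpu_ptr(&rcu_data);

		if (deferred_qs_needed(t)) {	/* placeholder for the real checks */
			/*
			 * Mark the deferral as pending *before* raising the
			 * softirq.  The softirq_raise tracepoint itself uses
			 * RCU, so its rcu_read_unlock() re-enters this
			 * function; seeing the marker already set, it bails
			 * out instead of raising again, which breaks the loop
			 * shown in the splat above.
			 */
			if (READ_ONCE(rdp->defer_qs_pending) != DEFER_QS_PENDING) {
				WRITE_ONCE(rdp->defer_qs_pending, DEFER_QS_PENDING);
				raise_softirq_irqoff(RCU_SOFTIRQ);
			}
		}
	}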

Thanks!

 - Joel


> 
> Thanks,
> 
> - Joel
> 
>> 
>> - Joel
>> 
>> 
>>> 
>>> Thanks.
>>> 
>>> --
>>> Frederic Weisbecker
>>> SUSE Labs
>>> 
