This response was AI-generated by bug-bot. The analysis may contain errors — 
please verify independently.

## Bug Summary

This is an RCU stall and hung task deadlock on 7.0.0-rc1, triggered by perf
trace teardown during a perf interrupt storm. The perf subsystem's
tracepoint unregistration path now blocks on SRCU (tracepoint_srcu), which in
turn blocks on RCU grace period completion. When perf NMI storms delay RCU
progress, the stall cascades up that chain. Severity: system hang
(multiple tasks blocked >143s, eventual complete stall).

## Stack Trace Analysis

The bug involves three interacting blocked entities. Here are the decoded stack 
traces:

**1. repro2 (pid 4086) - blocked in perf trace teardown (close()):**
```
__x64_sys_close
  fput_close_sync
    __fput
      perf_release
        perf_event_release_kernel
          put_event
            __free_event
              perf_trace_destroy
                perf_trace_event_unreg         [kernel/trace/trace_event_perf.c:154]
                  tracepoint_synchronize_unregister [include/linux/tracepoint.h:116]
                    synchronize_srcu(&tracepoint_srcu)
                      __synchronize_srcu
                        wait_for_completion    ← BLOCKED
```

**2. kworker/0:0 (pid 9) and kworker/0:1 (pid 11) - SRCU grace period workers:**
```
Workqueue: rcu_gp process_srcu
process_srcu                                   [kernel/rcu/srcutree.c:1304]
  srcu_advance_state                           [kernel/rcu/srcutree.c:1161]
    try_check_zero                             [kernel/rcu/srcutree.c:1171]
      srcu_readers_active_idx_check            [kernel/rcu/srcutree.c:544]
        synchronize_rcu()                      ← SRCU-fast path, line 569
          synchronize_rcu_normal
            wait_for_completion                ← BLOCKED
```

**3. repro2 (pid 4093) - RCU stall source:**
```
rcu: Tasks blocked on level-0 rcu_node (CPUs 0-1): P4093
task:repro2   state:R  running task
  (running in futex_wake syscall, interrupted by timer IRQ)
  asm_sysvec_apic_timer_interrupt
    irqentry_exit → preempt_schedule_irq → __schedule
      finish_task_switch
```

The trace shows process context for the hung tasks and interrupt context (timer 
IRQ) for the RCU stall detection. The kworkers are in D (uninterruptible sleep) 
state, blocked in wait_for_completion() within the SRCU grace period state 
machine.

## Root Cause Analysis

This is a regression introduced by commit a46023d5616ed ("tracing: Guard 
__DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast"), which switched 
tracepoint read-side protection from preempt_disable()+RCU to SRCU-fast via 
DEFINE_SRCU_FAST(tracepoint_srcu).

The root cause is a new coupling between SRCU grace period processing and RCU 
grace period completion that did not exist before. The deadlock chain is:

1. The reproducer creates perf events using tracepoints, then closes them while
generating heavy perf interrupt load. The perf NMI storms ("perf:
interrupt took too long" messages escalating from 69ms to 336ms) consume most
CPU time, starving RCU quiescent state detection.

2. When the perf fd is closed, perf_trace_event_unreg() 
(kernel/trace/trace_event_perf.c:154) calls tracepoint_synchronize_unregister() 
(include/linux/tracepoint.h:116), which now calls 
synchronize_srcu(&tracepoint_srcu) instead of synchronize_rcu().

3. The SRCU grace period for tracepoint_srcu is processed by process_srcu() 
running in the rcu_gp workqueue. Because tracepoint_srcu is DEFINE_SRCU_FAST, 
its srcu_reader_flavor includes SRCU_READ_FLAVOR_FAST, which is part of 
SRCU_READ_FLAVOR_SLOWGP.

4. In srcu_readers_active_idx_check() (kernel/rcu/srcutree.c:544), when 
SRCU_READ_FLAVOR_SLOWGP is detected, the function calls synchronize_rcu() (line 
569) instead of smp_mb() (line 301 in non-fast path). This is the key design 
tradeoff of SRCU-fast: faster readers (no smp_mb() on read side) at the cost of 
slower grace periods (synchronize_rcu() on update side).

5. synchronize_rcu() calls synchronize_rcu_normal(), which sleeps in
wait_for_completion() until an RCU grace period completes. But the RCU grace
period is stalled because the perf interrupt storms are preventing CPUs from
passing through quiescent states quickly enough.

6. Since process_srcu is blocked waiting for synchronize_rcu(), the 
tracepoint_srcu SRCU grace period cannot advance, so 
synchronize_srcu(&tracepoint_srcu) in the perf teardown path also blocks 
indefinitely.
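
The branch described in step 4 can be modeled in userspace C. This is a minimal
sketch, not the kernel's code: the `MODEL_*` names and the flag values are
hypothetical stand-ins chosen for illustration, and the enum only records which
barrier the update side would pick for a given set of reader flavors.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values only; the kernel's real bit assignments may differ. */
#define MODEL_FLAVOR_NORMAL 0x1u
#define MODEL_FLAVOR_FAST   0x8u
/* In this model, "slow GP" flavors are exactly the fast-reader flavors. */
#define MODEL_FLAVOR_SLOWGP (MODEL_FLAVOR_FAST)

enum model_gp_barrier {
    MODEL_BARRIER_SMP_MB,          /* cheap: a memory barrier on the update side */
    MODEL_BARRIER_SYNCHRONIZE_RCU  /* expensive: wait for a full RCU grace period */
};

/* Pick the update-side barrier for the reader flavors seen on this srcu_struct. */
enum model_gp_barrier model_pick_barrier(unsigned int reader_flavors)
{
    assert(reader_flavors != 0); /* at least one reader flavor must be recorded */

    if (reader_flavors & MODEL_FLAVOR_SLOWGP)
        return MODEL_BARRIER_SYNCHRONIZE_RCU;
    return MODEL_BARRIER_SMP_MB;
}

/* True when the SRCU grace period cannot advance until RCU's does. */
bool model_srcu_gp_waits_on_rcu(unsigned int reader_flavors)
{
    return model_pick_barrier(reader_flavors) == MODEL_BARRIER_SYNCHRONIZE_RCU;
}
```

In this model, a structure that has only seen `MODEL_FLAVOR_NORMAL` readers
never depends on RCU, while any structure with `MODEL_FLAVOR_FAST` readers
does, which is the coupling that steps 5 and 6 rely on.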

The pre-existing condition (perf NMI storms causing RCU stalls) was previously 
tolerable because the perf teardown path used synchronize_rcu() directly (via 
the old tracepoint_synchronize_unregister()), which would eventually complete 
once the RCU stall resolved. Now, with SRCU-fast, there is an additional layer 
of indirection: perf teardown waits on SRCU, SRCU processing waits on RCU, and 
both the SRCU workqueue threads and the perf teardown task are stuck.
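
The extra layer of indirection can be made concrete with a small wait-for
graph. The following userspace C sketch is an illustration under assumptions:
the `model_*` names and the two tables are hypothetical and encode only the
dependencies stated above, not any kernel data structure.

```c
#include <assert.h>

enum model_waiter {
    MODEL_PERF_TEARDOWN,   /* task in close() -> perf_trace_event_unreg() */
    MODEL_SRCU_GP_WORKER,  /* kworker running process_srcu() */
    MODEL_RCU_GP,          /* the RCU grace period itself */
    MODEL_NR_WAITERS
};

#define MODEL_NO_DEP (-1)  /* waits on nothing else in this model */

/* Before the SRCU-fast switch: teardown waits on RCU directly. */
const int model_waits_on_before[MODEL_NR_WAITERS] = {
    [MODEL_PERF_TEARDOWN]  = MODEL_RCU_GP,
    [MODEL_SRCU_GP_WORKER] = MODEL_NO_DEP,  /* not involved in teardown */
    [MODEL_RCU_GP]         = MODEL_NO_DEP,
};

/* After the switch: teardown waits on SRCU, whose worker waits on RCU. */
const int model_waits_on_after[MODEL_NR_WAITERS] = {
    [MODEL_PERF_TEARDOWN]  = MODEL_SRCU_GP_WORKER,
    [MODEL_SRCU_GP_WORKER] = MODEL_RCU_GP,
    [MODEL_RCU_GP]         = MODEL_NO_DEP,
};

/* Number of hops from `start` to the entity that blocks the whole chain. */
int model_chain_depth(const int *waits_on, int start)
{
    int hops = 0;
    int cur = start;

    assert(start >= 0 && start < MODEL_NR_WAITERS);
    /* The hop bound guards against an accidental cycle in the table. */
    while (waits_on[cur] != MODEL_NO_DEP && hops < MODEL_NR_WAITERS) {
        cur = waits_on[cur];
        hops++;
    }
    return hops;
}

/* The entity at the end of the chain starting from `start`. */
int model_root_blocker(const int *waits_on, int start)
{
    int hops = 0;
    int cur = start;

    assert(start >= 0 && start < MODEL_NR_WAITERS);
    while (waits_on[cur] != MODEL_NO_DEP && hops < MODEL_NR_WAITERS) {
        cur = waits_on[cur];
        hops++;
    }
    return cur;
}
```

In both tables the root blocker for the teardown task is the RCU grace period;
what changes is the chain depth (one hop before, two after) and the fact that
the SRCU worker is now itself a blocked waiter.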

## Affected Versions

This is a regression in v7.0-rc1. The bug was introduced by commit 
a46023d5616ed ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with 
SRCU-fast"), which was merged via the trace-v7.0 merge (3c6e577d5ae70). The 
underlying SRCU-fast infrastructure was added by commit c4020620528e4 ("srcu: 
Add SRCU-fast readers") and 4d86b1e7e1e98 ("srcu: Add SRCU_READ_FLAVOR_SLOWGP 
to flag need for synchronize_rcu()"), but the regression became triggerable 
only when a46023d5616ed applied SRCU-fast to the tracepoint_srcu used in the 
perf event teardown path.

Kernels before v7.0-rc1 (i.e., v6.x and earlier) are not affected, as they used 
preempt_disable()+RCU for tracepoint protection, and 
tracepoint_synchronize_unregister() called synchronize_rcu() directly without 
SRCU involvement.

## Relevant Commits and Fixes

Key commits in the causal chain:

- a46023d5616ed ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() 
with SRCU-fast") - the commit that introduced the regression by switching 
tracepoints to SRCU-fast
- a77cb6a867667 ("srcu: Fix warning to permit SRCU-fast readers in NMI 
handlers") - immediate predecessor fix
- c4020620528e4 ("srcu: Add SRCU-fast readers") - added the SRCU-fast reader API
- 4d86b1e7e1e98 ("srcu: Add SRCU_READ_FLAVOR_SLOWGP to flag need for 
synchronize_rcu()") - added the synchronize_rcu()-instead-of-smp_mb() logic in 
SRCU grace period processing
- 16718274ee75d ("tracing: perf: Have perf tracepoint callbacks always disable 
preemption") - preparatory commit for the SRCU-fast switch

No fix for this specific issue was found in mainline or in any -next branches 
as of today.

## Prior Discussions

No prior reports of this specific RCU stall / SRCU deadlock triggered via perf 
trace teardown with SRCU-fast were found on lore.kernel.org. The original 
SRCU-fast tracepoint series was posted at 
https://lore.kernel.org/all/[email protected]/ (linked from 
the commit message), motivated by enabling preemptible BPF on tracepoints for 
RT systems 
(https://lore.kernel.org/all/[email protected]/). 
No discussion of the synchronize_rcu()-from-workqueue stall scenario appears to 
have taken place in those threads.

## Suggested Actions

1. Confirm the regression by testing with the parent commit a77cb6a867667 
(immediately before a46023d5616ed). If the issue disappears, this confirms the 
SRCU-fast tracepoint switch as the cause.

2. As a quick workaround, reverting a46023d5616ed (and its preparatory commits 
a77cb6a867667, f7d327654b886, 16718274ee75d if needed) should eliminate the 
deadlock, at the cost of losing preemptible BPF tracepoint support.

3. The fundamental issue is that process_srcu() for SRCU-fast structures calls 
synchronize_rcu() synchronously from workqueue context. Possible fixes include:
   - Using an asynchronous mechanism (e.g., call_rcu() with a callback to 
resume SRCU GP processing) instead of blocking synchronize_rcu() within the 
SRCU state machine.
   - Having srcu_readers_active_idx_check() use poll_state_synchronize_rcu() 
and defer retrying instead of blocking.
   - Bounding the perf interrupt rate escalation to prevent the RCU stall in 
the first place (though this would only mask the underlying SRCU↔RCU coupling 
issue).

4. If you can reproduce reliably, adding the following debug options would 
provide more information: CONFIG_RCU_TRACE=y, CONFIG_PROVE_RCU=y, and booting 
with rcutree.rcu_kick_kthreads=1 to see if kicking the RCU threads helps break 
the stall.
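
The polling idea from item 3 could look roughly like the following userspace C
sketch. It is not a patch: the `mock_*` functions are simplified stand-ins for
the kernel's polled grace-period API (start_poll_synchronize_rcu() and
poll_state_synchronize_rcu()), and `struct model_srcu_check`, the cookie
arithmetic, and the result enum are assumptions made for illustration.

```c
#include <stdbool.h>

/* Mock RCU state: how many grace periods have completed so far. */
unsigned long mock_rcu_gp_completed;

/* Stand-in for start_poll_synchronize_rcu(): cookie for the next full GP. */
unsigned long mock_start_poll_synchronize_rcu(void)
{
    return mock_rcu_gp_completed + 1;
}

/* Stand-in for poll_state_synchronize_rcu(): has that GP completed yet? */
bool mock_poll_state_synchronize_rcu(unsigned long cookie)
{
    return mock_rcu_gp_completed >= cookie;
}

/* Hypothetical per-check state the SRCU worker would carry across retries. */
struct model_srcu_check {
    bool have_cookie;
    unsigned long cookie;
};

enum model_check_result {
    MODEL_CHECK_RETRY_LATER,  /* RCU GP still pending: requeue, do not sleep */
    MODEL_CHECK_DONE          /* RCU GP elapsed: safe to re-read the counters */
};

/* Non-blocking replacement for "call synchronize_rcu() and wait". */
enum model_check_result model_check_nonblocking(struct model_srcu_check *c)
{
    if (!c->have_cookie) {
        /* First attempt: request a grace period and remember its cookie. */
        c->cookie = mock_start_poll_synchronize_rcu();
        c->have_cookie = true;
    }
    if (!mock_poll_state_synchronize_rcu(c->cookie))
        return MODEL_CHECK_RETRY_LATER;

    c->have_cookie = false;  /* ready for the next check */
    return MODEL_CHECK_DONE;
}
```

In this model the worker returns `MODEL_CHECK_RETRY_LATER` instead of sleeping
in D state, so the kworker stays available for other work. The SRCU grace
period still cannot finish before the RCU grace period does, so this sketch
addresses the blocked workers rather than the underlying RCU stall.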

