From: Sonam Sanju <[email protected]>

On Wed, Apr 01, 2026 at 05:34:58PM +0800, Kunwu Chan wrote:
> Building on the discussion so far, it would be helpful from the SRCU
> side to gather a bit more evidence to classify the issue.
>
> Calling synchronize_srcu_expedited() while holding a mutex is generally
> valid, so the observed behavior may be workload-dependent.

> The reported deadlock seems to rely on the assumption that SRCU grace
> period progress is indirectly blocked by irqfd workqueue saturation.
> It would be good to confirm whether that assumption actually holds.

I went back through our logs from two independent crash instances and
can now provide data for each of your questions.

> 1) Are SRCU GP kthreads/workers still making forward progress when
> the system is stuck?

No.  In both crash instances, process_srcu work items remain permanently
"pending" (never "in-flight") throughout the entire hang.

Instance 1 —  kernel 6.18.8, pool 14 (cpus=3):

  [  62.712760] workqueue rcu_gp: flags=0x108
  [  62.717801]   pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
  [  62.717801]     pending: 2*process_srcu

  [  187.735092] workqueue rcu_gp: flags=0x108           (125 seconds later)
  [  187.735093]   pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
  [  187.735093]     pending: 2*process_srcu              (still pending)

  9 consecutive dumps from t=62s to t=312s — process_srcu never runs.

Instance 2 —  kernel 6.18.2, pool 22 (cpus=5):

  [  93.280711] workqueue rcu_gp: flags=0x108
  [  93.280713]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
  [  93.280716]     pending: process_srcu

  [  309.040801] workqueue rcu_gp: flags=0x108           (216 seconds later)
  [  309.040806]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
  [  309.040806]     pending: process_srcu               (still pending)

  8 consecutive dumps from t=93s to t=341s — process_srcu never runs.

In both cases, rcu_gp's process_srcu is bound to the SAME per-CPU pool
where the kvm-irqfd-cleanup workers are blocked.  Both pools have idle
workers but are marked as hung/stalled:

  Instance 1: pool 14: cpus=3 hung=174s workers=11 idle: 4046 4038 4045 4039 4043 156 77 (7 idle)
  Instance 2: pool 22: cpus=5 hung=297s workers=12 idle: 4242 51 4248 4247 4245 435 4244 4239 (8 idle)
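
For context on why a permanently "pending" process_srcu matters: in Tree
SRCU, the grace-period state machine is driven by the process_srcu() work
item queued on the rcu_gp workqueue, and __synchronize_srcu() just waits
for a callback that this machinery invokes once the grace period ends.  A
much-simplified sketch of that dependency (not the literal code in
kernel/rcu/srcutree.c; details differ by kernel version):

  /*
   * Much-simplified sketch of the synchronize_srcu() dependency; not
   * the literal implementation in kernel/rcu/srcutree.c.
   */
  static void __synchronize_srcu(struct srcu_struct *ssp)
  {
          struct rcu_synchronize rs;

          init_completion(&rs.completion);

          /* Register a callback to run once the grace period ends; this
           * also gets the grace-period work (process_srcu) queued on the
           * rcu_gp workqueue. */
          call_srcu(ssp, &rs.head, wakeme_after_rcu);

          /* process_srcu() must actually execute for the grace period to
           * complete and the callback to fire.  If that work item stays
           * "pending" forever, this wait never returns. */
          wait_for_completion(&rs.completion);
  }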

> 2) How many irqfd workers are active in the reported scenario, and
> can they saturate CPU or worker pools?

4 kvm-irqfd-cleanup workers in both instances, consistently across all
dumps:

Instance 1 (pool 14 / cpus=3):

  [  62.831877] workqueue kvm-irqfd-cleanup: flags=0x0
  [  62.837838]   pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=4 refcnt=5
  [  62.837838]     in-flight: 157:irqfd_shutdown, 4044:irqfd_shutdown, 102:irqfd_shutdown, 39:irqfd_shutdown

Instance 2 (pool 22 / cpus=5):

  [  93.280894] workqueue kvm-irqfd-cleanup: flags=0x0
  [  93.280896]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=4 refcnt=5
  [  93.280900]     in-flight: 151:irqfd_shutdown, 4246:irqfd_shutdown, 4241:irqfd_shutdown, 4243:irqfd_shutdown

These are from crosvm instances with multiple virtio devices
(virtio-blk, virtio-net, virtio-input, etc.), each of which registers an
irqfd with a resampler.  During VM shutdown, all of these irqfds are
detached concurrently, queueing one irqfd_shutdown work item per irqfd.
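
For reference, the userspace side of this looks roughly as below.  This is
only a minimal sketch of the KVM_IRQFD ABI, not crosvm's actual code; the
helper names, fds and GSI number are illustrative:

  /* Minimal sketch of irqfd-with-resampler attach/detach via KVM_IRQFD.
   * Not crosvm's actual code; helper names, fds and gsi are illustrative. */
  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  static int attach_irqfd(int vm_fd, int irqfd, int resamplefd, __u32 gsi)
  {
          struct kvm_irqfd args = {
                  .fd = irqfd,
                  .gsi = gsi,
                  .flags = KVM_IRQFD_FLAG_RESAMPLE,
                  .resamplefd = resamplefd,
          };

          return ioctl(vm_fd, KVM_IRQFD, &args);        /* kvm_irqfd_assign() */
  }

  static int detach_irqfd(int vm_fd, int irqfd, __u32 gsi)
  {
          struct kvm_irqfd args = {
                  .fd = irqfd,
                  .gsi = gsi,
                  .flags = KVM_IRQFD_FLAG_DEASSIGN,
          };

          /* Each deassign queues an irqfd_shutdown work item on the
           * kvm-irqfd-cleanup workqueue. */
          return ioctl(vm_fd, KVM_IRQFD, &args);
  }

Each device has its own irqfd/resamplefd pair, so a VM with N such devices
produces N irqfd_shutdown work items at teardown.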

The 4 workers are not saturating CPU — they're all in D state.  But they
ARE all bound to the same per-CPU pool as rcu_gp's process_srcu work.

> 3) Do we have a concrete wait-for cycle showing that tasks blocked
> on resampler_lock are in turn preventing SRCU GP completion?

Yes, in both instances the hung task dump identifies the mutex holder
stuck in synchronize_srcu, with the other workers waiting on the mutex.

Instance 1 (t=314s):

  Worker pid 4044 — MUTEX HOLDER, stuck in synchronize_srcu:

    [  315.963979] task:kworker/3:8     state:D  pid:4044
    [  315.977125] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
    [  316.012504]  __synchronize_srcu+0x100/0x130
    [  316.023157]  irqfd_resampler_shutdown+0xf0/0x150  <-- offset 0xf0 (synchronize_srcu)

  Workers pid 39, 102, 157 — MUTEX WAITERS:

    [  314.793025] task:kworker/3:4     state:D  pid:157
    [  314.837472]  __mutex_lock+0x409/0xd90
    [  314.843100]  irqfd_resampler_shutdown+0x23/0x150  <-- offset 0x23 (mutex_lock)

Instance 2 (t=343s):

  Worker pid 4241 — MUTEX HOLDER, stuck in synchronize_srcu:

    [  343.193294] task:kworker/5:4     state:D  pid:4241
    [  343.193299] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
    [  343.193328]  __synchronize_srcu+0x100/0x130
    [  343.193335]  irqfd_resampler_shutdown+0xf0/0x150  <-- offset 0xf0 (synchronize_srcu)

  Workers pid 151, 4243, 4246 — MUTEX WAITERS:

    [  343.193369] task:kworker/5:6     state:D  pid:4243
    [  343.193397]  __mutex_lock+0x37d/0xbb0
    [  343.193397]  irqfd_resampler_shutdown+0x23/0x150  <-- offset 0x23 (mutex_lock)

Both instances show the identical wait-for cycle:

  1. One worker holds resampler_lock and blocks in __synchronize_srcu,
     waiting for an SRCU grace period (see the sketch after this list)
  2. SRCU GP needs process_srcu to run — but it stays "pending"
     on the same pool
  3. Other irqfd workers block on __mutex_lock in the same pool
  4. The pool is marked "hung" and no pending work makes progress
     for 250-300 seconds until kernel panic
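
For reference, a simplified sketch of the irqfd_resampler_shutdown()
pattern behind steps 1 and 3 (paraphrased from virt/kvm/eventfd.c from
memory; exact code differs across kernel versions, and the teardown of
the resampler itself is elided):

  /* Simplified; paraphrased from virt/kvm/eventfd.c, details vary by
   * kernel version, resampler teardown elided. */
  static void irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
  {
          struct kvm *kvm = irqfd->resampler->kvm;

          mutex_lock(&kvm->irqfds.resampler_lock);   /* +0x23: 3 workers wait here */

          list_del_rcu(&irqfd->resampler_link);
          synchronize_srcu(&kvm->irq_srcu);          /* +0xf0: holder blocks here  */

          /* ...free the resampler if its list is now empty... */

          mutex_unlock(&kvm->irqfds.resampler_lock);
  }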

> 4) Is the behavior reproducible in both irqfd_resampler_shutdown()
> and kvm_irqfd_assign() paths?

In all four of our crash instances, the stuck mutex holder is in
irqfd_resampler_shutdown() at offset 0xf0 (synchronize_srcu).  That is
consistent with these all being VM shutdown scenarios, in which only
irqfd_shutdown work items are queued.

The kvm_irqfd_assign() path was identified by Vineeth Pillai (Google)
during a VM create/destroy stress test where assign and shutdown race.
His traces showed kvm_irqfd (the assign path) stuck in
synchronize_srcu_expedited with irqfd_resampler_shutdown blocked on
the mutex, and workqueue pwq 46 at active=1024 refcnt=2062.
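
For completeness, here is the assign-side half of the same cycle as
implied by those traces.  This is an assumed shape of the resample branch
of kvm_irqfd_assign(); I have not verified it against the exact source of
his kernel:

  /* Assumed shape of kvm_irqfd_assign()'s resample branch, inferred
   * from Vineeth's traces; not verified against his kernel source. */
          mutex_lock(&kvm->irqfds.resampler_lock);
          /* ...find or create the resampler for this GSI... */
          list_add_rcu(&irqfd->resampler_link, &resampler->list);
          synchronize_srcu_expedited(&kvm->irq_srcu);  /* assign path stuck here      */
          mutex_unlock(&kvm->irqfds.resampler_lock);   /* shutdown path waits on this */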

> If SRCU GP remains independent, it would help distinguish whether
> this is a strict deadlock or a form of workqueue starvation / lock
> contention.

Based on the data from both instances, SRCU GP is NOT remaining
independent.  process_srcu stays pending on the affected per-CPU pool
for the full 250-300 seconds until the panic.  And it is not just
process_srcu: ALL pending work on that pool is stuck, including items
from the events, cgroup, mm, slub, and other workqueues.


> A timestamp-correlated dump (blocked stacks + workqueue state +
> SRCU GP activity) would likely be sufficient to classify this.

I hope the correlated dumps above from both instances are helpful.
To summarize the timeline (consistent across both):

  t=0:   VM shutdown begins, crosvm detaches irqfds
  t=~14: 4 irqfd_shutdown work items queued on WQ_PERCPU pool
         One worker acquires resampler_lock, enters synchronize_srcu
         Other 3 workers block on __mutex_lock
  t=~43: First "BUG: workqueue lockup" — pool detected stuck
         rcu_gp: process_srcu shown as "pending" on same pool
  t=~93 to t=~312: Repeated dumps every ~30s
         process_srcu remains permanently "pending"
         Pool has idle workers but no pending work executes
  t=~314: Hung task dump confirms mutex holder in __synchronize_srcu
  t=~316: init triggers sysrq crash → kernel panic

> Happy to help look at traces if available.

I can share the full console-ramoops-0 and dmesg-ramoops-0 from both
instances.  Shall I post them or send them off-list?

Thanks,
Sonam
