During device hotplug unbind-rebind cycles, the i915 driver crashes with
a BUG_ON in intel_wakeref.h when retiring stale requests that outlive the
device unbind. The crash occurs because pending requests in timelines are
not forced to retire before device teardown. Upon rebind, fresh engine
structures are created with new PM wakeref counters initialized to zero.
If a stale request from the previous device instance is still queued,
it will execute in the retire worker and attempt to drop a PM wakeref
that was never acquired, causing underflow.
```
<2> [368.095702] kernel BUG at ./drivers/gpu/drm/i915/intel_wakeref.h:157!
...
<4> [368.099735] Workqueue: i915-unordered engine_retire [i915]
...
<4> [368.100280] Call Trace:
<4> [368.100280]  <TASK>
<4> [368.100280]  intel_context_exit+0xf1/0x1b0 [i915]
<4> [368.100280]  ? i915_request_retire.part.0+0xb0/0x520 [i915]
<4> [368.106309]  i915_request_retire.part.0+0x1b9/0x520 [i915]
<4> [368.107123]  i915_request_retire+0x1c/0x40 [i915]
<4> [368.107123]  engine_retire+0x122/0x180 [i915]
<4> [368.109586]  process_one_work+0x239/0x740
<4> [368.109586]  worker_thread+0x200/0x3f0
<4> [368.109586]  ? __pfx_worker_thread+0x10/0x10
<4> [368.109586]  kthread+0x10d/0x150
<4> [368.109586]  ? __pfx_kthread+0x10/0x10
<4> [368.109586]  ret_from_fork+0x3bd/0x470
<4> [368.109586]  ? __pfx_kthread+0x10/0x10
<4> [368.109586]  ret_from_fork_asm+0x1a/0x30
<4> [368.109586]  </TASK>
```

The fix forces retirement of all pending requests in
intel_gt_fini_requests() before cancelling the delayed work. This ensures
requests are fully retired before engines are torn down, preventing them
from being retired against a freshly initialized device.
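
The ordering argument can be sketched with a small userspace model. This
is only an illustration, not driver code: every toy_* name is hypothetical,
the wakeref is reduced to a plain counter, and rebind is modelled by
resetting that counter to zero. It assumes, as described above, that a
stale request ends up dropping its reference against the fresh counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an engine PM wakeref; not the i915 type. */
struct toy_wakeref {
	int count;
};

/* Hypothetical request that took one wakeref when it was submitted. */
struct toy_request {
	bool pending;
};

static void toy_wakeref_get(struct toy_wakeref *wf)
{
	wf->count++;
}

/* Returns false where the real driver would hit the BUG_ON (underflow). */
static bool toy_wakeref_put(struct toy_wakeref *wf)
{
	assert(wf->count >= 0);
	if (wf->count == 0)
		return false;
	wf->count--;
	return true;
}

/* Retire a pending request; returns false if the put underflowed. */
static bool toy_retire(struct toy_request *rq, struct toy_wakeref *wf)
{
	if (!rq->pending)
		return true;	/* nothing left to retire */
	rq->pending = false;
	return toy_wakeref_put(wf);
}

/*
 * Model one unbind-rebind cycle. Returns true if every put was balanced.
 * retire_on_fini selects whether requests are retired before teardown.
 */
bool toy_unbind_rebind(bool retire_on_fini)
{
	struct toy_wakeref wf = { .count = 0 };
	struct toy_request rq = { .pending = true };
	bool ok = true;

	toy_wakeref_get(&wf);		/* request submitted on old instance */

	if (retire_on_fini)
		ok = toy_retire(&rq, &wf);	/* teardown retires it first */

	wf.count = 0;			/* rebind: fresh counter starts at zero */

	/* Retire worker runs after rebind and retires anything still pending. */
	return toy_retire(&rq, &wf) && ok;
}
```

In this model, toy_unbind_rebind(false) reports an unbalanced put (the
crash case), while toy_unbind_rebind(true) stays balanced because nothing
is left pending once the counter is reset.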

Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/16037
Fixes: dea397e818b1 ("drm/i915/gt: Flush retire.work timer object on unload")
Signed-off-by: Sebastian Brzezinka <[email protected]>
---
v1 -> v2:
Remove flush_delayed_work() from intel_gt_fini_requests()
to fix a deadlock. retire_work_handler requeues itself, so
flush_delayed_work() followed by cancel_delayed_work_sync() races
and can deadlock. cancel_delayed_work_sync() alone is sufficient: it
prevents requeueing and waits for running work.

Drop the wakeref guard from intel_context_exit_engine(). Skipping
intel_engine_pm_put() when the wakeref count is already zero masks the
symptom rather than fixing the root cause, and would silently hide any
future stale request scenarios through the same path, making them harder
to diagnose.
---
 drivers/gpu/drm/i915/gt/intel_gt_requests.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_requests.c b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
index 93298820bee2..99a58951c40a 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_requests.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
@@ -230,6 +230,8 @@ void intel_gt_unpark_requests(struct intel_gt *gt)
 
 void intel_gt_fini_requests(struct intel_gt *gt)
 {
+       intel_gt_retire_requests(gt);
+
        /* Wait until the work is marked as finished before unloading! */
        cancel_delayed_work_sync(&gt->requests.retire_work);
 
-- 
2.53.0
