During device unbind-rebind (hotplug) cycles, the i915 driver hits a
BUG_ON in intel_wakeref.h when retiring stale requests that outlive the
device unbind. Pending requests on the timelines are not forced to
retire before device teardown. On rebind, fresh engine structures are
created with new PM wakeref counters initialized to zero. If a stale
request from the previous device instance is still queued, the retire
worker retires it and tries to drop a PM wakeref that was never
acquired, underflowing the counter.
```
<2> [368.095702] kernel BUG at ./drivers/gpu/drm/i915/intel_wakeref.h:157!
...
<4> [368.099735] Workqueue: i915-unordered engine_retire [i915]
...
<4> [368.100280] Call Trace:
<4> [368.100280]  <TASK>
<4> [368.100280]  intel_context_exit+0xf1/0x1b0 [i915]
<4> [368.100280]  ? i915_request_retire.part.0+0xb0/0x520 [i915]
<4> [368.106309]  i915_request_retire.part.0+0x1b9/0x520 [i915]
<4> [368.107123]  i915_request_retire+0x1c/0x40 [i915]
<4> [368.107123]  engine_retire+0x122/0x180 [i915]
<4> [368.109586]  process_one_work+0x239/0x740
<4> [368.109586]  worker_thread+0x200/0x3f0
<4> [368.109586]  ? __pfx_worker_thread+0x10/0x10
<4> [368.109586]  kthread+0x10d/0x150
<4> [368.109586]  ? __pfx_kthread+0x10/0x10
<4> [368.109586]  ret_from_fork+0x3bd/0x470
<4> [368.109586]  ? __pfx_kthread+0x10/0x10
<4> [368.109586]  ret_from_fork_asm+0x1a/0x30
<4> [368.109586]  </TASK>
```

The fix forces retirement of all pending requests in
intel_gt_fini_requests() before cancelling the delayed work. This ensures
requests are fully retired before the engines are torn down, so that none
can be retired later against a freshly initialized device. A check is
also added to intel_context_exit_engine() to skip the engine PM put when
the wakeref count is already zero, as a safety net for any remaining
races.
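
For illustration only, the underflow and the guard can be modelled in
userspace C. This is a sketch under assumptions, not driver code: struct
model_wakeref and the model_* helpers are hypothetical names, and assert()
stands in for the BUG_ON.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an engine PM wakeref; not the i915 struct. */
struct model_wakeref {
	atomic_int count;
};

/* A freshly created engine starts with a zero count, as after rebind. */
static void model_wakeref_init(struct model_wakeref *wf)
{
	atomic_init(&wf->count, 0);
}

static void model_wakeref_get(struct model_wakeref *wf)
{
	atomic_fetch_add(&wf->count, 1);
}

/* Unguarded put: the assert models the BUG_ON that fires on underflow. */
static void model_wakeref_put(struct model_wakeref *wf)
{
	assert(atomic_load(&wf->count) > 0);
	atomic_fetch_sub(&wf->count, 1);
}

/*
 * Guarded put: skip the drop when nothing is held. Returns true if a
 * reference was dropped. The read and the decrement are separate steps,
 * so in this model the guard narrows a race rather than closing it.
 */
static bool model_wakeref_put_guarded(struct model_wakeref *wf)
{
	if (atomic_load(&wf->count) <= 0)
		return false;

	model_wakeref_put(wf);
	return true;
}
```

In this model, a stale put against a freshly initialized counter returns
false from model_wakeref_put_guarded() and leaves the count at zero,
whereas model_wakeref_put() would trip the assert.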

Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/16037
Fixes: dea397e818b1 ("drm/i915/gt: Flush retire.work timer object on unload")
Signed-off-by: Sebastian Brzezinka <[email protected]>
---
 drivers/gpu/drm/i915/gt/intel_context.c     | 5 +++++
 drivers/gpu/drm/i915/gt/intel_gt_requests.c | 3 +++
 2 files changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index b1b8695ba7c9..90fc755f551a 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -475,6 +475,11 @@ void intel_context_enter_engine(struct intel_context *ce)
 
 void intel_context_exit_engine(struct intel_context *ce)
 {
+       if (unlikely(atomic_read(&ce->engine->wakeref.count) <= 0)) {
+               intel_timeline_exit(ce->timeline);
+               return;
+       }
+
        intel_timeline_exit(ce->timeline);
        intel_engine_pm_put(ce->engine);
 }
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_requests.c b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
index 93298820bee2..8f22438bc5d9 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_requests.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
@@ -230,6 +230,9 @@ void intel_gt_unpark_requests(struct intel_gt *gt)
 
 void intel_gt_fini_requests(struct intel_gt *gt)
 {
+       intel_gt_retire_requests(gt);
+       flush_delayed_work(&gt->requests.retire_work);
+
        /* Wait until the work is marked as finished before unloading! */
        cancel_delayed_work_sync(&gt->requests.retire_work);
 
-- 
2.53.0
