During device hotplug unbind-rebind cycles, the i915 driver crashes with
a BUG_ON in intel_wakeref.h when retiring stale requests that outlive
the device unbind.

The crash occurs because pending requests in timelines are not forced
to retire before device teardown. Upon rebind, fresh engine structures
are created with new PM wakeref counters initialized to zero. If a stale
request from the previous device instance is still queued, it will
execute in the retire worker and attempt to drop a PM wakeref that was
never acquired, causing underflow.

```
<2> [368.095702] kernel BUG at ./drivers/gpu/drm/i915/intel_wakeref.h:157!
...
<4> [368.099735] Workqueue: i915-unordered engine_retire [i915]
...
<4> [368.100280] Call Trace:
<4> [368.100280] <TASK>
<4> [368.100280] intel_context_exit+0xf1/0x1b0 [i915]
<4> [368.100280] ? i915_request_retire.part.0+0xb0/0x520 [i915]
<4> [368.106309] i915_request_retire.part.0+0x1b9/0x520 [i915]
<4> [368.107123] i915_request_retire+0x1c/0x40 [i915]
<4> [368.107123] engine_retire+0x122/0x180 [i915]
<4> [368.109586] process_one_work+0x239/0x740
<4> [368.109586] worker_thread+0x200/0x3f0
<4> [368.109586] ? __pfx_worker_thread+0x10/0x10
<4> [368.109586] kthread+0x10d/0x150
<4> [368.109586] ? __pfx_kthread+0x10/0x10
<4> [368.109586] ret_from_fork+0x3bd/0x470
<4> [368.109586] ? __pfx_kthread+0x10/0x10
<4> [368.109586] ret_from_fork_asm+0x1a/0x30
<4> [368.109586] </TASK>
```

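For illustration only, the unbalanced put can be reproduced in a small
userspace model. This is a sketch under stated assumptions, not driver
code: every model_* name is hypothetical, and rebind is simplified to
zeroing the counter on the same structure rather than allocating fresh
engine structures.

```c
/* Userspace model only: all model_* names are hypothetical, not i915 symbols. */
#include <assert.h>
#include <stdbool.h>

struct model_wakeref {
	int count;	/* stands in for the engine PM wakeref counter */
};

struct model_engine {
	struct model_wakeref wakeref;
};

struct model_request {
	struct model_engine *engine;	/* engine the request retires against */
};

/* Take a reference on the wakeref. */
static void model_wakeref_get(struct model_wakeref *wf)
{
	assert(wf->count >= 0);
	wf->count++;
}

/*
 * Drop a reference. Returns false in the situation where the real
 * driver would hit the BUG_ON: a put with no reference left to drop.
 */
static bool model_wakeref_put(struct model_wakeref *wf)
{
	if (wf->count <= 0)
		return false;	/* underflow: put without a matching get */
	wf->count--;
	return true;
}

/* Submit: the request takes a wakeref on its engine. */
static void model_submit(struct model_request *rq, struct model_engine *engine)
{
	rq->engine = engine;
	model_wakeref_get(&engine->wakeref);
}

/* Rebind (simplified assumption): engine state restarts with a zero count. */
static void model_rebind(struct model_engine *engine)
{
	engine->wakeref.count = 0;
}

/* Retire: drops the reference that was taken at submit time. */
static bool model_retire(struct model_request *rq)
{
	return model_wakeref_put(&rq->engine->wakeref);
}
```

Calling model_submit(), then model_rebind(), then model_retire() on the
same request makes model_retire() return false, which corresponds to the
put that has no matching get in the trace above.
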
The fix forces retirement of all pending requests in
intel_gt_fini_requests() before cancelling the delayed work. This
ensures requests are fully retired before engines are torn down,
preventing them from reexecuting on a freshly initialized device.

A check is also added to intel_context_exit_engine() to safely skip the
engine PM put if the wakeref count is already zero, providing a safety
net for any remaining races.

Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/16037
Fixes: dea397e818b1 ("drm/i915/gt: Flush retire.work timer object on unload")
Signed-off-by: Sebastian Brzezinka <[email protected]>
---
 drivers/gpu/drm/i915/gt/intel_context.c     | 5 +++++
 drivers/gpu/drm/i915/gt/intel_gt_requests.c | 3 +++
 2 files changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index b1b8695ba7c9..90fc755f551a 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -475,6 +475,11 @@ void intel_context_enter_engine(struct intel_context *ce)
 
 void intel_context_exit_engine(struct intel_context *ce)
 {
+	if (unlikely(atomic_read(&ce->engine->wakeref.count) <= 0)) {
+		intel_timeline_exit(ce->timeline);
+		return;
+	}
+
 	intel_timeline_exit(ce->timeline);
 	intel_engine_pm_put(ce->engine);
 }
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_requests.c b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
index 93298820bee2..8f22438bc5d9 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_requests.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
@@ -230,6 +230,9 @@ void intel_gt_unpark_requests(struct intel_gt *gt)
 
 void intel_gt_fini_requests(struct intel_gt *gt)
 {
+	intel_gt_retire_requests(gt);
+	flush_delayed_work(&gt->requests.retire_work);
+
 	/* Wait until the work is marked as finished before unloading! */
 	cancel_delayed_work_sync(&gt->requests.retire_work);
 
-- 
2.53.0
