During device hotplug unbind-rebind cycles, the i915 driver crashes with
a BUG_ON in intel_wakeref.h when retiring stale requests that outlive the
device unbind.

The crash occurs because pending requests in timelines are not forced to
retire before device teardown. Upon rebind, fresh engine structures are
created with new PM wakeref counters initialized to zero. If a stale
request from the previous device instance is still queued, it will
execute in the retire worker and attempt to drop a PM wakeref that was
never acquired, causing underflow.

```
<2> [368.095702] kernel BUG at ./drivers/gpu/drm/i915/intel_wakeref.h:157!
...
<4> [368.099735] Workqueue: i915-unordered engine_retire [i915]
...
<4> [368.100280] Call Trace:
<4> [368.100280]  <TASK>
<4> [368.100280]  intel_context_exit+0xf1/0x1b0 [i915]
<4> [368.100280]  ? i915_request_retire.part.0+0xb0/0x520 [i915]
<4> [368.106309]  i915_request_retire.part.0+0x1b9/0x520 [i915]
<4> [368.107123]  i915_request_retire+0x1c/0x40 [i915]
<4> [368.107123]  engine_retire+0x122/0x180 [i915]
<4> [368.109586]  process_one_work+0x239/0x740
<4> [368.109586]  worker_thread+0x200/0x3f0
<4> [368.109586]  ? __pfx_worker_thread+0x10/0x10
<4> [368.109586]  kthread+0x10d/0x150
<4> [368.109586]  ? __pfx_kthread+0x10/0x10
<4> [368.109586]  ret_from_fork+0x3bd/0x470
<4> [368.109586]  ? __pfx_kthread+0x10/0x10
<4> [368.109586]  ret_from_fork_asm+0x1a/0x30
<4> [368.109586]  </TASK>
```
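The underflow can be illustrated with a small userspace model. This is
only a sketch, not driver code: every toy_* name is hypothetical, and it
assumes for simplicity that the stale request ends up dropping its
reference on the counter that was reset to zero at rebind.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an engine's PM wakeref counter. */
struct toy_wakeref {
	int count;
};

/* Hypothetical stand-in for a request that holds one wakeref. */
struct toy_request {
	struct toy_wakeref *wf;	/* counter dropped when the request retires */
	bool retired;
};

/* Models engine (re)initialisation: the counter starts at zero. */
static void toy_engine_init(struct toy_wakeref *wf)
{
	wf->count = 0;
}

/* Returns false where the real driver would trip BUG_ON on underflow. */
static bool toy_wakeref_put(struct toy_wakeref *wf)
{
	if (wf->count <= 0)
		return false;
	wf->count--;
	return true;
}

/* Queueing a request takes one wakeref on the engine. */
static void toy_request_submit(struct toy_request *rq, struct toy_wakeref *wf)
{
	wf->count++;
	rq->wf = wf;
	rq->retired = false;
}

/* Retiring drops the wakeref once; later calls are no-ops. */
static bool toy_request_retire(struct toy_request *rq)
{
	if (rq->retired)
		return true;
	rq->retired = true;
	return toy_wakeref_put(rq->wf);
}

/*
 * One unbind-rebind cycle. Returns true if every put was balanced,
 * false if the late retire hit a zeroed counter.
 */
bool toy_cycle(bool retire_before_teardown)
{
	struct toy_wakeref engine;
	struct toy_request rq;
	bool ok = true;

	toy_engine_init(&engine);
	toy_request_submit(&rq, &engine);

	if (retire_before_teardown)
		ok = toy_request_retire(&rq);	/* balanced put before unbind */

	toy_engine_init(&engine);	/* rebind: counter reset to zero */

	/* The retire worker runs late; a no-op if already retired. */
	return toy_request_retire(&rq) && ok;
}
```

In this model, toy_cycle(false) reports the underflow, while
toy_cycle(true), which retires before teardown, does not.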
The fix forces retirement of all pending requests in
intel_gt_fini_requests() before cancelling the delayed work. This ensures
requests are fully retired before engines are torn down, preventing them
from executing again on a freshly initialized device.

Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/16037
Fixes: dea397e818b1 ("drm/i915/gt: Flush retire.work timer object on unload")
Signed-off-by: Sebastian Brzezinka <[email protected]>
---
v1 -> v2:
Remove flush_delayed_work() from intel_gt_fini_requests() to fix a
deadlock. retire_work_handler requeues itself, so flush_delayed_work()
followed by cancel_delayed_work_sync() races and can deadlock.
cancel_delayed_work_sync() alone is sufficient: it prevents requeueing
and waits for running work.

Drop the wakeref guard from intel_context_exit_engine(). Skipping
intel_engine_pm_put() when the wakeref count is already zero masks the
symptom rather than fixing the root cause, and would silently hide any
future stale request scenarios through the same path, making them harder
to diagnose.
---
 drivers/gpu/drm/i915/gt/intel_gt_requests.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_requests.c b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
index 93298820bee2..99a58951c40a 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_requests.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
@@ -230,6 +230,8 @@ void intel_gt_unpark_requests(struct intel_gt *gt)
 
 void intel_gt_fini_requests(struct intel_gt *gt)
 {
+	intel_gt_retire_requests(gt);
+
 	/* Wait until the work is marked as finished before unloading! */
 	cancel_delayed_work_sync(&gt->requests.retire_work);
 
-- 
2.53.0
