guc_exec_queue_timedout_job() unconditionally bans the queue once a
job times out. For the kernel migration queue this is fatal: once it
is banned, no page table migrations can complete and the GPU is
effectively dead until the driver is reloaded.

The submission is already stopped and the timed-out job is erred out,
so banning is not needed for correctness. GT reset handles the actual
hardware recovery. Skip banning for kernel queues so they remain
available after reset.

Fixes: bb63e7257e63 ("drm/xe: Avoid toggling schedule state to check LRC timestamp in TDR")
Cc: Matthew Brost <[email protected]>
Cc: Thomas Hellström <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Assisted-by: Claude:claude-opus-4.6
Suggested-by: Himal Prasad Ghimiray <[email protected]>
Signed-off-by: Sanjay Yadav <[email protected]>
---
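Note (not part of the commit message): the intended behaviour can be
illustrated with a minimal standalone sketch. It is a simplified model,
not the driver code. The struct, the model_* helpers and the flag bit
value are hypothetical and only stand in for the real xe exec queue
state; the assumption is that a banned queue accepts no further work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit value; the real EXEC_QUEUE_FLAG_KERNEL may differ. */
#define MODEL_QUEUE_FLAG_KERNEL (1u << 0)

/* Simplified stand-in for the exec queue state touched by the TDR. */
struct model_queue {
	unsigned int flags;
	bool banned;		/* assumed: a banned queue takes no new work */
	bool job_errored;	/* the timed-out job was errored out */
};

/* Models the timeout path: always error the job, ban only user queues. */
static void model_timedout_job(struct model_queue *q)
{
	assert(q != NULL);

	q->job_errored = true;

	if (!(q->flags & MODEL_QUEUE_FLAG_KERNEL))
		q->banned = true;
}

/* A queue stays usable after the timeout only if it was not banned. */
static bool model_queue_usable(const struct model_queue *q)
{
	assert(q != NULL);

	return !q->banned;
}
```

In this model a queue created with MODEL_QUEUE_FLAG_KERNEL is still
usable after model_timedout_job(), while a queue without the flag is
banned. In both cases the timed-out job is errored out.
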
 drivers/gpu/drm/xe/xe_guc_submit.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index ab501513d806..e6ad57cbbf0e 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -1543,7 +1543,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
        if (!exec_queue_killed(q))
                wedged = guc_submit_hint_wedged(exec_queue_to_guc(q));
 
-       set_exec_queue_banned(q);
+       if (!(q->flags & EXEC_QUEUE_FLAG_KERNEL))
+               set_exec_queue_banned(q);
 
        /* Kick job / queue off hardware */
        if (!wedged && (exec_queue_enabled(primary) ||
-- 
2.52.0
