On 03/06/2026 13:06, Sanjay Yadav wrote:
guc_exec_queue_timedout_job() unconditionally bans the queue once a
job times out. For the kernel migration queue this is fatal — once
banned, no page table migrations can complete and the GPU is
effectively dead until driver reload.
The submission is already stopped and the timed-out job is erred out,
so banning is not needed for correctness. GT reset handles the actual
hardware recovery. Skip banning for kernel queues so they remain
available after reset.
Is wedging/reload not the more correct thing here? Kernel job is usually
performing critical and potentially security sensitive work, like memory
clearing, migrations, binding etc. If something goes wrong in one of
those jobs, how should we go about recovering from that? Is driver
reload/wedge not the more appropriate thing here, or least would need a
more elaborate recovery?
For example, memclear get nuked, what stops the user from accessing
uncleared memory later? Or a migration/copy/save/restore/ job gets
nuked, from correctness pov how do we recover from that?
Fixes: bb63e7257e63 ("drm/xe: Avoid toggling schedule state to check LRC timestamp
in TDR")
Cc: Matthew Brost <[email protected]>
Cc: Thomas Hellström <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Assisted-by: Claude:claude-opus-4.6
Suggested-by: Himal Prasad Ghimiray <[email protected]>
Signed-off-by: Sanjay Yadav <[email protected]>
---
drivers/gpu/drm/xe/xe_guc_submit.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c
b/drivers/gpu/drm/xe/xe_guc_submit.c
index ab501513d806..e6ad57cbbf0e 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -1543,7 +1543,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
if (!exec_queue_killed(q))
wedged = guc_submit_hint_wedged(exec_queue_to_guc(q));
- set_exec_queue_banned(q);
+ if (!(q->flags & EXEC_QUEUE_FLAG_KERNEL))
+ set_exec_queue_banned(q);
/* Kick job / queue off hardware */
if (!wedged && (exec_queue_enabled(primary) ||