On Wed, Jun 03, 2026 at 01:42:25PM +0100, Matthew Auld wrote:
> On 03/06/2026 13:06, Sanjay Yadav wrote:
> > guc_exec_queue_timedout_job() unconditionally bans the queue once a
> > job times out. For the kernel migration queue this is fatal — once
> > banned, no page table migrations can complete and the GPU is
> > effectively dead until driver reload.
> >
> > The submission is already stopped and the timed-out job is erred out,
> > so banning is not needed for correctness. GT reset handles the actual
> > hardware recovery. Skip banning for kernel queues so they remain
> > available after reset.
>
> Is wedging/reload not the more correct thing here? Kernel job is usually
> performing critical and potentially security sensitive work, like memory
> clearing, migrations, binding etc. If something goes wrong in one of those
> jobs, how should we go about recovering from that? Is driver reload/wedge
> not the more appropriate thing here, or least would need a more elaborate
> recovery?
>
> For example, memclear get nuked, what stops the user from accessing
> uncleared memory later? Or a migration/copy/save/restore/ job gets nuked,
> from correctness pov how do we recover from that?
I agree with Matt here something is off. we cannot blindly skip these
kernel submission cases... (This and the other patch in this series)
>
> >
> > Fixes: bb63e7257e63 ("drm/xe: Avoid toggling schedule state to check LRC
> > timestamp in TDR")
> > Cc: Matthew Brost <[email protected]>
> > Cc: Thomas Hellström <[email protected]>
> > Cc: Rodrigo Vivi <[email protected]>
> > Assisted-by: Claude:claude-opus-4.6
> > Suggested-by: Himal Prasad Ghimiray <[email protected]>
> > Signed-off-by: Sanjay Yadav <[email protected]>
> > ---
> > drivers/gpu/drm/xe/xe_guc_submit.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c
> > b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index ab501513d806..e6ad57cbbf0e 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -1543,7 +1543,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job
> > *drm_job)
> > if (!exec_queue_killed(q))
> > wedged = guc_submit_hint_wedged(exec_queue_to_guc(q));
> > - set_exec_queue_banned(q);
> > + if (!(q->flags & EXEC_QUEUE_FLAG_KERNEL))
> > + set_exec_queue_banned(q);
> > /* Kick job / queue off hardware */
> > if (!wedged && (exec_queue_enabled(primary) ||
>