xe: skip banning kernel migration queue on TDR timeout

Hellstrom, Thomas Wed, 03 Jun 2026 08:13:25 -0700

On Wed, 2026-06-03 at 09:52 -0400, Rodrigo Vivi wrote:
> On Wed, Jun 03, 2026 at 01:42:25PM +0100, Matthew Auld wrote:
> > On 03/06/2026 13:06, Sanjay Yadav wrote:
> > > guc_exec_queue_timedout_job() unconditionally bans the queue once
> > > a
> > > job times out. For the kernel migration queue this is fatal —
> > > once
> > > banned, no page table migrations can complete and the GPU is
> > > effectively dead until driver reload.
> > > 
> > > The submission is already stopped and the timed-out job is erred
> > > out,
> > > so banning is not needed for correctness. GT reset handles the
> > > actual
> > > hardware recovery. Skip banning for kernel queues so they remain
> > > available after reset.
> > 
> > Is wedging/reload not the more correct thing here? Kernel job is
> > usually
> > performing critical and potentially security sensitive work, like
> > memory
> > clearing, migrations, binding etc. If something goes wrong in one
> > of those
> > jobs, how should we go about recovering from that? Is driver
> > reload/wedge
> > not the more appropriate thing here, or least would need a more
> > elaborate
> > recovery?
> > 
> > For example, memclear get nuked, what stops the user from accessing
> > uncleared memory later? Or a migration/copy/save/restore/ job gets
> > nuked,
> > from correctness pov how do we recover from that?
> 
> I agree with Matt here something is off. we cannot blindly skip these
> kernel submission cases... (This and the other patch in this series)


+1

Since I'm not 100% up-to-date with the scheduling I asked Claude to
look for any software-propagated errors or apparent SW bugs that could
cause the migration jobs to fail (assuming that our batches are OK
ofc), and it couldn't find one. It seems like the drm scheduler handles
the dependency timeout case correctly.

If there are recoverable resets that might occasionally spill over to
the migration queue we should figure that out and in that case consider
implementing async SW clearing / migration for the failed jobs (like
i915) to avoid wedging. But I'd rather see us not doing that if it's
not strictly necessary with a proven reproducer.

Thanks,
Thomas


> 
> > 
> > > 
> > > Fixes: bb63e7257e63 ("drm/xe: Avoid toggling schedule state to
> > > check LRC timestamp in TDR")
> > > Cc: Matthew Brost <[email protected]>
> > > Cc: Thomas Hellström <[email protected]>
> > > Cc: Rodrigo Vivi <[email protected]>
> > > Assisted-by: Claude:claude-opus-4.6
> > > Suggested-by: Himal Prasad Ghimiray
> > > <[email protected]>
> > > Signed-off-by: Sanjay Yadav <[email protected]>
> > > ---
> > >   drivers/gpu/drm/xe/xe_guc_submit.c | 3 ++-
> > >   1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c
> > > b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > index ab501513d806..e6ad57cbbf0e 100644
> > > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > @@ -1543,7 +1543,8 @@ guc_exec_queue_timedout_job(struct
> > > drm_sched_job *drm_job)
> > >           if (!exec_queue_killed(q))
> > >                   wedged =
> > > guc_submit_hint_wedged(exec_queue_to_guc(q));
> > > - set_exec_queue_banned(q);
> > > + if (!(q->flags & EXEC_QUEUE_FLAG_KERNEL))
> > > +         set_exec_queue_banned(q);
> > >           /* Kick job / queue off hardware */
> > >           if (!wedged && (exec_queue_enabled(primary) ||
> >

Re: [RFC PATCH 1/3] drm/xe: skip banning kernel migration queue on TDR timeout

Reply via email to