On Fri, Dec 05, 2025 at 11:18:21AM +0100, Christian König wrote:
> On 12/1/25 10:04, Philipp Stanner wrote:
> > +Cc Matthew, Danilo, who are also drm/sched maintainers
> >
> > The get_maintainer script provides you with a list of maintainers.
> >
> > On Mon, 2025-12-01 at 09:38 +0100, Christian König wrote:
> >> On 11/28/25 19:22, [email protected] wrote:
> >>> From: Vitaly Prosyak <[email protected]>
> >>>
> >>> Currently drm_sched runs run_job and free_job on the per-scheduler
> >>> ordered submit workqueue, while timeouts (drm_sched_job_timedout())
> >>> run on sched->timeout_wq (e.g. amdgpu reset_domain->wq).
> >>>
> >>> For drivers like amdgpu the reset path entered from timeout_wq may
> >>> still access the guilty job while free_job, running on submit_wq,
> >>> frees it. This allows a use-after-free when recovery continues to
> >>> touch job fields after drm_sched_free_job_work() has called
> >>> ops->free_job().
> >>>
> >>> Queue work_free_job on sched->timeout_wq instead of submit_wq, both
> >>> from __drm_sched_run_free_queue() and drm_sched_wqueue_start(), so
> >>>
> >>> timeout/reset and free_job are always serialized on the same
> >>> workqueue.
> >>>
> >>> Behavior changes:
> >>>
> >>> - work_run_job stays on sched->submit_wq (ordered).
> >>> - work_free_job moves to sched->timeout_wq (timeout/reset queue).
> >>> - Submission and freeing may now run in parallel on different
> >>> workqueues, but all shared state is already protected by
> >>> job_list_lock and atomics.
> >>>
> >>> Pros:
> >>> - Eliminates reset vs free_job race (no freeing while reset still
> >>> uses the job).
> >
> > It eliminates such a race *in amdgpu*. Other drivers might not have
> > that problem. I think Intel/Xe is refcounting jobs? Matthew?
> >
We schedule device resets on the same work queue as the job timeouts, as
well as operations like VF migrations, since conceptually those are quite
similar to device resets. In both cases we are able to stop all schedulers
on the device and prevent any new jobs from being scheduled while the
operation is ongoing. Putting it all together: while such a device-wide
operation is running, we know that no job timeouts will fire and no
scheduler operations (run_job, free_job, etc.) will race with it. I'd
suggest that other drivers which need device-wide operations follow this
approach, as it has worked quite well for us. In other words, even if
free_job is moved to the timeout work queue, I'm fairly certain you would
still schedule device-wide operations on the timeout work queue and still
stop all schedulers before any device operation touches scheduler or queue
state. (A rough sketch of this pattern is further down in this mail.)
So I don't believe Xe actually has a problem here.

> >>> - Matches the logical model: timeout selects guilty job and recovery,
> >>> including freeing, is handled on one queue.
> >>>
> >>> Cons / considerations:
> >>> - For users that don’t provide timeout_wq, free_job moves from the
> >>> per-sched ordered queue to system_wq, which slightly changes

s/system_wq/system_percpu_wq

Ok, system_percpu_wq doesn't actually work for timeout_wq either, since
system_percpu_wq is not reclaim safe. We should probably just remove that
fallback from the scheduler.

> >>> scheduling behaviour but keeps correctness.
> >>
> >> We should probably avoid that and use a single ordered wq for submit,
> >> timeout, free when the driver doesn't provide one.

Ah, yes, agreed. I typed the same thing above before reading this.

> >
> > AFAICS this fix doesn't fix anything for certain at all, because you
> > just don't know what kind of workqueue the other drivers have passed
> > for timeout_wq.
> >
> >>
> >> We should potentially also add a warning/error when the driver supplied wq
> >> isn't ordered.
> >>
> >> Apart from that the change looks sane to me and avoid all the hacky
> >> workarounds around job lifetime.

A step further: the work queue also needs to be reclaim safe. I started a
series [1] to enforce this policy but inevitably got distracted by other
priorities. Eventually I'll revive it, or if someone else wants to take it
on, feel free.

[1] https://patchwork.freedesktop.org/series/156284/

> >
> > I'm not convinced that this is the right path forward.
> >
> > The fundamental issue here is the gross design problem within drm/sched
> > that drivers *create* jobs, but the scheduler *frees* them at an
> > unpredictable point in time.
>
> Yeah, can't agree more!
>

The scheduler doesn't actually free jobs; it simply drops a reference.
free_job should be renamed, in my opinion, to reflect that. Once that
misnomer is fixed, the design makes sense: the scheduler takes ownership of
a reference when the job is pushed to it, and it holds that reference until
the job's fence signals and the job has been removed from the scheduler's
internal tracking. Transferring ownership via reference counting is quite
common and well understood. (A sketch of what I mean is below.)

> > This entire fix idea seems to circle around the concept of relying yet
> > again on the scheduler's internal behavior (i.e., when it schedules the
> > call to free_job()).
> >
> > I think we discussed that at XDC, but how I see it if drivers have
> > strange job life time requirements where a job shall outlive
> > drm_sched's free_job() call, they must solve that with a proper
> > synchronization mechanism.
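
To make the refcounting point above concrete, here is a rough sketch of
what I mean by ownership transfer. Hypothetical driver names (my_job,
my_job_put, etc.), not actual Xe code; only the drm_sched calls are real:

#include <drm/gpu_scheduler.h>
#include <linux/kref.h>
#include <linux/slab.h>

/* Driver job: a drm_sched_job plus a refcount owned by whoever needs it. */
struct my_job {
	struct drm_sched_job base;
	struct kref refcount;
	/* driver state the timeout/reset path may still want to look at */
};

static void my_job_destroy(struct kref *kref)
{
	struct my_job *job = container_of(kref, struct my_job, refcount);

	drm_sched_job_cleanup(&job->base);
	kfree(job);
}

static void my_job_put(struct my_job *job)
{
	kref_put(&job->refcount, my_job_destroy);
}

/* Pushing the job hands one reference to the scheduler. */
static void my_job_push(struct my_job *job)
{
	/* kref_init() was done at allocation; this get is the scheduler's ref */
	kref_get(&job->refcount);
	drm_sched_entity_push_job(&job->base);
}

/* ->free_job() is just the scheduler dropping the reference it owns. */
static void my_sched_free_job(struct drm_sched_job *sched_job)
{
	struct my_job *job = container_of(sched_job, struct my_job, base);

	my_job_put(job);
}

Any path that needs the job beyond that point (timeout handler, reset
worker, whatever) takes its own reference, and then it no longer matters
on which workqueue free_job happens to run.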
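
And to illustrate the device-wide-operation pattern I described at the top
of this mail, a hedged sketch, again with made-up driver names rather than
a copy of the Xe code. The idea: one ordered, reclaim-safe workqueue per
device, passed to drm_sched as timeout_wq and also used for resets / VF
migrations, which park every scheduler before touching device state:

#include <drm/gpu_scheduler.h>
#include <linux/workqueue.h>

#define MY_NUM_SCHEDS 8	/* however many queues the device exposes */

struct my_device {
	/* alloc_ordered_workqueue("my-ordered-wq", WQ_MEM_RECLAIM) */
	struct workqueue_struct *ordered_wq;
	struct work_struct reset_work;
	struct drm_gpu_scheduler sched[MY_NUM_SCHEDS];
};

static void my_device_reset_work(struct work_struct *w)
{
	struct my_device *mdev = container_of(w, struct my_device, reset_work);
	int i;

	/*
	 * Because this runs on the same ordered wq as the job timeouts,
	 * no timedout_job() callback can execute concurrently with it.
	 */

	/* Stop all schedulers so run_job/free_job can't race the reset. */
	for (i = 0; i < MY_NUM_SCHEDS; i++)
		drm_sched_wqueue_stop(&mdev->sched[i]);

	/* ... perform the actual device reset / VF migration here ... */

	/* Let the schedulers run again once the device is back. */
	for (i = 0; i < MY_NUM_SCHEDS; i++)
		drm_sched_wqueue_start(&mdev->sched[i]);
}

static void my_device_schedule_reset(struct my_device *mdev)
{
	/* Queueing on the ordered wq is the only serialization needed. */
	queue_work(mdev->ordered_wq, &mdev->reset_work);
}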
>
> Well that is not correct as far as I can see.
>
> The problem here is rather that the scheduler gives the job as parameter to
> the timedout_job() callback, but doesn't guarantee that ->free_job() callback
> isn't called while timedout_job() runs.
>
> This should be prevented by removing the job in question from the pending
> list (see drm_sched_job_timedout), but for some reason that doesn't seem to
> work correctly.
>

Are you sure this is actually happening? It doesn't seem possible, nor
have I ever observed it. What does look like a problem is that
drm_sched_job_timedout() itself can call free_job: look at [2]. If you're
using free_guilty (Xe isn't, but [2] was Xe trying to do the same thing),
that path is unsafe. The free_guilty code should probably be removed, as
it definitely can explode under the right conditions. (I've put a
condensed sketch of that path at the very bottom of this mail.)

[2] git format-patch -1 ea2f6a77d0c40

> > The first question would be: what does amdgpu need the job for after
> > free_job() ran? What do you even need a job for still after there was a
> > timeout?
>
> No, we just need the job structure alive as long as the timedout_job()
> callback is running.
>

Yes, I agree.

> > And if you really still need its contents, can't you memcpy() the job
> > or something?
> >
> > Assuming that it really needs it and that that cannot easily be solved,
> > I suppose the obvious answer for differing memory life times is
> > refcounting. So amdgpu could just let drm_sched drop its reference in
> > free_job(), and from then onward it's amdgpu's problem.
> >
> > I hope Matthew can educate us on how Xe does it.
>
> We discussed this on XDC and it was Matthew who brought up that this can be
> solved by running timeout and free worker on the same single threaded wq.
>

No, see my explanation above; this is not what I suggested.

> >
> > AFAIK Nouveau doesn't have that problem, because on timeout we just
> > terminate the channel.
> >
> > Would also be interesting to hear whether other driver folks have the
> > problem of free_job() being racy.
>
> I think that this is still a general problem with the drm scheduler and not
> driver specific at all.
>

free_guilty may well be a scheduler problem, but I'm not seeing an issue
beyond that.

Matt

> Regards,
> Christian.
> >
> > +Cc Boris, Lucas.
> >
> > P.
> >
> >
> >>
> >> But removing those workarounds is should probably be a second step.
> >>
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> Cc: Philipp Stanner <[email protected]>
> >>> Cc: Alex Deucher <[email protected]>
> >>> Cc: Christian König <[email protected]>
> >>> Suggested-by: Mathew from Intel during XDC
> >>> Suggested-by: Christian König <[email protected]>
> >>> Signed-off-by: Vitaly Prosyak <[email protected]>
> >>> ---
> >>>  drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
> >>>  1 file changed, 2 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> >>> b/drivers/gpu/drm/scheduler/sched_main.c
> >>> index 81ad40d9582b..1243200d475e 100644
> >>> --- a/drivers/gpu/drm/scheduler/sched_main.c
> >>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> >>> @@ -355,7 +355,7 @@ static void drm_sched_run_job_queue(struct
> >>> drm_gpu_scheduler *sched)
> >>>  static void __drm_sched_run_free_queue(struct drm_gpu_scheduler *sched)
> >>>  {
> >>>  	if (!READ_ONCE(sched->pause_submit))
> >>> -		queue_work(sched->submit_wq, &sched->work_free_job);
> >>> +		queue_work(sched->timeout_wq, &sched->work_free_job);
> >>>  }
> >>>
> >>>  /**
> >>> @@ -1493,6 +1493,6 @@ void drm_sched_wqueue_start(struct
> >>> drm_gpu_scheduler *sched)
> >>>  {
> >>>  	WRITE_ONCE(sched->pause_submit, false);
> >>>  	queue_work(sched->submit_wq, &sched->work_run_job);
> >>> -	queue_work(sched->submit_wq, &sched->work_free_job);
> >>> +	queue_work(sched->timeout_wq, &sched->work_free_job);
> >>>  }
> >>>  EXPORT_SYMBOL(drm_sched_wqueue_start);
> >>
> >
>
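
P.S. To spell out the free_guilty concern from above: condensed from
memory, not a verbatim copy of sched_main.c, the timeout path looks
roughly like this, and the ops->free_job() call near the end is the part
that can explode if timedout_job() handed the job to a reset worker that
is still using it:

static void drm_sched_job_timedout(struct work_struct *work)
{
	struct drm_gpu_scheduler *sched =
		container_of(work, struct drm_gpu_scheduler, work_tdr.work);
	struct drm_sched_job *job;

	spin_lock(&sched->job_list_lock);
	job = list_first_entry_or_null(&sched->pending_list,
				       struct drm_sched_job, list);
	if (job) {
		/*
		 * Removing the job from pending_list is what protects it
		 * from the regular free worker while timedout_job() runs.
		 */
		list_del_init(&job->list);
		spin_unlock(&sched->job_list_lock);

		job->sched->ops->timedout_job(job);

		/*
		 * But if free_guilty got set (guilty job already completed,
		 * see drm_sched_stop()), the job is freed right here, even
		 * though whatever timedout_job() kicked off may still hold
		 * a pointer to it.
		 */
		if (sched->free_guilty) {
			job->sched->ops->free_job(job);
			sched->free_guilty = false;
		}
	} else {
		spin_unlock(&sched->job_list_lock);
	}

	/* ... re-arm the timeout, etc. ... */
}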
