Re: [PATCH] drm/sched: run free_job work on timeout workqueue

Matthew Brost Wed, 10 Dec 2025 09:08:55 -0800

On Wed, Dec 10, 2025 at 02:06:15PM +0100, Philipp Stanner wrote:
> On Wed, 2025-12-10 at 13:47 +0100, Christian König wrote:
> > On 12/10/25 10:58, Philipp Stanner wrote:
> > > On Tue, 2025-12-09 at 17:58 -0800, Matthew Brost wrote:
> > > > On Tue, Dec 09, 2025 at 03:19:40PM +0100, Christian König wrote:
> > ..
> > > > > > > My educated guess is that drm_sched_stop() inserted the job back 
> > > > > > > into the pending list, but I still have no idea how it is 
> > > > > > > possible that free_job is running after the scheduler is stopped.
> > > > 
> > > > I believe I found your problem, referencing amdgpu/amdgpu_device.c here.
> > > > 
> > > > 6718                 if (job)
> > > > 6719                         ti = amdgpu_vm_get_task_info_pasid(adev, 
> > > > job->pasid);
> > 
> > WTF! There it is! Thanks a lot for pointing that out!


No problem.

> 
> scripts/decode_stacktrace.sh should be able to find the exact location
> of a UAF. Requires manual debugging with the kernel build tree at
> hands, though. So that's difficult in CIs.
> 

I wasn't aware of this. Usually I do below after rebuilding kernel with
debug symbols.

gdb <offending_file.o>
list *(<last stack trace line>)

Here I just saw 'last stack trace line' was at the very end of
amdgpu_device_gpu_recover and happend to spot this.

> > 
> > > > 
> > > > Which is after:
> > > > 
> > > > 6695 skip_hw_reset:
> > > > 6696         r = amdgpu_device_sched_resume(&device_list, 
> > > > reset_context, job_signaled);
> > > > 6697         if (r)
> > > > 6698                 goto reset_unlock;
> > > > 
> > > > The job is likely added back into this free list here:
> > > > 
> > > > 6676         amdgpu_device_halt_activities(adev, job, reset_context, 
> > > > &device_list,
> > > > 6677                                       hive, 
> > > > need_emergency_restart);
> > > > 
> > > > So free_job runs and 'job->pasid' explodes.
> > 
> > I've read over that code like a hundred times and didn't realized that the 
> > job is accessed after the scheduler resumes.
> > 
> > > > 
> > > > Save off the pasid on the stack at top of this function and I suspect
> > > > your UAF goes away. This won't untangle this hair ball of code but I
> > > > believe this at least prevent explosions.
> > > > 
> > > > But let’s dig in further—amdgpu_device_halt_activities calls
> > > > drm_sched_stop (Xe just calls drm_sched_wqueue_stop for reference). This
> > > > function stops the work items, then adds the offending job back to the
> > > > pending list, iterates over each job, removes the CB, leaving the job in
> > > > the pending list. If the CB can be removed, it removes the job from
> > > > pending, maybe calls free_job if it’s not a guilty job, and if it is a
> > > > guilty job, defers the free_job to the timed-out job so it doesn’t
> > > > disappear. Like WTF?
> > > > 
> > > > Oh, it gets better—amdgpu_device_sched_resume calls drm_sched_start,
> > > > which iterates over the pending list and reinserts the same CB that
> > > > drm_sched_stop removed, then starts the scheduler. So the guilty job had
> > > > its CB successfully removed, and now it can immediately disappear—also
> > > > like WTF?
> > > > 
> > > > Free_guilty is clearly a hack around the job not being reference
> > > > counted, and it doesn’t even work in some cases. Putting that
> > > > aside, I think calling free_job shouldn’t really ever happen in TDR.
> > > > Imagine if drm_sched_job_timedout just took a reference to the job like
> > > > normal kernel code—free_guilty could be dropped, and immediately this
> > > > all becomes safe. Likewise, if the run_job work item had a reference to
> > > > the job, which it takes before adding to the pending list and drops
> > > > after it’s done touching it in this function, then run_job and free_job
> > > > work items could safely execute in parallel rather than relying on an
> > > > ordered workqueue to keep that part of the code safe.
> > > 
> > > I can tell you how I design it in our Rust jobqueue:
> > > Drivers create jobs, and in submit_job() the pass ownership over the
> > > job to the jobqueue – IOW after pushing a job, a driver can't access it
> > > anymore. In the run_job() callback, the jobqueue either passes the job
> > > back by value (ownership) or borrows the job to the driver so that it
> > > can be copied (this is done so that the JQ can hypothetically do
> > > resubmits).
> > > 
> > > This way there is no need for refcounting (in Rust / jobqueue).
> > > 

See below. If you need to resubmit for any reason, where will the
information for resubmission be stored? Likewise, if you want to drop
additional references on fence signaling, how are you going to retrieve
that?

> > > Maybe the core of the problem is not so much the lack of refcounting,
> > > but the lack of ownership rules. Why even would the driver need the job
> > > still after it got pushed? It should be fire-and-forget.
> > 
> > Yeah, that sounds sane to me as well and is exactly how it was initially 
> > designed in the drm_scheduler as well.
> > 
> > The job is basically just the information the driver needs for the 
> > submission which it gives to the scheduler on push, and the scheduler gives 
> > back to the driver on pop.
> > 
> > The full execution time is represented by the scheduler fence and not the 
> > job. And the scheduler fence is reference counted exactly because of the 
> > reasons Mathew brought up here.
> 
> Would be interesting to hear where Xe would still need the job. If only
> the backend_ops give a driver access to a job again after it got
> pushed, then it should be safe.
> 

Xe needs a subset of the job after submission to handle tasks like
resubmission after a device reset. It’s questionable whether we need
this, as it shouldn’t happen in practice—only individual queues should
fail with a working KMD and hardware. It likely doesn’t work anyway if
queues have interdependencies. This is really an opportunistic approach.

However, we absolutely need this for VF migration resubmission. Again,
this requires only a very small subset of driver job information. I
believe it’s just the starting point in the ring, batch address(es), and
a pointer to the driver-side queue object.

We also build a reference-counting model around jobs, where the final
put releases other objects and runtime power management references. This
assumes that the job’s final put means the scheduler fence is signaled.
Again, this is really just a small subset of information we need here.

So if we add hooks to store the subset of information Xe needs for
everything above in the scheduler fence and a non-IRQ, pausable callback
(i.e., won’t execute when the scheduler is stopped, like free_job), this
could be made to work. We really don’t need about 90% of the information
in the job and certainly nothing in the base object.

This would be major surgery, though. I suspect most drivers have a
subset of information in the job that needs to stick around until it
signals, so this means surgery across 11 drivers.

> > 
> > I'm absolutely not against reference counting, what I'm pushing back is 
> > abusing the job object as something it was never designed for while we 
> > already have an object which implements exactly the needed functionality.

Well, oops. Having free_job called after the fence is signaled is how I
arrived at the implementation in Xe. I agree that if we can move driver
info into the scheduler fence, this could work for likely everyone.

> > > > > > And my uneducated guess is that it's happening in amdgpu. It seems a
> > > > > > sched_job lives inside an amdgpu_job. Can the latter be freed at 
> > > > > > other
> > > > > > places than free_job()?
> > > > > 
> > > > > > > 
> > > 
> > > […]
> > > 
> > > > > > > It basically says to the driver that the job lifetime problems 
> > > > > > > created by the scheduler is the driver problem and need to be 
> > > > > > > worked around there.
> > > > > > > 
> > > > > > 
> > > > > > My POV still mostly is that (with the current design) the driver 
> > > > > > must
> > > > > > not use jobs after free_job() was invoked. And when that happens is
> > > > > > unpredictable.
> > > > > > 
> > > > 
> > > > This is somewhat of an absurd statement from my point of view. I have a
> > > > valid job pointer, then I call another function (see above for an
> > > > example of how drm_sched_start/stop is unsafe) and it disappears behind
> > > > my back.
> > > > 
> > > 
> > > The statement is absurd because reality (the code) is absurd. We all
> > > are basically Alice in Wonderland, running as fast as we can just to
> > > remain on the same spot ^_^
> > > 
> > > What I am stating is not that this is *good*, but this is what it
> > > currently is like. Whether we like it or not.
> > > 
> > > The misunderstanding you and I might have is that for me jobs having to
> > > be refcounted is not a reality until it's reflected in code,
> > > documentation and, ideally, drivers.
> > > 

I agree this part is badly misdesigned. In the timedout job callback,
you’re handed a job, and if you perform certain actions, it can just
disappear— even all the way back to the caller of timedout_job. That’s
not great. Then we have this free_guilty mechanism to avoid it
disappearing, but sometimes it still does, which is horrible.

> > > >  The safe way to handle this is to take a local reference before
> > > > doing anything that could make it disappear. That is much more
> > > > reasonable than saying, “we have five things you can do in the
> > > > scheduler, and if you do any of them it isn’t safe to touch the job
> > > > afterward.”
> > > 
> > > Yeah, but that's drm_sched being drm_scheddy. Before I documented it
> > > there were also these implicit refcounting rules in run_job(), where
> > > the driver needs to take the reference for the scheduler for it to be
> > > non-racy.
> > > 

Yes, agreed. This is my fault for not being more responsible in fixing
issues rather than just backing away from these really scary parts of
the code (e.g., drm_sched_stop, drm_sched_start,
drm_sched_resubmit_jobs, etc.) and doing something sane in Xe by using
only a subset of the scheduler.

> > > It also wasn't documented for a long time that drm_sched (through
> > > spsc_queue) will explode if you don't use entities with a single
> > > producer thread.
> > 
> > That is actually documented, but not on the scheduler but rather the 
> > dma_fence.
> > 

spsc - "Single producer, Single consumer". So it is in the name.

> > And that you can only have a single producer is a requirement inherited 
> > from the dma_fence and not scheduler specific at all.
> 
> What does dma_fence have to do with it? It's about the spsc_queue being
> racy like mad. You can access and modify dma_fence's in parallel
> however you want – they are refcounted and locked.
> 
> 
> P.
> 
> > 
> > > drm_sched got here because of gross design mistakes, lots of hacks for
> > > few drivers, and, particularly, a strange aversion¹ against writing
> > > documentation. If Xe came, back in the day, to the conclusion that job
> > > lifetimes are fundamentally broken and that the objectively correct way
> > > to solve this is refcounting, then why wasn't that pushed into
> > > drm_sched back then?
> > > 
> > > > 
> > > > > > To be unfair, we already have strange refcount expectations already.
> > > > > > But I sort of agree that there is no objectively good solution in
> > > > > > sight.
> > > > > > 
> > > > > > > > 
> > > > > > > > Let me respin to my documentation series and upstream that 
> > > > > > > > soonish,
> > > > > > > > than we can build on top of that.
> > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > P.
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 2. Move reference counting to the base DRM scheduler job 
> > > > > > > > > object, provide a
> > > > > > > > >    vfunc for the final job put, and document usage. Medium 
> > > > > > > > > invasive.
> > > > > > > 
> > > > > > > I strongly think that reference counting the job object just 
> > > > > > > because the scheduler needs it is a bad idea.
> > > > > > > 
> > > > > > > With that we are just moving the hot potato from one side of our 
> > > > > > > mouth to the other without really solving the issue.
> > > > > > > 
> > > > 
> > > > See above—I can’t say I agree with this assessment. I think the lack of
> > > > reference counting is exactly the problem. I don’t really understand the
> > > > pushback on a very well-understood concept (reference counting) in
> > > > Linux. I would sign up to fix the entire subsystem if we go this route.
> > > > 
> > > > > > > If a driver like XE needs that for some reason than that is 
> > > > > > > perfectly fine.
> > > > > > 
> > > > > > Nouveau doesn't need it either.
> > > > > > 
> > > > > > > 
> > > > > > > > > 3. Move job (driver) side tracking to the scheduler fence and 
> > > > > > > > > let it
> > > > > > > > >    control the lifetime. Very invasive.
> > > > > > > 
> > > > > > > Thinking about it more that is actually not so much of a problem.
> > > > > > > 
> > > > > > > Let me try to code something together by the end of next week or 
> > > > > > > so.
> > > > > > 
> > > > > > The hero Gotham needs :)
> > > > > > 
> > > > 
> > > > Are you sure about this one? I think unless the problems around
> > > > drm_sched_start/stop and free_guilty are fixed, my feeling is this
> > > > entire thing is still badly broken for anyone who wants to use those.
> > > > 
> > > > To sum up this whole email: I strongly disagree with option #3, but if
> > > > that is the consensus, I will, of course, support the effort.
> > > 
> > > 
> > > I would like to discuss those topics with Danilo, too, who returns from
> > > LPC soonish. Also to get some more insights into Nouveau and our use-
> > > cases.
> > > 
> > > My suggestion is that we pick the conversation up again soonish.
> > > Christmas is around the corner, and I suppose we can't fix this all up
> > > in 2025 anyways, so we might want to give it a fresh re-spin in '26.
> > 
> > Since we finally found the root cause I'm all in postponing that till next 
> > year.
> > 

Ok, glad we found the root cause. I’m not as opposed to option #3 as
stated—this was a bit of angry typing—but if we go in that direction, we
really need clear rules, for example:

- A job cannot be referenced driver-side after the initial
  drm_sched_entity_push_job call, aside from a single run_job callback.
  Maybe run_job is actually replaced by a scheduler fence input?
- Drivers can attach information to the scheduler fence and control its
  lifetime.
- Drivers can iterate over pending scheduler fences when the scheduler
  is stopped.
- Drivers receive a pausable callback in a non-IRQ context when the
  scheduler fence signals.

etc...

Again, this is a pretty major change. I personally wouldn’t feel
comfortable hacking 11 drivers—10 of which aren’t mine—to do something
like this. Refcounting the job would be less invasive and would make the
existing hairball of code safe.

Matt

> > Christian.
> > 
> > > 
> > > 
> > > Greetings,
> > > P.
> > > 
> > > 
> > > 
> > > [¹] The strangest aversion in drm_sched, however, is the one against
> > > locking. Locks were only taken when *absolutely* necessary. It's as if
> > > the entire code was designed by some gamers who want to show their
> > > youtube subscribers that they can get 1fps more by changing RAM timings
> > > in the bios.
> > > drm_sched might be the component in the kernel using the most
> > > synchronization mechanisms: Spinlocks, RCU, atomic integers, atomic
> > > instructions, refcounting, READ_ONCE(), just accessing locklessly… even
> > > Paul McKenney would get wide eyes here. The only thing we're still
> > > missing is seqlocks and rw_locks, but maybe we can add those /s
> > > 
> > > That's likely sth we can't get repaired at all anymore.
> > 
>

Re: [PATCH] drm/sched: run free_job work on timeout workqueue

Reply via email to