On Wed, 2025-12-10 at 13:47 +0100, Christian König wrote:
> On 12/10/25 10:58, Philipp Stanner wrote:
> > On Tue, 2025-12-09 at 17:58 -0800, Matthew Brost wrote:
> > > On Tue, Dec 09, 2025 at 03:19:40PM +0100, Christian König wrote:
> ..
> > > > > > My educated guess is that drm_sched_stop() inserted the job back
> > > > > > into the pending list, but I still have no idea how it is possible
> > > > > > that free_job is running after the scheduler is stopped.
> > >
> > > I believe I found your problem, referencing amdgpu/amdgpu_device.c here.
> > >
> > > 6718                 if (job)
> > > 6719                         ti = amdgpu_vm_get_task_info_pasid(adev,
> > >                                                                 job->pasid);
>
> WTF! There it is! Thanks a lot for pointing that out!
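Condensed, the ordering that explodes looks roughly like this (a paraphrase of
the snippets quoted in this thread, not the literal amdgpu code; locals, error
handling and most of the reset logic are omitted, and the signature is only
approximate):

    int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
                                  struct amdgpu_job *job,
                                  struct amdgpu_reset_context *reset_context)
    {
            /* Stops the schedulers; the bad job can end up back on the
             * pending list here. */
            amdgpu_device_halt_activities(adev, job, reset_context,
                                          &device_list, hive,
                                          need_emergency_restart);

    skip_hw_reset:
            /* Restarts the schedulers; from this point on free_job can run
             * and free 'job' at any time. */
            r = amdgpu_device_sched_resume(&device_list, reset_context,
                                           job_signaled);

            if (job)
                    /* Potential use-after-free: 'job' may already be gone. */
                    ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);
    }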
scripts/decode_stacktrace.sh should be able to find the exact location of a
UAF. Requires manual debugging with the kernel build tree at hand, though. So
that's difficult in CIs.

> > >
> > > Which is after:
> > >
> > > 6695 skip_hw_reset:
> > > 6696         r = amdgpu_device_sched_resume(&device_list, reset_context,
> > >                                             job_signaled);
> > > 6697         if (r)
> > > 6698                 goto reset_unlock;
> > >
> > > The job is likely added back into this free list here:
> > >
> > > 6676         amdgpu_device_halt_activities(adev, job, reset_context,
> > >                                            &device_list,
> > > 6677                                        hive, need_emergency_restart);
> > >
> > > So free_job runs and 'job->pasid' explodes.
>
> I've read over that code like a hundred times and didn't realize that the
> job is accessed after the scheduler resumes.
>
> > > Save off the pasid on the stack at the top of this function and I
> > > suspect your UAF goes away. This won't untangle this hairball of code,
> > > but I believe it at least prevents explosions.
> > >
> > > But let's dig in further: amdgpu_device_halt_activities calls
> > > drm_sched_stop (Xe just calls drm_sched_wqueue_stop, for reference).
> > > This function stops the work items, then adds the offending job back to
> > > the pending list and iterates over each job, trying to remove the CB.
> > > If the CB can't be removed, the job is left in the pending list. If the
> > > CB can be removed, it removes the job from pending and maybe calls
> > > free_job if it's not a guilty job; if it is a guilty job, it defers the
> > > free_job to the timeout handler so the job doesn't disappear. Like WTF?
> > >
> > > Oh, it gets better: amdgpu_device_sched_resume calls drm_sched_start,
> > > which iterates over the pending list and reinserts the same CB that
> > > drm_sched_stop removed, then starts the scheduler. So the guilty job
> > > had its CB successfully removed, and now it can immediately disappear.
> > > Also like WTF?
> > >
> > > free_guilty is clearly a hack around the job not being reference
> > > counted, and it doesn't even work in some cases. Putting that aside, I
> > > think calling free_job shouldn't really ever happen in TDR. Imagine if
> > > drm_sched_job_timedout just took a reference to the job like normal
> > > kernel code: free_guilty could be dropped, and immediately this all
> > > becomes safe. Likewise, if the run_job work item had a reference to the
> > > job, which it takes before adding to the pending list and drops after
> > > it's done touching it in this function, then the run_job and free_job
> > > work items could safely execute in parallel rather than relying on an
> > > ordered workqueue to keep that part of the code safe.
> >
> > I can tell you how I design it in our Rust jobqueue:
> > Drivers create jobs, and in submit_job() they pass ownership over the
> > job to the jobqueue – IOW, after pushing a job, a driver can't access it
> > anymore. In the run_job() callback, the jobqueue either passes the job
> > back by value (ownership) or borrows the job to the driver so that it
> > can be copied (this is done so that the JQ can hypothetically do
> > resubmits).
> >
> > This way there is no need for refcounting (in Rust / jobqueue).
> >
> > Maybe the core of the problem is not so much the lack of refcounting,
> > but the lack of ownership rules. Why would the driver even need the job
> > after it got pushed? It should be fire-and-forget.
>
> Yeah, that sounds sane to me as well and is exactly how it was initially
> designed in the drm_scheduler as well.
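For the C side, the fire-and-forget contract being described maps to roughly
the following (an untested sketch; 'my_job' and 'my_push' are made-up names,
and the drm_sched_job_init() call is omitted because its argument list differs
between kernel versions):

    struct my_job {
            struct drm_sched_job base;
            /* driver-private submission data lives here */
    };

    /* Assumes drm_sched_job_init() already succeeded for 'job'. */
    static struct dma_fence *my_push(struct my_job *job)
    {
            struct dma_fence *fence;

            drm_sched_job_arm(&job->base);

            /* What callers keep is the refcounted scheduler fence, not the
             * job itself. */
            fence = dma_fence_get(&job->base.s_fence->finished);

            drm_sched_entity_push_job(&job->base);
            /* From here on the job belongs to the scheduler; the driver only
             * sees it again in .run_job() and .free_job(). */

            return fence;
    }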
>
> The job is basically just the information the driver needs for the
> submission, which it gives to the scheduler on push, and the scheduler
> gives back to the driver on pop.
>
> The full execution time is represented by the scheduler fence and not the
> job. And the scheduler fence is reference counted exactly because of the
> reasons Matthew brought up here.

Would be interesting to hear where Xe would still need the job. If only the
backend_ops give a driver access to a job again after it got pushed, then it
should be safe.

>
> I'm absolutely not against reference counting; what I'm pushing back on is
> abusing the job object as something it was never designed for while we
> already have an object which implements exactly the needed functionality.
>
> > > > And my uneducated guess is that it's happening in amdgpu. It seems a
> > > > sched_job lives inside an amdgpu_job. Can the latter be freed at
> > > > other places than free_job()?
> > > >
> > > > […]
> > > >
> > > > > > It basically says to the driver that the job lifetime problems
> > > > > > created by the scheduler are the driver's problem and need to be
> > > > > > worked around there.
> > > > >
> > > > > My POV still mostly is that (with the current design) the driver
> > > > > must not use jobs after free_job() was invoked. And when that
> > > > > happens is unpredictable.
> > >
> > > This is somewhat of an absurd statement from my point of view. I have
> > > a valid job pointer, then I call another function (see above for an
> > > example of how drm_sched_start/stop is unsafe) and it disappears
> > > behind my back.
> >
> > The statement is absurd because reality (the code) is absurd. We all are
> > basically Alice in Wonderland, running as fast as we can just to remain
> > on the same spot ^_^
> >
> > What I am stating is not that this is *good*, but this is what it
> > currently is like. Whether we like it or not.
> >
> > The misunderstanding you and I might have is that for me jobs having to
> > be refcounted is not a reality until it's reflected in code,
> > documentation and, ideally, drivers.
> >
> > > The safe way to handle this is to take a local reference before doing
> > > anything that could make it disappear. That is much more reasonable
> > > than saying, "we have five things you can do in the scheduler, and if
> > > you do any of them it isn't safe to touch the job afterward."
> >
> > Yeah, but that's drm_sched being drm_scheddy. Before I documented it,
> > there were also these implicit refcounting rules in run_job(), where the
> > driver needs to take the reference for the scheduler for it to be
> > non-racy.
> >
> > It also wasn't documented for a long time that drm_sched (through
> > spsc_queue) will explode if you don't use entities with a single
> > producer thread.
>
> That is actually documented, but not on the scheduler but rather the
> dma_fence.
>
> And that you can only have a single producer is a requirement inherited
> from the dma_fence and not scheduler specific at all.

What does dma_fence have to do with it? It's about the spsc_queue being racy
like mad. You can access and modify dma_fences in parallel however you want –
they are refcounted and locked.

P.

> > drm_sched got here because of gross design mistakes, lots of hacks for
> > few drivers, and, particularly, a strange aversion¹ against writing
> > documentation.
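Coming back to Matthew's point above about taking a local reference before
calling anything that can make the job disappear: with the current code, the
refcounted object a driver can legitimately hold across such a call is the
scheduler fence, not the job. A minimal sketch (the function and the logging
are made up for illustration; only the dma_fence / s_fence usage is real API):

    static void my_timeout_helper(struct drm_sched_job *sched_job)
    {
            /* Grab the finished fence before the stop/start dance that may
             * end up freeing the job. */
            struct dma_fence *finished =
                    dma_fence_get(&sched_job->s_fence->finished);

            /* ... reset dance that may free 'sched_job' ... */

            /* The fence is still valid here even if the job is gone. */
            if (dma_fence_is_signaled(finished))
                    pr_debug("job finished during reset\n");

            dma_fence_put(finished);
    }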
> > If Xe came, back in the day, to the conclusion that job lifetimes are
> > fundamentally broken and that the objectively correct way to solve this
> > is refcounting, then why wasn't that pushed into drm_sched back then?
> > > > >
> > > > > To be unfair, we already have strange refcount expectations. But I
> > > > > sort of agree that there is no objectively good solution in sight.
> > > > > > >
> > > > > > > Let me respin my documentation series and upstream that
> > > > > > > soonish, then we can build on top of that.
> > > > > > >
> > > > > > > P.
> > > > > > > >
> > > > > > > > 2. Move reference counting to the base DRM scheduler job
> > > > > > > > object, provide a vfunc for the final job put, and document
> > > > > > > > usage. Medium invasive.
> > > > > >
> > > > > > I strongly think that reference counting the job object just
> > > > > > because the scheduler needs it is a bad idea.
> > > > > >
> > > > > > With that we are just moving the hot potato from one side of our
> > > > > > mouth to the other without really solving the issue.
> > >
> > > See above - I can't say I agree with this assessment. I think the lack
> > > of reference counting is exactly the problem. I don't really understand
> > > the pushback on a very well-understood concept (reference counting) in
> > > Linux. I would sign up to fix the entire subsystem if we go this route.
> > > > > >
> > > > > > If a driver like Xe needs that for some reason then that is
> > > > > > perfectly fine.
> > > > >
> > > > > Nouveau doesn't need it either.
> > > > > > > >
> > > > > > > > 3. Move job (driver) side tracking to the scheduler fence and
> > > > > > > > let it control the lifetime. Very invasive.
> > > > > >
> > > > > > Thinking about it more, that is actually not so much of a problem.
> > > > > >
> > > > > > Let me try to code something together by the end of next week or
> > > > > > so.
> > > > >
> > > > > The hero Gotham needs :)
> > >
> > > Are you sure about this one? I think unless the problems around
> > > drm_sched_start/stop and free_guilty are fixed, my feeling is this
> > > entire thing is still badly broken for anyone who wants to use those.
> > >
> > > To sum up this whole email: I strongly disagree with option #3, but if
> > > that is the consensus, I will, of course, support the effort.
> >
> > I would like to discuss those topics with Danilo, too, who returns from
> > LPC soonish. Also to get some more insights into Nouveau and our use
> > cases.
> >
> > My suggestion is that we pick the conversation up again soonish.
> > Christmas is around the corner, and I suppose we can't fix this all up
> > in 2025 anyways, so we might want to give it a fresh re-spin in '26.
>
> Since we finally found the root cause, I'm all for postponing that till
> next year.
>
> Christian.
>
> > Greetings,
> > P.
> >
> > [¹] The strangest aversion in drm_sched, however, is the one against
> > locking. Locks were only taken when *absolutely* necessary. It's as if
> > the entire code was designed by some gamers who want to show their
> > YouTube subscribers that they can get 1 fps more by changing RAM timings
> > in the BIOS.
> > drm_sched might be the component in the kernel using the most
> > synchronization mechanisms: spinlocks, RCU, atomic integers, atomic
> > instructions, refcounting, READ_ONCE(), just accessing locklessly… even
> > Paul McKenney would get wide eyes here.
> > The only thing we're still missing is seqlocks and rw_locks, but maybe
> > we can add those /s
> >
> > That's likely something we can't get repaired at all anymore.
>
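P.S.: to make option #2 from the quote above a bit more concrete, it would
mean something roughly like the following. This is purely a hypothetical
sketch, not existing drm_sched API; the refcount field, drm_sched_job_get(),
drm_sched_job_put() and drm_sched_job_release() are all made up:

    /* Hypothetical: a refcounted drm_sched_job. The driver's free callback
     * would then be called exactly once, when the last reference is
     * dropped, instead of free_job running behind the driver's back. */
    struct drm_sched_job {
            struct kref refcount;           /* new: job lifetime */
            struct drm_gpu_scheduler *sched;
            /* ... existing members ... */
    };

    static void drm_sched_job_release(struct kref *ref)
    {
            struct drm_sched_job *job =
                    container_of(ref, struct drm_sched_job, refcount);

            job->sched->ops->free_job(job);
    }

    static inline struct drm_sched_job *
    drm_sched_job_get(struct drm_sched_job *job)
    {
            kref_get(&job->refcount);
            return job;
    }

    static inline void drm_sched_job_put(struct drm_sched_job *job)
    {
            kref_put(&job->refcount, drm_sched_job_release);
    }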
