On Wed, 2025-12-10 at 13:47 +0100, Christian König wrote:
> On 12/10/25 10:58, Philipp Stanner wrote:
> > On Tue, 2025-12-09 at 17:58 -0800, Matthew Brost wrote:
> > > On Tue, Dec 09, 2025 at 03:19:40PM +0100, Christian König wrote:
> ..
> > > > > > My educated guess is that drm_sched_stop() inserted the job back
> > > > > > into the pending list, but I still have no idea how it is possible
> > > > > > that free_job is running after the scheduler is stopped.
> > >
> > > I believe I found your problem, referencing amdgpu/amdgpu_device.c here.
> > >
> > > 6718                 if (job)
> > > 6719                         ti = amdgpu_vm_get_task_info_pasid(adev,
> > >                                                                 job->pasid);
>
> WTF! There it is! Thanks a lot for pointing that out!
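Condensed, the ordering that explodes looks roughly like this (a paraphrase of
the snippets quoted in this thread, not the literal amdgpu code; locals, error
handling and most of the reset logic are omitted, and the signature is only
approximate):

    int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
                                  struct amdgpu_job *job,
                                  struct amdgpu_reset_context *reset_context)
    {
            /* Stops the schedulers; the bad job can end up back on the
             * pending list here. */
            amdgpu_device_halt_activities(adev, job, reset_context,
                                          &device_list, hive,
                                          need_emergency_restart);

    skip_hw_reset:
            /* Restarts the schedulers; from this point on free_job can run
             * and free 'job' at any time. */
            r = amdgpu_device_sched_resume(&device_list, reset_context,
                                           job_signaled);

            if (job)
                    /* Potential use-after-free: 'job' may already be gone. */
                    ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);
    }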
scripts/decode_stacktrace.sh should be able to find the exact location of a
UAF. Requires manual debugging with the kernel build tree at hand, though. So
that's difficult in CIs.

> > >
> > > Which is after:
> > >
> > > 6695 skip_hw_reset:
> > > 6696         r = amdgpu_device_sched_resume(&device_list, reset_context,
> > >                                             job_signaled);
> > > 6697         if (r)
> > > 6698                 goto reset_unlock;
> > >
> > > The job is likely added back into this free list here:
> > >
> > > 6676         amdgpu_device_halt_activities(adev, job, reset_context,
> > >                                            &device_list,
> > > 6677                                        hive, need_emergency_restart);
> > >
> > > So free_job runs and 'job->pasid' explodes.
>
> I've read over that code like a hundred times and didn't realize that the
> job is accessed after the scheduler resumes.
>
> > > Save off the pasid on the stack at the top of this function and I
> > > suspect your UAF goes away. This won't untangle this hairball of code,
> > > but I believe it at least prevents explosions.
> > >
> > > But let's dig in further: amdgpu_device_halt_activities calls
> > > drm_sched_stop (Xe just calls drm_sched_wqueue_stop, for reference).
> > > This function stops the work items, then adds the offending job back to
> > > the pending list and iterates over each job, trying to remove the CB.
> > > If the CB can't be removed, the job is left in the pending list. If the
> > > CB can be removed, it removes the job from pending and maybe calls
> > > free_job if it's not a guilty job; if it is a guilty job, it defers the
> > > free_job to the timeout handler so the job doesn't disappear. Like WTF?
> > >
> > > Oh, it gets better: amdgpu_device_sched_resume calls drm_sched_start,
> > > which iterates over the pending list and reinserts the same CB that
> > > drm_sched_stop removed, then starts the scheduler. So the guilty job
> > > had its CB successfully removed, and now it can immediately disappear.
> > > Also like WTF?
> > >
> > > free_guilty is clearly a hack around the job not being reference
> > > counted, and it doesn't even work in some cases. Putting that aside, I
> > > think calling free_job shouldn't really ever happen in TDR. Imagine if
> > > drm_sched_job_timedout just took a reference to the job like normal
> > > kernel code: free_guilty could be dropped, and immediately this all
> > > becomes safe. Likewise, if the run_job work item had a reference to the
> > > job, which it takes before adding to the pending list and drops after
> > > it's done touching it in this function, then the run_job and free_job
> > > work items could safely execute in parallel rather than relying on an
> > > ordered workqueue to keep that part of the code safe.
> >
> > I can tell you how I design it in our Rust jobqueue:
> > Drivers create jobs, and in submit_job() they pass ownership over the
> > job to the jobqueue – IOW, after pushing a job, a driver can't access it
> > anymore. In the run_job() callback, the jobqueue either passes the job
> > back by value (ownership) or borrows the job to the driver so that it
> > can be copied (this is done so that the JQ can hypothetically do
> > resubmits).
> >
> > This way there is no need for refcounting (in Rust / jobqueue).
> >
> > Maybe the core of the problem is not so much the lack of refcounting,
> > but the lack of ownership rules. Why would the driver even need the job
> > after it got pushed? It should be fire-and-forget.
>
> Yeah, that sounds sane to me as well and is exactly how it was initially
> designed in the drm_scheduler as well.
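For the C side, the fire-and-forget contract being described maps to roughly
the following (an untested sketch; 'my_job' and 'my_push' are made-up names,
and the drm_sched_job_init() call is omitted because its argument list differs
between kernel versions):

    struct my_job {
            struct drm_sched_job base;
            /* driver-private submission data lives here */
    };

    /* Assumes drm_sched_job_init() already succeeded for 'job'. */
    static struct dma_fence *my_push(struct my_job *job)
    {
            struct dma_fence *fence;

            drm_sched_job_arm(&job->base);

            /* What callers keep is the refcounted scheduler fence, not the
             * job itself. */
            fence = dma_fence_get(&job->base.s_fence->finished);

            drm_sched_entity_push_job(&job->base);
            /* From here on the job belongs to the scheduler; the driver only
             * sees it again in .run_job() and .free_job(). */

            return fence;
    }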
>
> The job is basically just the information the driver needs for the
> submission, which it gives to the scheduler on push, and the scheduler
> gives back to the driver on pop.
>
> The full execution time is represented by the scheduler fence and not the
> job. And the scheduler fence is reference counted exactly because of the
> reasons Matthew brought up here.

Would be interesting to hear where Xe would still need the job. If only the
backend_ops give a driver access to a job again after it got pushed, then it
should be safe.

>
> I'm absolutely not against reference counting; what I'm pushing back on is
> abusing the job object as something it was never designed for while we
> already have an object which implements exactly the needed functionality.
>
> > > > And my uneducated guess is that it's happening in amdgpu. It seems a
> > > > sched_job lives inside an amdgpu_job. Can the latter be freed at
> > > > other places than free_job()?
> > > >
> > > > […]
> > > >
> > > > > > It basically says to the driver that the job lifetime problems
> > > > > > created by the scheduler are the driver's problem and need to be
> > > > > > worked around there.
> > > > >
> > > > > My POV still mostly is that (with the current design) the driver
> > > > > must not use jobs after free_job() was invoked. And when that
> > > > > happens is unpredictable.
> > >
> > > This is somewhat of an absurd statement from my point of view. I have
> > > a valid job pointer, then I call another function (see above for an
> > > example of how drm_sched_start/stop is unsafe) and it disappears
> > > behind my back.
> >
> > The statement is absurd because reality (the code) is absurd. We all are
> > basically Alice in Wonderland, running as fast as we can just to remain
> > on the same spot ^_^
> >
> > What I am stating is not that this is *good*, but this is what it
> > currently is like. Whether we like it or not.
> >
> > The misunderstanding you and I might have is that for me jobs having to
> > be refcounted is not a reality until it's reflected in code,
> > documentation and, ideally, drivers.
> >
> > > The safe way to handle this is to take a local reference before doing
> > > anything that could make it disappear. That is much more reasonable
> > > than saying, "we have five things you can do in the scheduler, and if
> > > you do any of them it isn't safe to touch the job afterward."
> >
> > Yeah, but that's drm_sched being drm_scheddy. Before I documented it,
> > there were also these implicit refcounting rules in run_job(), where the
> > driver needs to take the reference for the scheduler for it to be
> > non-racy.
> >
> > It also wasn't documented for a long time that drm_sched (through
> > spsc_queue) will explode if you don't use entities with a single
> > producer thread.
>
> That is actually documented, but not on the scheduler but rather the
> dma_fence.
>
> And that you can only have a single producer is a requirement inherited
> from the dma_fence and not scheduler specific at all.

What does dma_fence have to do with it? It's about the spsc_queue being racy
like mad. You can access and modify dma_fences in parallel however you want –
they are refcounted and locked.

P.

> > drm_sched got here because of gross design mistakes, lots of hacks for
> > few drivers, and, particularly, a strange aversion¹ against writing
> > documentation.
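Coming back to Matthew's point above about taking a local reference before
calling anything that can make the job disappear: with the current code, the
refcounted object a driver can legitimately hold across such a call is the
scheduler fence, not the job. A minimal sketch (the function and the logging
are made up for illustration; only the dma_fence / s_fence usage is real API):

    static void my_timeout_helper(struct drm_sched_job *sched_job)
    {
            /* Grab the finished fence before the stop/start dance that may
             * end up freeing the job. */
            struct dma_fence *finished =
                    dma_fence_get(&sched_job->s_fence->finished);

            /* ... reset dance that may free 'sched_job' ... */

            /* The fence is still valid here even if the job is gone. */
            if (dma_fence_is_signaled(finished))
                    pr_debug("job finished during reset\n");

            dma_fence_put(finished);
    }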
> > If Xe came, back in the day, to the conclusion that job lifetimes are
> > fundamentally broken and that the objectively correct way to solve this
> > is refcounting, then why wasn't that pushed into drm_sched back then?
> > > > >
> > > > > To be unfair, we already have strange refcount expectations. But I
> > > > > sort of agree that there is no objectively good solution in sight.
> > > > > > >
> > > > > > > Let me respin my documentation series and upstream that
> > > > > > > soonish, then we can build on top of that.
> > > > > > >
> > > > > > > P.
> > > > > > > >
> > > > > > > > 2. Move reference counting to the base DRM scheduler job
> > > > > > > > object, provide a vfunc for the final job put, and document
> > > > > > > > usage. Medium invasive.
> > > > > >
> > > > > > I strongly think that reference counting the job object just
> > > > > > because the scheduler needs it is a bad idea.
> > > > > >
> > > > > > With that we are just moving the hot potato from one side of our
> > > > > > mouth to the other without really solving the issue.
> > >
> > > See above - I can't say I agree with this assessment. I think the lack
> > > of reference counting is exactly the problem. I don't really understand
> > > the pushback on a very well-understood concept (reference counting) in
> > > Linux. I would sign up to fix the entire subsystem if we go this route.
> > > > > >
> > > > > > If a driver like Xe needs that for some reason then that is
> > > > > > perfectly fine.
> > > > >
> > > > > Nouveau doesn't need it either.
> > > > > > > >
> > > > > > > > 3. Move job (driver) side tracking to the scheduler fence and
> > > > > > > > let it control the lifetime. Very invasive.
> > > > > >
> > > > > > Thinking about it more, that is actually not so much of a problem.
> > > > > >
> > > > > > Let me try to code something together by the end of next week or
> > > > > > so.
> > > > >
> > > > > The hero Gotham needs :)
> > >
> > > Are you sure about this one? I think unless the problems around
> > > drm_sched_start/stop and free_guilty are fixed, my feeling is this
> > > entire thing is still badly broken for anyone who wants to use those.
> > >
> > > To sum up this whole email: I strongly disagree with option #3, but if
> > > that is the consensus, I will, of course, support the effort.
> >
> > I would like to discuss those topics with Danilo, too, who returns from
> > LPC soonish. Also to get some more insights into Nouveau and our use
> > cases.
> >
> > My suggestion is that we pick the conversation up again soonish.
> > Christmas is around the corner, and I suppose we can't fix this all up
> > in 2025 anyways, so we might want to give it a fresh re-spin in '26.
>
> Since we finally found the root cause, I'm all for postponing that till
> next year.
>
> Christian.
>
> > Greetings,
> > P.
> >
> > [¹] The strangest aversion in drm_sched, however, is the one against
> > locking. Locks were only taken when *absolutely* necessary. It's as if
> > the entire code was designed by some gamers who want to show their
> > YouTube subscribers that they can get 1 fps more by changing RAM timings
> > in the BIOS.
> > drm_sched might be the component in the kernel using the most
> > synchronization mechanisms: spinlocks, RCU, atomic integers, atomic
> > instructions, refcounting, READ_ONCE(), just accessing locklessly… even
> > Paul McKenney would get wide eyes here.
> > The only thing we're still missing is seqlocks and rw_locks, but maybe
> > we can add those /s
> >
> > That's likely something we can't get repaired at all anymore.
>
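P.S.: to make option #2 from the quote above a bit more concrete, it would
mean something roughly like the following. This is purely a hypothetical
sketch, not existing drm_sched API; the refcount field, drm_sched_job_get(),
drm_sched_job_put() and drm_sched_job_release() are all made up:

    /* Hypothetical: a refcounted drm_sched_job. The driver's free callback
     * would then be called exactly once, when the last reference is
     * dropped, instead of free_job running behind the driver's back. */
    struct drm_sched_job {
            struct kref refcount;           /* new: job lifetime */
            struct drm_gpu_scheduler *sched;
            /* ... existing members ... */
    };

    static void drm_sched_job_release(struct kref *ref)
    {
            struct drm_sched_job *job =
                    container_of(ref, struct drm_sched_job, refcount);

            job->sched->ops->free_job(job);
    }

    static inline struct drm_sched_job *
    drm_sched_job_get(struct drm_sched_job *job)
    {
            kref_get(&job->refcount);
            return job;
    }

    static inline void drm_sched_job_put(struct drm_sched_job *job)
    {
            kref_put(&job->refcount, drm_sched_job_release);
    }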
