On Tue, 2025-12-09 at 14:32 +0100, Christian König wrote:
> On 12/9/25 12:00, Philipp Stanner wrote:
> > On Mon, 2025-12-08 at 13:39 -0800, Matthew Brost wrote:
> > > On Mon, Dec 08, 2025 at 08:43:03PM +0100, Christian König wrote:
> > > > On 12/8/25 20:02, Matthew Brost wrote:
> > > > > On Mon, Dec 08, 2025 at 10:58:42AM -0800, Matthew Brost wrote:
> > > > > > On Mon, Dec 08, 2025 at 11:35:33AM +0100, Philipp Stanner wrote:
> > > > > > > On Fri, 2025-12-05 at 09:30 -0800, Matthew Brost wrote:
> > > > > > > > On Fri, Dec 05, 2025 at 11:18:21AM +0100, Christian König wrote:
> > > > > > > > > On 12/1/25 10:04, Philipp Stanner wrote:
> > > > ....
> > > > > > > > > > This entire fix idea seems to circle around the concept
> > > > > > > > > > of relying yet again on the scheduler's internal behavior
> > > > > > > > > > (i.e., when it schedules the call to free_job()).
> > > > > > > > > >
> > > > > > > > > > I think we discussed that at XDC, but the way I see it, if
> > > > > > > > > > drivers have strange job lifetime requirements where a job
> > > > > > > > > > shall outlive drm_sched's free_job() call, they must solve
> > > > > > > > > > that with a proper synchronization mechanism.
> > > > > > > > >
> > > > > > > > > Well that is not correct as far as I can see.
> > > > > > > > >
> > > > > > > > > The problem here is rather that the scheduler passes the
> > > > > > > > > job as a parameter to the timedout_job() callback, but
> > > > > > > > > doesn't guarantee that the ->free_job() callback isn't
> > > > > > > > > called while timedout_job() runs.
> > > > > > > > >
> > > > > > > > > This should be prevented by removing the job in question from
> > > > > > > > > the pending list (see drm_sched_job_timedout), but for some
> > > > > > > > > reason that doesn't seem to work correctly.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Are you sure this is happening? It doesn't seem possible, nor
> > > > > > > > have I observed it.
> > > > > > >
> > > > > > > It's impossible, isn't it?
> > > > > > >
> > > > > > > static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > {
> > > > > > > 	struct drm_gpu_scheduler *sched;
> > > > > > > 	struct drm_sched_job *job;
> > > > > > > 	enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_RESET;
> > > > > > >
> > > > > > > 	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> > > > > > >
> > > > > > > 	/* Protects against concurrent deletion in drm_sched_get_finished_job */
> > > > > > > 	spin_lock(&sched->job_list_lock);
> > > > > > > 	job = list_first_entry_or_null(&sched->pending_list,
> > > > > > > 				       struct drm_sched_job, list);
> > > > > > >
> > > > > > > 	if (job) {
> > > > > > > 		/*
> > > > > > > 		 * Remove the bad job so it cannot be freed by a concurrent
> > > > > > > 		 * &struct drm_sched_backend_ops.free_job. It will be
> > > > > > > 		 * reinserted after the scheduler's work items have been
> > > > > > > 		 * cancelled, at which point it's safe.
> > > > > > > 		 */
> > > > > > > 		list_del_init(&job->list);
> > > > > > > 		spin_unlock(&sched->job_list_lock);
> > > > > > >
> > > > > > > 		status = job->sched->ops->timedout_job(job);
> > > > > > >
> > > > > > >
> > > > > > > 1. scheduler takes list lock
> > > > > > > 2. removes job from list
> > > > > > > 3. unlocks
> > > > > > > 4. calls timedout_job callback
> > > > > > >
> > > > > > >
> > > > > > > How can free_job_work, through drm_sched_get_finished_job(), get
> > > > > > > and
> > > > > > > free the same job?
> > > > > > >
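> > > > > > > For reference, the consumer side takes the same lock and only
> > > > > > > ever operates on the list head (trimmed sketch, not the verbatim
> > > > > > > upstream code):
> > > > > > >
> > > > > > > static struct drm_sched_job *
> > > > > > > drm_sched_get_finished_job(struct drm_gpu_scheduler *sched)
> > > > > > > {
> > > > > > > 	struct drm_sched_job *job;
> > > > > > >
> > > > > > > 	spin_lock(&sched->job_list_lock);
> > > > > > >
> > > > > > > 	job = list_first_entry_or_null(&sched->pending_list,
> > > > > > > 				       struct drm_sched_job, list);
> > > > > > > 	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
> > > > > > > 		/* signaled: take it off the pending list */
> > > > > > > 		list_del_init(&job->list);
> > > > > > > 	} else {
> > > > > > > 		job = NULL;
> > > > > > > 	}
> > > > > > >
> > > > > > > 	spin_unlock(&sched->job_list_lock);
> > > > > > >
> > > > > > > 	return job;
> > > > > > > }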
> > > > > >
> > > > > > It can't.
> > > >
> > > > But exactly that happens somehow. Don't ask me how, I have no idea.
> >
> > *Philipp refuses to elaborate and asks Christian*
> >
> > How are you so sure that's what's happening? Anyway, assuming it is
> > true:
>
> [ 489.134585] ==================================================================
> [ 489.141949] BUG: KASAN: slab-use-after-free in amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
> [ 489.151339] Read of size 4 at addr ffff88a0d5f4214c by task kworker/u128:0/12
> [ 489.158686]
> [ 489.160277] CPU: 11 UID: 0 PID: 12 Comm: kworker/u128:0 Tainted: G            E 6.16.0-1289896.3.zuul.0ec208edc00d48a9bae1719675cb777f #1 PREEMPT(voluntary)
> [ 489.160285] Tainted: [E]=UNSIGNED_MODULE
> [ 489.160288] Hardware name: TYAN B8021G88V2HR-2T/S8021GM2NR-2T, BIOS V1.03.B10 04/01/2019
> [ 489.160292] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
> [ 489.160306] Call Trace:
> [ 489.160308] <TASK>
> [ 489.160311] dump_stack_lvl+0x64/0x80
> [ 489.160321] print_report+0xce/0x630
> [ 489.160328] ? _raw_spin_lock_irqsave+0x86/0xd0
> [ 489.160333] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
> [ 489.160337] ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
> [ 489.161044] kasan_report+0xb8/0xf0
> [ 489.161049] ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
> [ 489.161756] amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
> [ 489.162464] ? __pfx_amdgpu_device_gpu_recover+0x10/0x10 [amdgpu]
> [ 489.163170] ? amdgpu_coredump+0x1fd/0x4c0 [amdgpu]
> [ 489.163904] amdgpu_job_timedout+0x642/0x1400 [amdgpu]
> [ 489.164698] ? __pfx__raw_spin_lock+0x10/0x10
> [ 489.164703] ? __pfx_amdgpu_job_timedout+0x10/0x10 [amdgpu]
> [ 489.165496] ? _raw_spin_lock+0x75/0xc0
> [ 489.165499] ? __pfx__raw_spin_lock+0x10/0x10
> [ 489.165503] drm_sched_job_timedout+0x1b0/0x4b0 [gpu_sched]
That doesn't show that it's free_job() that freed the memory.

@Vitaly: Can you reproduce the bug? If so, adding debug prints that log
each job's address when it is allocated and when it is freed in
free_job() could help narrow this down.
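
Something along these lines (hypothetical placement; the exact spots in
amdgpu_job_alloc() and in amdgpu's free_job callback would need
checking):

	/* in amdgpu_job_alloc(), once the job exists: */
	pr_info("amdgpu job alloc: job=%p sched_job=%p\n", job, &job->base);

	/* in the driver's free_job callback, right before freeing: */
	pr_info("amdgpu job free:  job=%p sched_job=%p\n", job, &job->base);

If the address from the UAF report shows up after its "free" print, we
would at least know that free_job() really is what freed it.
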
I repeat, we need more info :)
>
> >
> > > >
> > > > My educated guess is that the job somehow ends up on the pending list
> > > > again.
> >
> > then the obvious question would be: does amdgpu touch the pending_list
> > itself, or does it only ever modify it through proper scheduler APIs?
>
> My educated guess is that drm_sched_stop() inserted the job back into the
> pending list, but I still have no idea how it is possible that free_job is
> running after the scheduler is stopped.
>
And my uneducated guess is that it's happening in amdgpu. It seems a
sched_job lives embedded inside an amdgpu_job. Can the latter be freed
anywhere other than in free_job()?
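
Roughly this layout, if I'm reading the driver correctly (trimmed
sketch):

	struct amdgpu_job {
		struct drm_sched_job base;
		/* ... amdgpu-specific members ... */
	};

	#define to_amdgpu_job(sched_job) \
		container_of((sched_job), struct amdgpu_job, base)

So whoever frees the amdgpu_job also frees the drm_sched_job that the
scheduler may still be looking at.
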
timedout_job() and free_job() cannot race against each other for the
same job; the pending list access is locked.

But maybe investigate Matthew's suggestion and look into the guilty
mechanism, too.
> > > >
> > > > > >
> > > > > > > The pending_list is probably the one place where we actually lock
> > > > > > > consistently and sanely.
> > > > > > >
> > > > > > > I think this needs to be debugged more intensively, Christian.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > What actually looks like a problem is that free_job can be
> > > > > > > > called from within drm_sched_job_timedout. Look at [2]: if
> > > > > > > > you're using free_guilty (Xe isn't, but [2] was Xe trying to
> > > > > > > > do the same thing), this is actually unsafe. The free_guilty
> > > > > > > > code should likely be removed, as it definitely can explode
> > > > > > > > under the right conditions.
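> > > > > > > >
> > > > > > > > The path in question, roughly (trimmed from
> > > > > > > > drm_sched_job_timedout):
> > > > > > > >
> > > > > > > > 	status = job->sched->ops->timedout_job(job);
> > > > > > > >
> > > > > > > > 	/* guilty job already completed and needs manual removal */
> > > > > > > > 	if (sched->free_guilty) {
> > > > > > > > 		job->sched->ops->free_job(job);
> > > > > > > > 		sched->free_guilty = false;
> > > > > > > > 	}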
> > > > > > >
> > > > > > > I'm right now not even sure why free_guilty exists, but I don't
> > > > > > > see how
> > > > > >
> > > > > > I'm not sure why free_guilty exists either. If the fence is
> > > > > > signaled in timedout_job, free_job will get scheduled on another
> > > > > > work item.
> > > > > >
> > > > > > > it's illegal for drm_sched to call free_job in
> > > > > > > drm_sched_job_timedout?
> > > > > > >
> > > > > > > free_job can be called at any point in time; drivers must
> > > > > > > expect that.
> > > > > > > No lock is being held, and timedout_job already ran. So what's the
> > > > > > > problem?
> > > > > > >
> > > > > > > For drivers with additional refcounting it would be even less of a
> > > > > > > problem.
> > > > > > >
> > > > > >
> > > > > > No, the scheduler can still reference the job.
> > > > > >
> > > > > > 1265         fence = sched->ops->run_job(sched_job);
> > > > > > 1266         complete_all(&entity->entity_idle);
> > > > > > 1267         drm_sched_fence_scheduled(s_fence, fence);
> > > > > > 1268
> > > > > > 1269         if (!IS_ERR_OR_NULL(fence)) {
> > > > > > 1270                 r = dma_fence_add_callback(fence, &sched_job->cb,
> > > > > > 1271                                            drm_sched_job_done_cb);
> > > > > > 1272                 if (r == -ENOENT)
> > > > > > 1273                         drm_sched_job_done(sched_job, fence->error);
> > > > > > 1274                 else if (r)
> > > > > > 1275                         DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n", r);
> > > > > > 1276
> > > > > > 1277                 dma_fence_put(fence);
> > > > > > 1278         } else {
> > > > > > 1279                 drm_sched_job_done(sched_job, IS_ERR(fence) ?
> > > > > > 1280                                    PTR_ERR(fence) : 0);
> > > > > > 1281         }
> > > > > > 1282
> > > > > > 1283         wake_up(&sched->job_scheduled);
> > > > > > 1284         drm_sched_run_job_queue(sched);
> > > > > >
> > > > > > At line 1269, the run_job work item is interrupted. The timeout
> > > > > > work runs and calls free_job, which performs the final put. Then
> > > > > > the run_job work item resumes, and boom, UAF. By the same
> > > > > > reasoning, I think moving free_job to the timeout work queue
> > > > > > could also cause issues.
> > > > > >
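> > > > > > Spelled out as a timeline (schematic only):
> > > > > >
> > > > > > /*
> > > > > >  * run_job work item                  timeout / free_job side
> > > > > >  * -----------------                  -----------------------
> > > > > >  * fence = ops->run_job(sched_job);
> > > > > >  *                                    hw fence signals
> > > > > >  *                                    free_job work runs:
> > > > > >  *                                    ops->free_job(sched_job),
> > > > > >  *                                    which does the final put
> > > > > >  * dma_fence_add_callback(fence,
> > > > > >  *         &sched_job->cb, ...);      <- UAF on sched_job
> > > > > >  */
> > > > > >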
> > > > > > If the run_job work item took a reference to the job before
> > > > > > adding it to the pending list and dropped it after it was done
> > > > > > touching it in this function, then yes, that would be safe. This
> > > > > > is an argument for moving reference counting into the base DRM
> > > > > > scheduler class; it would make
> > > > >
> > > > > typo: s/DRM scheduler class/DRM job class
> > > >
> > > > That strongly sounds like re-inventing the scheduler fence.
> > > >
> > >
> > > Perhaps.
> > >
> > > > What if we completely drop the job object? Or merge it into the
> > > > scheduler fence?
> > > >
> > > > The fence has reference counting, proper state transitions and a well
> > > > defined lifetime.
> > > >
> > > > We would just need ->schedule and ->finished functions instead of
> > > > ->run_job and ->free_job. Those callbacks would then still be called by
> > > > the scheduler in work item context instead of the irq context of the
> > > > dma_fence callbacks.
> > >
> > > Yes, definitely no IRQ contexts.
> > >
> > > >
> > > > The job can then be a void* in the scheduler fence where drivers can
> > > > put anything they want, and the drivers control the lifetime of that.
> > > > E.g. they can free it during ->schedule as well as during ->finished.
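> > > >
> > > > Very roughly, just to sketch the direction (all names below are
> > > > invented, nothing like this exists upstream yet):
> > > >
> > > > struct drm_sched_fence_ops {
> > > > 	/* both called from the scheduler's work items, never IRQ context */
> > > > 	struct dma_fence *(*schedule)(struct drm_sched_fence *s_fence);
> > > > 	void (*finished)(struct drm_sched_fence *s_fence);
> > > > };
> > > >
> > > > struct drm_sched_fence {
> > > > 	struct dma_fence scheduled;
> > > > 	struct dma_fence finished;
> > > > 	/* driver-owned payload; the driver controls its lifetime and
> > > > 	 * may free it from ->schedule or ->finished */
> > > > 	void *driver_data;
> > > > 	/* ... */
> > > > };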
> > > >
> > >
> > > I think this is a reasonable idea, but it would require major surgery
> > > across the subsystem plus the 11 upstream drivers I'm counting that
> > > use the DRM scheduler. This would be a huge coordinated effort.
> > >
> > > So I see three options:
> > >
> > > 1. Rename free_job to put_job and document usage. Rip out free_guilty.
> > > Likely the easiest and least invasive.
> >
> > I think I can live with that. Effort-wise it's the best we can do
> > right now. However, that does need proper documentation.
>
> I think that is the worst of all options.
>
> It basically tells the driver that the job lifetime problems created by
> the scheduler are the driver's problem and need to be worked around there.
>
My POV still mostly is that (with the current design) the driver must
not use jobs after free_job() has been invoked. And when that happens
is unpredictable.

To be fair, we already have strange refcount expectations.

But I sort of agree that there is no objectively good solution in
sight.
> >
> > Let me respin my documentation series and upstream that soonish, then
> > we can build on top of that.
> >
> >
> > P.
> >
> > >
> > > 2. Move reference counting to the base DRM scheduler job object, provide a
> > > vfunc for the final job put, and document usage. Medium invasive.
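> > >
> > > Roughly (invented names, just to make the option concrete):
> > >
> > > struct drm_sched_job {
> > > 	struct kref refcount;
> > > 	/* ... existing members ... */
> > > };
> > >
> > > static void drm_sched_job_release(struct kref *kref)
> > > {
> > > 	struct drm_sched_job *job =
> > > 		container_of(kref, struct drm_sched_job, refcount);
> > >
> > > 	/* final put: only now may the driver actually free its job */
> > > 	job->sched->ops->free_job(job);
> > > }
> > >
> > > void drm_sched_job_put(struct drm_sched_job *job)
> > > {
> > > 	kref_put(&job->refcount, drm_sched_job_release);
> > > }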
>
> I strongly think that reference counting the job object just because the
> scheduler needs it is a bad idea.
>
> With that we are just moving the hot potato from one side of our mouth to the
> other without really solving the issue.
>
> If a driver like XE needs that for some reason then that is perfectly fine.
Nouveau doesn't need it either.
>
> > > 3. Move job (driver) side tracking to the scheduler fence and let it
> > > control the lifetime. Very invasive.
>
> Thinking about it more, that is actually not so much of a problem.
>
> Let me try to code something together by the end of next week or so.
The hero Gotham needs :)
P.