On Tue, 2025-12-09 at 14:32 +0100, Christian König wrote:
> On 12/9/25 12:00, Philipp Stanner wrote:
> > On Mon, 2025-12-08 at 13:39 -0800, Matthew Brost wrote:
> > > On Mon, Dec 08, 2025 at 08:43:03PM +0100, Christian König wrote:
> > > > On 12/8/25 20:02, Matthew Brost wrote:
> > > > > On Mon, Dec 08, 2025 at 10:58:42AM -0800, Matthew Brost wrote:
> > > > > > On Mon, Dec 08, 2025 at 11:35:33AM +0100, Philipp Stanner wrote:
> > > > > > > On Fri, 2025-12-05 at 09:30 -0800, Matthew Brost wrote:
> > > > > > > > On Fri, Dec 05, 2025 at 11:18:21AM +0100, Christian König wrote:
> > > > > > > > > On 12/1/25 10:04, Philipp Stanner wrote:
> > > > ....
> > > > > > > > > > This entire fix idea seems to circle around the concept of 
> > > > > > > > > > relying yet
> > > > > > > > > > again on the scheduler's internal behavior (i.e., when it 
> > > > > > > > > > schedules the
> > > > > > > > > > call to free_job()).
> > > > > > > > > > 
> > > > > > > > > > I think we discussed that at XDC, but how I see it if 
> > > > > > > > > > drivers have
> > > > > > > > > > strange job life time requirements where a job shall outlive
> > > > > > > > > > drm_sched's free_job() call, they must solve that with a 
> > > > > > > > > > proper
> > > > > > > > > > synchronization mechanism.
> > > > > > > > > 
> > > > > > > > > Well that is not correct as far as I can see.
> > > > > > > > > 
> > > > > > > > > The problem here is rather that the scheduler gives the job 
> > > > > > > > > as parameter to the timedout_job() callback, but doesn't 
> > > > > > > > > guarantee that ->free_job() callback isn't called while 
> > > > > > > > > timedout_job() runs.
> > > > > > > > > 
> > > > > > > > > This should be prevented by removing the job in question from 
> > > > > > > > > the pending list (see drm_sched_job_timedout), but for some 
> > > > > > > > > reason that doesn't seem to work correctly.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Are you sure this is happening? It doesn’t seem possible, nor 
> > > > > > > > have I
> > > > > > > > observed it.
> > > > > > > 
> > > > > > > It's impossible, isn't it?
> > > > > > > 
> > > > > > > static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > {
> > > > > > > 	struct drm_gpu_scheduler *sched;
> > > > > > > 	struct drm_sched_job *job;
> > > > > > > 	enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_RESET;
> > > > > > > 
> > > > > > > 	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> > > > > > > 
> > > > > > > 	/* Protects against concurrent deletion in drm_sched_get_finished_job */
> > > > > > > 	spin_lock(&sched->job_list_lock);
> > > > > > > 	job = list_first_entry_or_null(&sched->pending_list,
> > > > > > > 				       struct drm_sched_job, list);
> > > > > > > 
> > > > > > > 	if (job) {
> > > > > > > 		/*
> > > > > > > 		 * Remove the bad job so it cannot be freed by a concurrent
> > > > > > > 		 * &struct drm_sched_backend_ops.free_job. It will be
> > > > > > > 		 * reinserted after the scheduler's work items have been
> > > > > > > 		 * cancelled, at which point it's safe.
> > > > > > > 		 */
> > > > > > > 		list_del_init(&job->list);
> > > > > > > 		spin_unlock(&sched->job_list_lock);
> > > > > > > 
> > > > > > > 		status = job->sched->ops->timedout_job(job);
> > > > > > > 
> > > > > > > 
> > > > > > >    1. scheduler takes list lock
> > > > > > >    2. removes job from list
> > > > > > >    3. unlocks
> > > > > > >    4. calls timedout_job callback
> > > > > > > 
> > > > > > > 
> > > > > > > How can free_job_work, through drm_sched_get_finished_job(), get 
> > > > > > > and
> > > > > > > free the same job?
> > > > > > > 
> > > > > > 
> > > > > > It can't.
> > > > 
> > > > But exactly that happens somehow. Don't ask me how, I have no idea.
> > 
> > *Philipp refuses to elaborate and asks Christian*
> > 
> > How are you so sure that's what's happening? Anyway, assuming it
> > is true:
> 
> [  489.134585] 
> ==================================================================
> [  489.141949] BUG: KASAN: slab-use-after-free in 
> amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
> [  489.151339] Read of size 4 at addr ffff88a0d5f4214c by task 
> kworker/u128:0/12
> [  489.158686] 
> [  489.160277] CPU: 11 UID: 0 PID: 12 Comm: kworker/u128:0 Tainted: G         
>    E       6.16.0-1289896.3.zuul.0ec208edc00d48a9bae1719675cb777f #1 
> PREEMPT(voluntary) 
> [  489.160285] Tainted: [E]=UNSIGNED_MODULE
> [  489.160288] Hardware name: TYAN B8021G88V2HR-2T/S8021GM2NR-2T, BIOS 
> V1.03.B10 04/01/2019
> [  489.160292] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
> [  489.160306] Call Trace:
> [  489.160308]  <TASK>
> [  489.160311]  dump_stack_lvl+0x64/0x80
> [  489.160321]  print_report+0xce/0x630
> [  489.160328]  ? _raw_spin_lock_irqsave+0x86/0xd0
> [  489.160333]  ? __pfx__raw_spin_lock_irqsave+0x10/0x10
> [  489.160337]  ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
> [  489.161044]  kasan_report+0xb8/0xf0
> [  489.161049]  ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
> [  489.161756]  amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
> [  489.162464]  ? __pfx_amdgpu_device_gpu_recover+0x10/0x10 [amdgpu]
> [  489.163170]  ? amdgpu_coredump+0x1fd/0x4c0 [amdgpu]
> [  489.163904]  amdgpu_job_timedout+0x642/0x1400 [amdgpu]
> [  489.164698]  ? __pfx__raw_spin_lock+0x10/0x10
> [  489.164703]  ? __pfx_amdgpu_job_timedout+0x10/0x10 [amdgpu]
> [  489.165496]  ? _raw_spin_lock+0x75/0xc0
> [  489.165499]  ? __pfx__raw_spin_lock+0x10/0x10
> [  489.165503]  drm_sched_job_timedout+0x1b0/0x4b0 [gpu_sched]

That doesn't show that it was free_job() that freed the memory.

@Vitaly: Can you reproduce the bug? If yes, adding debug prints that
log each job's address when it is allocated and when it is freed in
free_job() would help narrow this down.
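
Something along these lines, purely as a sketch (I'm assuming the job is
allocated in amdgpu_job_alloc() and that amdgpu_job_free_cb() is the
->free_job() callback; adjust to wherever amdgpu actually allocates and
frees its jobs):

	/* in the allocation path, once the amdgpu_job exists */
	pr_info("job alloc: amdgpu_job=%p sched_job=%p\n", job, &job->base);

	/* in the ->free_job() callback, right before the job is released */
	pr_info("job free:  amdgpu_job=%p sched_job=%p\n", job, &job->base);

If the address KASAN later complains about shows up in a "job free" line
first, we at least know which path dropped the job.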

I repeat, we need more info :)

> 
> > 
> > > > 
> > > > My educated guess is that the job somehow ends up on the pending list 
> > > > again.
> > 
> > then the obvious question would be: does amdgpu touch the pending_list
> > itself, or does it only ever modify it through proper scheduler APIs?
> 
> My educated guess is that drm_sched_stop() inserted the job back into the 
> pending list, but I still have no idea how it is possible that free_job is 
> running after the scheduler is stopped.
> 

And my uneducated guess is that it's happening in amdgpu. It seems a
sched_job lives inside an amdgpu_job. Can the latter be freed at places
other than free_job()?
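
Roughly this containment, as far as I understand it (field name from my
reading of amdgpu_job.h, please double-check):

	struct amdgpu_job {
		struct drm_sched_job	base;	/* the scheduler's view of the job */
		/* ... amdgpu-specific state ... */
	};

So any path that frees the amdgpu_job outside of free_job() also frees
the drm_sched_job which the scheduler, and in particular the timeout
handler, might still be holding on to.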

timedout_job() and free_job() cannot race with each other over the same
job; the pending_list handling is locked.

But maybe investigate Matthew's suggestion and look into the guilty
mechanism, too.
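
For reference, the guilty path in drm_sched_job_timedout() looks roughly
like this from memory (please check against the tree the bug was hit on):

	status = job->sched->ops->timedout_job(job);

	/*
	 * Guilty job did complete and hence needs to be manually removed.
	 * See drm_sched_stop() doc.
	 */
	if (sched->free_guilty) {
		job->sched->ops->free_job(job);
		sched->free_guilty = false;
	}

That is free_job() being invoked from the timeout work item itself, which
is exactly the path Matthew says can explode under the right conditions.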

> > > > 
> > > > > > 
> > > > > > > The pending_list is probably the one place where we actually lock
> > > > > > > consistently and sanely.
> > > > > > > 
> > > > > > > I think this needs to be debugged more intensively, Christian.
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > What actually looks like a problem is that in 
> > > > > > > > drm_sched_job_timedout,
> > > > > > > > free_job can be called. Look at [2]—if you’re using free_guilty 
> > > > > > > > (Xe
> > > > > > > > isn’t, but [2] was Xe trying to do the same thing), this is 
> > > > > > > > actually
> > > > > > > > unsafe. The free_guilty code should likely be removed as that 
> > > > > > > > definitely
> > > > > > > > can explode under the right conditions.
> > > > > > > 
> > > > > > > I'm right now not even sure why free_guilty exists, but I don't 
> > > > > > > see how
> > > > > > 
> > > > > > I'm not sure why free_guilty exists either. If the fence is signaled
> > > > > > in timedout_job, free_job will get scheduled on another work item.
> > > > > > 
> > > > > > > it's illegal for drm_sched to call free_job in 
> > > > > > > drm_sched_job_timedout?
> > > > > > > 
> > > > > > > free_job can be called at any point in time, drivers must expect 
> > > > > > > that.
> > > > > > > No lock is being held, and timedout_job already ran. So what's the
> > > > > > > problem?
> > > > > > > 
> > > > > > > For drivers with additional refcounting it would be even less of a
> > > > > > > problem.
> > > > > > > 
> > > > > > 
> > > > > > No, the scheduler can still reference the job.
> > > > > > 
> > > > > > 1265         fence = sched->ops->run_job(sched_job);
> > > > > > 1266         complete_all(&entity->entity_idle);
> > > > > > 1267         drm_sched_fence_scheduled(s_fence, fence);
> > > > > > 1268
> > > > > > 1269         if (!IS_ERR_OR_NULL(fence)) {
> > > > > > 1270                 r = dma_fence_add_callback(fence, &sched_job->cb,
> > > > > > 1271                                            drm_sched_job_done_cb);
> > > > > > 1272                 if (r == -ENOENT)
> > > > > > 1273                         drm_sched_job_done(sched_job, fence->error);
> > > > > > 1274                 else if (r)
> > > > > > 1275                         DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n", r);
> > > > > > 1276
> > > > > > 1277                 dma_fence_put(fence);
> > > > > > 1278         } else {
> > > > > > 1279                 drm_sched_job_done(sched_job, IS_ERR(fence) ?
> > > > > > 1280                                    PTR_ERR(fence) : 0);
> > > > > > 1281         }
> > > > > > 1282
> > > > > > 1283         wake_up(&sched->job_scheduled);
> > > > > > 1284         drm_sched_run_job_queue(sched);
> > > > > > 
> > > > > > At line 1269, the run_job work item is interrupted. The timeout work
> > > > > > runs and calls free_job, which performs the final put. Then the
> > > > > > run_job work item resumes, and boom, UAF. Using the same reasoning, I
> > > > > > think moving free_job to the timed-out work queue could also cause
> > > > > > issues.
> > > > > > 
> > > > > > If the run_job work item took a reference to the job before adding
> > > > > > it to the pending list and dropped it after it was done touching it
> > > > > > in this function, then yes, that would be safe. This is an argument
> > > > > > for moving reference counting into the base DRM scheduler class, it
> > > > > > would make
> > > > > 
> > > > > typo: s/DRM scheduler class/DRM job class
> > > > 
> > > > That strongly sounds like re-inventing the scheduler fence.
> > > > 
> > > 
> > > Perhaps.
> > > 
> > > > What if we completely drop the job object? Or merge it into the 
> > > > scheduler fence?
> > > > 
> > > > The fence has reference counting, proper state transitions and a well 
> > > > defined lifetime.
> > > > 
> > > > We would just need ->schedule and ->finished functions instead of 
> > > > ->run_job and ->free_job. Those callbacks would then still be called by 
> > > > the scheduler in work item context instead of the irq context of the 
> > > > dma_fence callbacks.
> > > 
> > > Yes, definitely no IRQ contexts.
> > > 
> > > > 
> > > > The job can then be a void* in the scheduler fence where drivers can 
> > > > put anything they want and also drivers control the lifetime of that. 
> > > > E.g. they can free it during ->schedule as well as during ->finished.
> > > > 
> > > 
> > > I think this is a reasonable idea, but it would require major surgery
> > > across the subsystem plus the 11 upstream drivers I’m counting that use
> > > DRM scheduler. This would be a huge coordinated effort.
> > > 
> > > So I see three options:
> > > 
> > > 1. Rename free_job to put_job and document usage. Rip out free_guilty.
> > >    Likely the easiest and least invasive.
> > 
> > I think I can live with that. In terms of work effort, it's the best we
> > can do right now. However, that does need proper documentation.
> 
> I think that is the worst of all options.
> 
> It basically says to the driver that the job lifetime problems created by the
> scheduler are the driver's problem and need to be worked around there.
> 

My POV still mostly is that (with the current design) the driver must
not use jobs after free_job() has been invoked. And when that call
happens is unpredictable.

To be fair, we already have strange refcount expectations. But I sort
of agree that there is no objectively good solution in sight.
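
For the option-1 route, what I'd imagine on the driver side is roughly
this (a minimal sketch; my_job, my_job_release and my_sched_free_job are
made-up names, not an existing API):

	struct my_job {
		struct drm_sched_job	base;
		struct kref		refcount;	/* kref_init() at job creation */
	};

	static void my_job_release(struct kref *kref)
	{
		struct my_job *job = container_of(kref, struct my_job, refcount);

		drm_sched_job_cleanup(&job->base);
		kfree(job);
	}

	/* ->free_job() (or ->put_job()): the scheduler only drops its reference */
	static void my_sched_free_job(struct drm_sched_job *sched_job)
	{
		struct my_job *job = container_of(sched_job, struct my_job, base);

		kref_put(&job->refcount, my_job_release);
	}

A driver's TDR or reset path would take its own reference before touching
the job, so it no longer matters when exactly the scheduler decides to
call the callback.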

> > 
> > Let me respin my documentation series and upstream it soonish, then
> > we can build on top of that.

> > 
> > 
> > P.
> > 
> > > 
> > > 2. Move reference counting to the base DRM scheduler job object, provide a
> > >    vfunc for the final job put, and document usage. Medium invasive.
> 
> I strongly think that reference counting the job object just because the 
> scheduler needs it is a bad idea.
> 
> With that we are just moving the hot potato from one side of our mouth to the 
> other without really solving the issue.
> 
> If a driver like XE needs that for some reason then that is perfectly fine.

Nouveau doesn't need it either.

> 
> > > 3. Move job (driver) side tracking to the scheduler fence and let it
> > >    control the lifetime. Very invasive.
> 
> Thinking about it more, that is actually not so much of a problem.
> 
> Let me try to code something together by the end of next week or so.

The hero Gotham needs :)


P.
