[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Lazar, Lijo <[email protected]>
> Sent: Thursday, December 11, 2025 10:34 AM
> To: Deucher, Alexander <[email protected]>; amd-
> [email protected]
> Cc: SHANMUGAM, SRINIVASAN <[email protected]>;
> Prosyak, Vitaly <[email protected]>; Koenig, Christian
> <[email protected]>; Matthew Brost <[email protected]>
> Subject: Re: [PATCH V2] drm/amdgpu: fix a job->pasid access race in gpu
> recovery
>
>
> On 12/11/2025 1:53 AM, Alex Deucher wrote:
> > Avoid a possible UAF in GPU recovery due to a race between the sched
> > timeout callback and the tdr work queue.
> >
> > The gpu recovery function calls drm_sched_stop() and later
> > drm_sched_start(). drm_sched_start() restarts the tdr queue which
> > will eventually free the job. If the tdr queue frees the job before
> > the time out callback completes, the job will be freed and we'll get a UAF
> > when accessing the pasid. Cache it early to avoid the UAF.
> >
> > Fixes: a72002cb181f ("drm/amdgpu: Make use of drm_wedge_task_info")
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Suggested-by: Matthew Brost <[email protected]>
> > Signed-off-by: Alex Deucher <[email protected]>
> > ---
> >
> > v2: Check the pasid rather than job (Lijo)
> >     Add fixes tag (Christian)
> >
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 8a851d7548c00..c6b1dd95c401d 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -6634,6 +6634,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >  	struct amdgpu_hive_info *hive = NULL;
> >  	int r = 0;
> >  	bool need_emergency_restart = false;
> > +	/* save the pasid here as the job may be freed before the end of the reset */
> > +	int pasid = job ? job->pasid : -EINVAL;
> >
> >  	/*
> >  	 * If it reaches here because of hang/timeout and a RAS error is
> > @@ -6734,8 +6736,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >  	if (!r) {
> >  		struct amdgpu_task_info *ti = NULL;
> >
> > -		if (job)
> > -			ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);
> > +		/*
> > +		 * The job may already be freed at this point via the sched tdr workqueue so
> > +		 * use the cached pasid.
> > +		 */
>
> amdgpu_device_gpu_recover() is run in tdr workqueue.
>
> Now if this is the case, someone has to explain the logic -
>
> Timeout is triggered here -
> https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/scheduler/sched_main.c#L559
>
> This calls amdgpu_job_timedout() -> amdgpu_device_gpu_recover()
>
> After that, there is this access to the job -
>
> https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/scheduler/sched_main.c#L566
>
> At least, in some condition, job is not expected to be freed. Then I'm not sure if this is the right fix.
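
To make the race the commit message describes concrete, below is a minimal userspace sketch of the pattern the patch relies on: cache a plain integer copy of the field before the point at which its owner may be freed, and only use the cached copy afterwards. The names (fake_job, free_job_worker, gpu_recover) are made up for illustration and are not the scheduler or amdgpu code; whether the real free_job path can actually run that early is exactly the question raised above.

/*
 * Minimal userspace sketch of the access pattern under discussion.
 * Not the amdgpu/scheduler code; all names are illustrative.
 * The "timeout handler" caches job->pasid up front because, after the
 * scheduler is restarted, the free path may free the job at any time.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct fake_job {
	int pasid;
};

/* Stands in for the scheduler's free_job work that may run once the
 * scheduler has been restarted (drm_sched_start() in the real driver). */
static void *free_job_worker(void *arg)
{
	struct fake_job *job = arg;

	free(job);		/* the job is gone from here on */
	return NULL;
}

/* Stands in for amdgpu_device_gpu_recover(): cache the pasid before the
 * point at which the job may be freed, then only use the cached copy. */
static int gpu_recover(struct fake_job *job, pthread_t *freer)
{
	/* Cache early, while the job is still guaranteed to be alive. */
	int pasid = job ? job->pasid : -1;
	int ret;

	/*
	 * "Restart the scheduler": from here on the free path may run at
	 * any time and free the job behind our back.
	 */
	ret = pthread_create(freer, NULL, free_job_worker, job);
	if (ret)
		return ret;

	/*
	 * Dereferencing job->pasid here would be a potential use-after-free;
	 * use the value cached above instead.
	 */
	printf("recovered task with pasid %d\n", pasid);
	return 0;
}

int main(void)
{
	struct fake_job *job = malloc(sizeof(*job));
	pthread_t freer;

	if (!job)
		return 1;
	job->pasid = 42;

	if (gpu_recover(job, &freer)) {
		free(job);	/* thread never started, clean up here */
		return 1;
	}
	pthread_join(freer, NULL);
	return 0;
}

The point of the pattern is simply that an int copy stays valid regardless of the job's lifetime, so the later reporting path never has to dereference the job. Whether that defensive copy is actually needed depends on the scheduler's job lifetime guarantees, which is the open question here.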
Which "someone" and which "some condition" are you referring to? It would be better to bring a proper justification and take this up as a separate refactoring task.

Best,
Srini
