On Mon, Jun 8, 2026 at 3:09 PM Harish Kasiviswanathan
<[email protected]> wrote:
>
> kfd_is_locked remains locked, if the guilty job fence signals during the
> reset sequence. In this scenario, hw_reset is skipped and
> amdgpu_device_reset_sriov() which calls amdgpu_amdkfd_post_reset()
> doesn't get called.
>
> In bare metal, amdgpu_device_gpu_resume() calls amdgpu_amdkfd_post_reset()
>
> Call amdgpu_amdkfd_post_reset() under this condition
>
> Signed-off-by: Harish Kasiviswanathan <[email protected]>

Acked-by: Alex Deucher <[email protected]>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index dc8c650fc341..cefe1e5dd946 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5888,6 +5888,16 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>                  if (r)
>                          goto reset_unlock;
>  skip_hw_reset:
> +       /*
> +        * For VF, gpu_resume skips amdgpu_amdkfd_post_reset (normally done
> +        * inside amdgpu_device_reset_sriov during actual HW reset). Since HW
> +        * reset was skipped, we must unlock KFD here to undo the kfd_locked++
> +        * from pre_reset, otherwise KFD stays locked permanently and new
> +        * process creation fails with "KFD is locked".
> +        */
> +       if (job_signaled && amdgpu_sriov_vf(adev))
> +               amdgpu_amdkfd_post_reset(adev);
> +
>         r = amdgpu_device_sched_resume(&device_list, reset_context, job_signaled);
>         if (r)
>                 goto reset_unlock;
> --
> 2.43.0
>
