kfd_is_locked remains locked if the guilty job's fence signals during the reset sequence. In this scenario the HW reset is skipped, so amdgpu_device_reset_sriov(), which calls amdgpu_amdkfd_post_reset(), is never called.
On bare metal, amdgpu_device_gpu_resume() calls amdgpu_amdkfd_post_reset().
Call amdgpu_amdkfd_post_reset() under this condition.

Signed-off-by: Harish Kasiviswanathan <[email protected]>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index dc8c650fc341..cefe1e5dd946 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5888,6 +5888,16 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	if (r)
 		goto reset_unlock;
 skip_hw_reset:
+	/*
+	 * For VF, gpu_resume skips amdgpu_amdkfd_post_reset (normally done
+	 * inside amdgpu_device_reset_sriov during actual HW reset). Since HW
+	 * reset was skipped, we must unlock KFD here to undo the kfd_locked++
+	 * from pre_reset, otherwise KFD stays locked permanently and new
+	 * process creation fails with "KFD is locked".
+	 */
+	if (job_signaled && amdgpu_sriov_vf(adev))
+		amdgpu_amdkfd_post_reset(adev);
+
 	r = amdgpu_device_sched_resume(&device_list, reset_context, job_signaled);
 	if (r)
 		goto reset_unlock;
-- 
2.43.0
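
The lock imbalance described in the commit message can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not driver code: `kfd_locked`, `pre_reset()`, `post_reset()` and `recover()` are hypothetical stand-ins that mimic the ordering described above (pre_reset increments the counter, and post_reset is reached either through the SR-IOV HW reset path or through the bare-metal resume path).

```c
#include <stdbool.h>

/* Hypothetical stand-in for the KFD lock counter; not the driver's real state. */
static int kfd_locked;

static void pre_reset(void)  { kfd_locked++; }  /* models the lock taken before reset */
static void post_reset(void) { kfd_locked--; }  /* models amdgpu_amdkfd_post_reset() */

/*
 * Toy model of the recovery flow described in the commit message.
 * Returns the lock count left behind; anything other than 0 means
 * KFD would stay locked after recovery.
 */
static int recover(bool is_vf, bool job_signaled, bool with_fix)
{
	kfd_locked = 0;
	pre_reset();

	if (!job_signaled) {
		/* HW reset runs; on a VF the SR-IOV reset path unlocks KFD. */
		if (is_vf)
			post_reset();
	} else if (with_fix && is_vf) {
		/* HW reset skipped: models the call added by this patch. */
		post_reset();
	}

	/* Resume step: per the commit message, only bare metal unlocks here. */
	if (!is_vf)
		post_reset();

	return kfd_locked;
}
```

In this model, `recover(true, true, false)` leaves the counter at 1 (the bug), while `recover(true, true, true)` and all bare-metal cases return 0.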
