kfd_is_locked() keeps returning true if the guilty job's fence signals
during the reset sequence. In this scenario the HW reset is skipped, so
amdgpu_device_reset_sriov(), which calls amdgpu_amdkfd_post_reset(),
is never called.

On bare metal, amdgpu_device_gpu_resume() calls amdgpu_amdkfd_post_reset().

Call amdgpu_amdkfd_post_reset() on the skip_hw_reset path when running
as a VF and the job has already signaled.

Signed-off-by: Harish Kasiviswanathan <[email protected]>
---
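Note for reviewers: the imbalance described above can be illustrated with a
small user-space model. This is only a sketch under stated assumptions, not
kernel code: the counter and every model_* name are hypothetical stand-ins for
the KFD lock count and the pre_reset/post_reset pairing, and the with_fix flag
exists only to compare behaviour before and after this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the KFD lock count; not the kernel variable. */
static int kfd_locked;

static void model_pre_reset(void)
{
	kfd_locked++;
}

static void model_post_reset(void)
{
	/* A post_reset without a matching pre_reset would be a bug. */
	assert(kfd_locked > 0);
	kfd_locked--;
}

static bool model_is_locked(void)
{
	return kfd_locked > 0;
}

/*
 * Simplified model of the recovery flow described in the commit message.
 * with_fix selects whether the skip_hw_reset path unlocks KFD on a VF.
 */
static void model_gpu_recover(bool is_vf, bool job_signaled, bool with_fix)
{
	model_pre_reset();

	if (!job_signaled) {
		/* Real HW reset: on a VF the SR-IOV reset path unlocks KFD. */
		if (is_vf)
			model_post_reset();
	} else if (is_vf && with_fix) {
		/* HW reset skipped: the patch adds the missing unlock here. */
		model_post_reset();
	}

	/* On bare metal the resume step unlocks KFD in both cases. */
	if (!is_vf)
		model_post_reset();
}
```

In this model, calling model_gpu_recover(true, true, false) leaves
model_is_locked() true, which corresponds to the "KFD is locked" failure,
while the same call with with_fix set leaves the counter balanced.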
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index dc8c650fc341..cefe1e5dd946 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5888,6 +5888,16 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
        if (r)
                goto reset_unlock;
 skip_hw_reset:
+       /*
+        * For VF, gpu_resume skips amdgpu_amdkfd_post_reset (normally done
+        * inside amdgpu_device_reset_sriov during actual HW reset). Since HW
+        * reset was skipped, we must unlock KFD here to undo the kfd_locked++
+        * from pre_reset, otherwise KFD stays locked permanently and new
+        * process creation fails with "KFD is locked".
+        */
+       if (job_signaled && amdgpu_sriov_vf(adev))
+               amdgpu_amdkfd_post_reset(adev);
+
        r = amdgpu_device_sched_resume(&device_list, reset_context, job_signaled);
        if (r)
                goto reset_unlock;
-- 
2.43.0
