On Mon, Jun 8, 2026 at 3:09 PM Harish Kasiviswanathan
<[email protected]> wrote:
>
> kfd_is_locked remains true if the guilty job's fence signals during the
> reset sequence. In this scenario, the HW reset is skipped, so
> amdgpu_device_reset_sriov(), which calls amdgpu_amdkfd_post_reset(),
> is never called.
>
> On bare metal, amdgpu_device_gpu_resume() calls amdgpu_amdkfd_post_reset().
>
> Call amdgpu_amdkfd_post_reset() on a VF when the HW reset is skipped
> because the job has already signaled.
>
> Signed-off-by: Harish Kasiviswanathan <[email protected]>

Acked-by: Alex Deucher <[email protected]>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index dc8c650fc341..cefe1e5dd946 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5888,6 +5888,16 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>         if (r)
>                 goto reset_unlock;
>  skip_hw_reset:
> +       /*
> +        * For VF, gpu_resume skips amdgpu_amdkfd_post_reset (normally done
> +        * inside amdgpu_device_reset_sriov during actual HW reset). Since HW
> +        * reset was skipped, we must unlock KFD here to undo the kfd_locked++
> +        * from pre_reset, otherwise KFD stays locked permanently and new
> +        * process creation fails with "KFD is locked".
> +        */
> +       if (job_signaled && amdgpu_sriov_vf(adev))
> +               amdgpu_amdkfd_post_reset(adev);
> +
>         r = amdgpu_device_sched_resume(&device_list, reset_context, job_signaled);
>         if (r)
>                 goto reset_unlock;
> --
> 2.43.0
>
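
As a rough illustration of the imbalance described in the commit message, here is a toy userspace model of the lock counter. It is only a sketch: the toy_* names, the struct, and the with_fix flag are hypothetical and are not part of the amdgpu or KFD code. It assumes pre_reset increments a counter and post_reset decrements it, as the comment in the patch describes.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the KFD lock counter; not the real kernel structure. */
struct toy_kfd {
	int locked;
};

/* Models the kfd_locked++ that the patch comment attributes to pre_reset. */
static void toy_pre_reset(struct toy_kfd *kfd)
{
	kfd->locked++;
}

/* Models the matching unlock; the counter must never go negative. */
static void toy_post_reset(struct toy_kfd *kfd)
{
	assert(kfd->locked > 0);
	kfd->locked--;
}

/*
 * Hypothetical model of one recovery pass on a VF.
 * job_signaled: the guilty job's fence signaled, so the HW reset is skipped.
 * with_fix:     whether the skip_hw_reset path also unlocks (the patch).
 * Returns the lock count left behind after recovery.
 */
static int toy_recover_vf(struct toy_kfd *kfd, bool job_signaled, bool with_fix)
{
	toy_pre_reset(kfd);

	if (!job_signaled) {
		/* HW reset path: the SR-IOV reset helper performs the unlock. */
		toy_post_reset(kfd);
	} else if (with_fix) {
		/* skip_hw_reset path with the patch applied. */
		toy_post_reset(kfd);
	}
	/* Without the fix, the skip path leaves the counter incremented. */

	return kfd->locked;
}
```

In this model, calling toy_recover_vf() with job_signaled set and with_fix clear leaves the counter nonzero, which corresponds to the "KFD is locked" failure the patch addresses.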
