Re: [PATCH] drm/amdgpu: report bad status in GPU recovery

Lazar, Lijo Wed, 31 Jul 2024 06:31:24 -0700

On 7/31/2024 3:35 PM, Tao Zhou wrote:
> Instead of printing GPU reset failed.
> 
> Signed-off-by: Tao Zhou <[email protected]>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 355c2478c4b6..b7c967779b4b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5933,8 +5933,13 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>               tmp_adev->asic_reset_res = 0;
>  
>               if (r) {
> -                     /* bad news, how to tell it to userspace ? */
> -                     dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", atomic_read(&tmp_adev->gpu_reset_counter));
> +                     /* bad news, how to tell it to userspace ?
> +                      * for ras error, we should report GPU bad status instead of
> +                      * reset failure
> +                      */
> +                     if (!amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
> +                             dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
> +                                     atomic_read(&tmp_adev->gpu_reset_counter));

It would be better to also check reset_context->src == AMDGPU_RESET_SRC_RAS
to confirm that the reset was indeed triggered by a RAS error.
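
A minimal sketch of the combined condition, written as a standalone C
model rather than kernel code. The names mock_reset_context,
MOCK_RESET_SRC_RAS, MOCK_RESET_SRC_OTHER and should_report_reset_failure
are hypothetical stand-ins, and the bool parameter stands in for the result
of amdgpu_ras_eeprom_check_err_threshold(); how the two checks are combined
is an assumption, not the final patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the reset source carried by the reset context. */
enum mock_reset_src {
	MOCK_RESET_SRC_OTHER,
	MOCK_RESET_SRC_RAS,
};

/* Hypothetical stand-in holding only the field this sketch needs. */
struct mock_reset_context {
	enum mock_reset_src src;
};

/*
 * Returns true when the generic "GPU reset failed" message should be printed.
 * The message is suppressed only when the reset was triggered by RAS AND the
 * RAS error threshold was reached; every other failure is still reported.
 */
bool should_report_reset_failure(const struct mock_reset_context *ctx,
				 bool ras_threshold_reached)
{
	if (ctx->src == MOCK_RESET_SRC_RAS && ras_threshold_reached)
		return false;

	return true;
}
```

With this shape, a failed reset from any non-RAS source still prints the
failure message even if the bad page threshold happens to be reached.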

Thanks,
Lijo

>                       amdgpu_vf_error_put(tmp_adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
>               } else {
>                       dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));