amdgpu: add RAS page retirement functions for MCA

Zhang, Hawking Thu, 20 Oct 2022 21:15:31 -0700

[AMD Official Use Only - General]

Re - whether need to do gpu reset is determined by unmap queue status, so reset 
parameter can't be dropped


+       if (adev->gmc.xgmi.connected_to_cpu) {
+               ret = amdgpu_umc_poison_handler_mca(adev, ras_error_status, 
reset);

I think amdgpu_umc_poison_handler_mca is fallback handler specific for MCA 
platform, right? 

I noticed there is platform check already in amdgpu_umc_poison_handler with the 
reset flag. so driver already knows whether the reset is needed, right.
That's why I think "ras_error_status", "reset" are all not necessary. You can 
either call the followings directly by checking connected_to_cpu && reset,

+               kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
+               amdgpu_ras_reset_gpu(adev);

Or still provide a wrapper like amdgpu_umc_poison_handler_mca for above two 
calls.

The latter seems redundant as well. I mean, we don't need to maintain a 
specific API for poison handling fallback on MCA platform - Aldebaran is
the last SOC that supports this A + A RAS design. I can confirm we'll move to a 
new design going forward.

Regards,
Hawking

-----Original Message-----
From: Zhou1, Tao <[email protected]> 
Sent: Friday, October 21, 2022 10:54
To: Zhang, Hawking <[email protected]>; [email protected]; 
Yang, Stanley <[email protected]>; Chai, Thomas <[email protected]>; Li, 
Candice <[email protected]>
Subject: RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for MCA

[AMD Official Use Only - General]


> -----Original Message-----
> From: Zhang, Hawking <[email protected]>
> Sent: Thursday, October 20, 2022 5:13 PM
> To: Zhou1, Tao <[email protected]>; [email protected]; 
> Yang, Stanley <[email protected]>; Chai, Thomas 
> <[email protected]>; Li, Candice <[email protected]>
> Subject: RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions 
> for MCA
> 
> [AMD Official Use Only - General]
> 
> +static int amdgpu_umc_poison_handler_mca(struct amdgpu_device *adev,
> +             struct ras_err_data *err_data, bool reset)
> 
> 
> +     if (adev->gmc.xgmi.connected_to_cpu) {
> +             ret = amdgpu_umc_poison_handler_mca(adev,
> ras_error_status, reset);
> 
> The input parameters "reset" and "err_data" can be dropped since 
> amdgpu_umc_poison_handler_mca is dedicated to MCA platform

[Tao] whether need to do gpu reset is determined by unmap queue status, so 
reset parameter can't be dropped.
For "err_data", it can be removed currently, but I'm afraid we may need it on 
other ASICs in the future.

> 
> Regards,
> Hawking
> 
> -----Original Message-----
> From: Zhou1, Tao <[email protected]>
> Sent: Wednesday, October 19, 2022 16:12
> To: [email protected]; Zhang, Hawking 
> <[email protected]>; Yang, Stanley <[email protected]>; Chai, 
> Thomas <[email protected]>; Li, Candice <[email protected]>
> Cc: Zhou1, Tao <[email protected]>
> Subject: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for 
> MCA
> 
> Define page retirement functions for MCA platform.
> 
> v2: remove page retirement handling from MCA poison handler,
>     let MCA notifier do page retirement.
> 
> Signed-off-by: Tao Zhou <[email protected]>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 67
> +++++++++++++++++++++++++  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> |  2 +
>  2 files changed, 69 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> index aad3c8b4c810..9494fa14db9a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> @@ -22,6 +22,73 @@
>   */
> 
>  #include "amdgpu.h"
> +#include "umc_v6_7.h"
> +
> +static int amdgpu_umc_convert_error_address(struct amdgpu_device *adev,
> +                                 struct ras_err_data *err_data, uint64_t
> err_addr,
> +                                 uint32_t ch_inst, uint32_t umc_inst) {
> +     switch (adev->ip_versions[UMC_HWIP][0]) {
> +     case IP_VERSION(6, 7, 0):
> +             umc_v6_7_convert_error_address(adev,
> +                             err_data, err_addr, ch_inst, umc_inst);
> +             break;
> +     default:
> +             dev_warn(adev->dev,
> +                      "UMC address to Physical address translation is not
> supported\n");
> +             return AMDGPU_RAS_FAIL;
> +     }
> +
> +     return AMDGPU_RAS_SUCCESS;
> +}
> +
> +int amdgpu_umc_page_retirement_mca(struct amdgpu_device *adev,
> +                     uint64_t err_addr, uint32_t ch_inst, uint32_t umc_inst)
> {
> +     struct ras_err_data err_data = {0, 0, 0, NULL};
> +     int ret = AMDGPU_RAS_FAIL;
> +
> +     err_data.err_addr =
> +             kcalloc(adev->umc.max_ras_err_cnt_per_query,
> +                     sizeof(struct eeprom_table_record), GFP_KERNEL);
> +     if (!err_data.err_addr) {
> +             dev_warn(adev->dev,
> +                     "Failed to alloc memory for umc error record in MCA
> notifier!\n");
> +             return AMDGPU_RAS_FAIL;
> +     }
> +
> +     /*
> +      * Translate UMC channel address to Physical address
> +      */
> +     ret = amdgpu_umc_convert_error_address(adev, &err_data, err_addr,
> +                                     ch_inst, umc_inst);
> +     if (ret)
> +             goto out;
> +
> +     if (amdgpu_bad_page_threshold != 0) {
> +             amdgpu_ras_add_bad_pages(adev, err_data.err_addr,
> +                                             err_data.err_addr_cnt);
> +             amdgpu_ras_save_bad_pages(adev);
> +     }
> +
> +out:
> +     kfree(err_data.err_addr);
> +     return ret;
> +}
> +
> +static int amdgpu_umc_poison_handler_mca(struct amdgpu_device *adev,
> +             struct ras_err_data *err_data, bool reset) {
> +     /* MCA poison handler is only responsible for GPU reset,
> +      * let MCA notifier do page retirement.
> +      */
> +     if (reset) {
> +             kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
> +             amdgpu_ras_reset_gpu(adev);
> +     }
> +
> +     return AMDGPU_RAS_SUCCESS;
> +}
> 
>  static int amdgpu_umc_do_page_retirement(struct amdgpu_device *adev,
>               void *ras_error_status,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> index 3629d8f292ef..659a10de29c9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> @@ -98,4 +98,6 @@ void amdgpu_umc_fill_error_record(struct 
> ras_err_data *err_data,  int amdgpu_umc_process_ras_data_cb(struct 
> amdgpu_device *adev,
>               void *ras_error_status,
>               struct amdgpu_iv_entry *entry);
> +int amdgpu_umc_page_retirement_mca(struct amdgpu_device *adev,
> +                     uint64_t err_addr, uint32_t ch_inst, uint32_t umc_inst);
>  #endif
> --
> 2.35.1

RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for MCA

Reply via email to