[AMD Official Use Only - General]

Hi @Deucher, Alexander and @Koenig, Christian,
Could you help review this patch?

Without this patch, when a customer sets the `reset_method=3` modprobe parameter to select mode2 reset, RAS recovery also uses mode2 reset and skips mode1 reset. When an ECC error occurs, the GPU cannot be recovered with mode2 reset, and because mode1 reset is skipped, the GPU reset fails.

This patch makes RAS recovery (ECC error) always use mode1 reset when `reset_method=3` is set.

Thanks
Sam

From: Feng, Kenneth <[email protected]>
Date: Monday, April 29, 2024 at 16:15
To: Feng, Kenneth <[email protected]>, [email protected] <[email protected]>, Zhang, GuoQing (Sam) <[email protected]>
Cc: Zhang, Owen(SRDC) <[email protected]>, Aldabagh, Maad <[email protected]>, Ma, Qing (Mark) <[email protected]>
Subject: RE: [PATCH 2/2] drm/amd/amdgpu: use the default reset for ras recovery

[AMD Official Use Only - General]

+@Zhang, GuoQing (Sam)

-----Original Message-----
From: Kenneth Feng <[email protected]>
Sent: Monday, April 29, 2024 3:32 PM
To: [email protected]
Cc: Zhang, Owen(SRDC) <[email protected]>; Aldabagh, Maad <[email protected]>; Ma, Qing (Mark) <[email protected]>; Feng, Kenneth <[email protected]>
Subject: [PATCH 2/2] drm/amd/amdgpu: use the default reset for ras recovery

use the default reset for ras recovery

Signed-off-by: Kenneth Feng <[email protected]>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index a037e8fba29f..f92b2c4f0d5c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2437,6 +2437,7 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
 	struct amdgpu_device *adev = ras->adev;
 	struct list_head device_list, *device_list_handle = NULL;
 	struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev);
+	int save_reset_method = amdgpu_reset_method;
 
 	if (hive) {
 		atomic_set(&hive->ras_recovery, 1);
@@ -2501,7 +2502,13 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
 			}
 		}
 
+		if (amdgpu_gpu_recovery == 2)
+			amdgpu_reset_method = -1;
+
 		amdgpu_device_gpu_recover(ras->adev, NULL, &reset_context);
+
+		if (amdgpu_gpu_recovery == 2)
+			amdgpu_reset_method = save_reset_method;
 	}
 	atomic_set(&ras->in_recovery, 0);
 	if (hive) {
-- 
2.34.1
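
For review purposes, the save/override/restore pattern in the second hunk can be modeled outside the kernel. The following is only a minimal user-space sketch, not driver code: `reset_method` and `gpu_recovery` are hypothetical stand-ins for `amdgpu_reset_method` and `amdgpu_gpu_recovery`, and `gpu_recover_stub()` is an invented placeholder for `amdgpu_device_gpu_recover()` that simply records which reset method it observed.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the module parameters touched by the patch. */
static int reset_method = 3;   /* models amdgpu_reset_method; 3 = mode2 per the mail */
static int gpu_recovery = 2;   /* models amdgpu_gpu_recovery */

/* Records the reset method that the stubbed recovery call observed. */
static int last_method_seen;

/* Invented placeholder for amdgpu_device_gpu_recover(). */
static void gpu_recover_stub(void)
{
	last_method_seen = reset_method;
	printf("recovering with reset_method=%d\n", reset_method);
}

/*
 * Mirrors the control flow added by the patch: save the user's setting,
 * force the default (-1) for the duration of the recovery call when
 * gpu_recovery == 2, then restore the saved value.
 * Returns the method the recovery call actually saw.
 */
static int ras_do_recovery_sketch(void)
{
	int save_reset_method = reset_method;

	if (gpu_recovery == 2)
		reset_method = -1;

	gpu_recover_stub();

	if (gpu_recovery == 2)
		reset_method = save_reset_method;

	return last_method_seen;
}
```

In this model the override is a plain write to a shared global, so any other reader of `reset_method` during the recovery window would also see -1. Whether that matters in the driver depends on what else can read `amdgpu_reset_method` concurrently, which this sketch does not attempt to answer.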
