issues:
there are two issue may lead to TDR failure for SRIOV
1) gpu_recover() is re-entered by the mailbox interrupt
handler mxgpu_nv.c
2) MEC is ruined by the amdkfd_pre_reset after VF FLR done

fix:
for 1) we need to bypass the gpu_recover() invoke in mailbox
interrupt as long as the timeout is not infinite (thus the TDR
will be triggered automatically after time out, no need to invoke
gpu_recover() through mailbox interrupt.

for 2) amdkfd_pre_reset() would ruin MEC after hypervisor finished
the VF FLR, the correct sequence is do amdkfd_pre_reset before VF FLR
but there is a limitation to block this sequence:
if we do pre_reset() before VF FLR, it would go KIQ way to do register
access and stuck there, because KIQ probably won't work by that time
(e.g. you already made GFX hang)

Signed-off-by: Monk Liu <[email protected]>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 --
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      | 6 +++++-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 605cef6..ae962b9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3672,8 +3672,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device 
*adev,
        if (r)
                return r;
 
-       amdgpu_amdkfd_pre_reset(adev);
-
        /* Resume IP prior to SMC */
        r = amdgpu_device_ip_reinit_early_sriov(adev);
        if (r)
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index 0d8767e..1c3a7d4 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -269,7 +269,11 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct 
*work)
        }
 
        /* Trigger recovery for world switch failure if no TDR */
-       if (amdgpu_device_should_recover_gpu(adev))
+       if (amdgpu_device_should_recover_gpu(adev)
+               && (adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT ||
+               adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT ||
+               adev->compute_timeout == MAX_SCHEDULE_TIMEOUT ||
+               adev->video_timeout == MAX_SCHEDULE_TIMEOUT))
                amdgpu_device_gpu_recover(adev, NULL);
 }
 
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Reply via email to