On 1/14/21 8:37 AM, Horace Chen wrote:
Fix a racing issue when jobs on 2 rings timeout simultaneously.

If 2 rings timed out at the same time, the
amdgpu_device_gpu_recover will be reentered. Then the
adev->gmc.xgmi.head will be grabbed by 2 local linked list,
which may cause wild pointer issue in iterating.

lock the device earily to prevent the node be added to 2
different lists.

for xgmi there is a hive lock which can promise there won't have
2 devices on same hive reenter the interation. So xgmi doesn't
need to lock the device.


Note that amdgpu_device_lock_adev does bunch of other stuff besides taking
the lock and I don't think we want to skip them for the other devices in case of XGMI.

Andrey



Signed-off-by: Horace Chen <horace.c...@amd.com>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 ++++++++-------
  1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 4d434803fb49..a28e138ac72c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4591,19 +4591,20 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
*adev,
                        list_rotate_to_front(&adev->gmc.xgmi.head, 
&hive->device_list);
                device_list_handle = &hive->device_list;
        } else {
+               /* if current dev is already in reset, skip adding list to 
prevent race issue */
+               if (!amdgpu_device_lock_adev(adev, hive)) {
+                       dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as 
another already in progress",
+                                       job ? job->base.id : -1);
+                       r = 0;
+                       goto skip_recovery;
+               }
+
                list_add_tail(&adev->gmc.xgmi.head, &device_list);
                device_list_handle = &device_list;
        }
/* block all schedulers and reset given job's ring */
        list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
-               if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
-                       dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as 
another already in progress",
-                                 job ? job->base.id : -1);
-                       r = 0;
-                       goto skip_recovery;
-               }
-
                /*
                 * Try to put the audio codec into suspend state
                 * before gpu reset started.
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Reply via email to