amdgpu: fix deadlock caused by overflow

Andrey Grodzovsky Fri, 16 Sep 2022 06:50:27 -0700


On 2022-09-16 01:18, Christian König wrote:

Am 15.09.22 um 22:37 schrieb Andrey Grodzovsky:
On 2022-09-15 15:26, Christian König wrote:
Am 15.09.22 um 20:29 schrieb Andrey Grodzovsky:
On 2022-09-15 06:09, Zhao, Victor wrote:
[AMD Official Use Only - General]

Hi Christian,
The test sequence is executing a compute engine hang while runninga lot of containers submitting gfx jobs. We have advanced tdr modeand mode2 reset enabled on driver.When a compute hang job timeout happens, the 2 jobs on the gfxpending list maybe signaled after drm_sched_stop. So they will notbe removed from pending list but have theDMA_FENCE_FLAG_SIGNALED_BIT set.At the amdgpu_device_recheck_guilty_jobs step, the first job willbe rerun and removed from pending list.At the resubmit setp, the second job (with signaled bit) will beresubmitted. Since it still has signaled bit, drm_sched_job_donewill be called directly. This decrease the hw_rq_count whichallows more jobs emitted but did not clean fence_drv rcu ptr.This results in an overflow in the fence_drv. Since we will usenum_fences_mask in amdgpu_fence_process, when overflow happens,the signal of some job will be skipped which result in an infinitewait for the fence_drv rcu ptr.
So close irq before sched_stop could avoid signal jobs afterdrm_sched_stop. And signal job one by one in fence_process insteadof using a mask will handle the overflow situation.
Another fix could be skip submitting jobs which already signaledduring resubmit stage, which may look cleaner.
Please help give some advice.
How about the code bellow instead ? The real problem is that wereuse a dma fence twice which is not according to fma fence design,so maybe this can help ?
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.cb/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 8adeb7469f1e..033f0ae16784 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring*ring, struct dma_fence **f, struct amd
        if (job && job->job_run_counter) {
                /* reinit seq for resubmitted jobs */
                fence->seqno = seq;
+
+               /* For resubmitted job clear the singled bit */
+ celar_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT,&fence->flags);
+
Upstream will pretty much kill you for that.
Re-setting a fence from a signaled to an unsignaled state is amassive no-go.
Christian.
Is it worse then doing fence->seqno = seq; ? This is already a hugehack , no ?
No, it's as equally bad. I don't think we can do either.

Christian.

And all those ugly hack are there because we reuse a dma_fence (hw_fenceembedded into the job) and correct me if I am wrong

but I don't think dma_fence is ever supposed to be reused.

So maybe like Victor suggested we should move close and flush irq beforesched_stop - this in my opinion should solve the issue, but Victor - whythen you still need the change in amdgpu_fence_process ? You will nothave the overflow situation because by moving irq_disable before stopany job that signaled will be removed from the scheduler pending listanyway. Also not that this change reverts 'drm/amdgpu: sanitize fencenumbers' and could reintroduce that bug.


Andrey

Andrey
/* TO be inline with external fence creation andother drivers */
                dma_fence_get(fence);
        } else {


Andrey
Thanks,
Victor



-----Original Message-----
From: Koenig, Christian <christian.koe...@amd.com>
Sent: Thursday, September 15, 2022 2:32 PM
To: Zhao, Victor <victor.z...@amd.com>;amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey<andrey.grodzov...@amd.com>
Cc: Deng, Emily <emily.d...@amd.com>
Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow



Am 15.09.22 um 06:02 schrieb Zhao, Victor:
[AMD Official Use Only - General]

Ping.

Hi @Koenig, Christian and @Grodzovsky, Andrey,
We found some reset related issues during stress test on thesequence. Please help give some comments.
Thanks,
Victor



-----Original Message-----
From: Victor Zhao <victor.z...@amd.com>
Sent: Wednesday, September 14, 2022 6:10 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deng, Emily <emily.d...@amd.com>; Grodzovsky, Andrey
<andrey.grodzov...@amd.com>; Zhao, Victor <victor.z...@amd.com>
Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

[background]
For a gpu recovery caused by a hang on one ring (e.g. compute),jobs from another ring (e.g. gfx) may continue signaling duringdrm_sched_stop stage. The signal bit will not be cleared.
At the resubmit stage after recovery, the job with hw fencesignaled bit set will call job done directly instead go throughfence process.This makes the hw_rq_count decrease but rcu fence pointer notcleared yet.
Then overflow happens in the fence driver slots and some jobs maybe skipped and leave the rcu pointer not cleared which makes aninfinite wait for the slot on the next fence emitted.
This infinite wait cause a job timeout on the emitting job. Anddriver will stuck at the its sched stop step because kthread_parkcannot be done.
[how]
1. move amdgpu_fence_driver_isr_toggle earlier to close interrupt
before drm sched stop 2. handle all fences in fence process to aviod
skip when overflow happens

Signed-off-by: Victor Zhao <victor.z...@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16+++++++++++++--- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 6+-----
   2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 943c9e750575..c0cfae52f12b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(structamdgpu_device *adev,
           amdgpu_virt_fini_data_exchange(adev);
       }
   -    amdgpu_fence_driver_isr_toggle(adev, true);
-
       /* block all schedulers and reset given job's ring */
       for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
struct amdgpu_ring *ring = adev->rings[i]; @@ -5214,6+5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
amdgpu_device_ip_need_full_reset(tmp_adev))
               amdgpu_ras_suspend(tmp_adev);
   +        amdgpu_fence_driver_isr_toggle(tmp_adev, true);
+
           for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
               struct amdgpu_ring *ring = tmp_adev->rings[i];
@@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(structamdgpu_device *adev,
atomic_inc(&tmp_adev->gpu_reset_counter);
       }
   -    if (need_emergency_restart)
+    if (need_emergency_restart) {
+ list_for_each_entry (tmp_adev, device_list_handle,reset_list) {
+            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
+        }
           goto skip_sched_resume;
+    }
          /*
* Must check guilty signal here since after this pointall old @@ -5240,6 +5244,9 @@ intamdgpu_device_gpu_recover(struct amdgpu_device *adev,
       if (job && dma_fence_is_signaled(&job->hw_fence)) {
           job_signaled = true;
dev_info(adev->dev, "Guilty job already signaled,skipping HW
reset");
+ list_for_each_entry (tmp_adev, device_list_handle,reset_list) {
+            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
+        }
           goto skip_hw_reset;
       }
@@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(structamdgpu_device *adev,
           if (r && r == -EAGAIN) {
set_bit(AMDGPU_SKIP_MODE2_RESET,&reset_context->flags);
               adev->asic_reset_res = 0;
+            amdgpu_fence_driver_isr_toggle(adev, true);
               goto retry;
           }
       }
@@ -5711,6 +5719,8 @@ pci_ers_result_tamdgpu_pci_slot_reset(struct pci_dev *pdev)
       set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
       set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
   +    amdgpu_fence_driver_isr_toggle(adev, true);
+
       adev->no_hw_access = true;
       r = amdgpu_device_pre_asic_reset(adev, &reset_context);
       adev->no_hw_access = false;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 8adeb7469f1e..65a877e1a7fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -287,15 +287,11 @@ bool amdgpu_fence_process(structamdgpu_ring *ring)
       if (unlikely(seq == last_seq))
           return false;
   -    last_seq &= drv->num_fences_mask;
-    seq &= drv->num_fences_mask;
-
       do {
           struct dma_fence *fence, **ptr;
              ++last_seq;
-        last_seq &= drv->num_fences_mask;
-        ptr = &drv->fences[last_seq];
+        ptr = &drv->fences[last_seq & drv->num_fences_mask];
/* There is always exactly one thread signalingthis fence slot */
           fence = rcu_dereference_protected(*ptr, 1);
Those changes here doesn't seem to make sense. Please explainfurther why that is necessary.
Christian.
--
2.25.1

Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

Reply via email to