amdgpu: fix potential bad job hw_fence underflow

Andrey Grodzovsky Fri, 22 Oct 2021 13:42:10 -0700

What do you mean by underflow in this case? Do you mean a use-after-free because of an extra dma_fence_put()?


On 2021-10-22 4:14 a.m., JingWen Chen wrote:

ping

On 2021/10/22 AM11:33, Jingwen Chen wrote:

[Why]
In advanced TDR mode, the real bad job is resubmitted twice, and
drm_sched_resubmit_jobs_ext contains a dma_fence_put, so the bad job's
fence is put one more time than those of other jobs.

[How]
Add a dma_fence_get before resubmitting the job in
amdgpu_device_recheck_guilty_jobs, and put the fence for normal jobs.

Signed-off-by: Jingwen Chen <[email protected]>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
  1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 41ce86244144..975f069f6fe8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(

/* clear job's guilty and depend the folowing step to decide the real one */

                drm_sched_reset_karma(s_job);
+               /* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
+                * to make sure fence is balanced */

But that put in drm_sched_resubmit_jobs_ext is for the previous parent fence. fence = sched->ops->run_job(s_job); returns a new HW fence, and the put drops the refcount on the old one.


Andrey

+               dma_fence_get(s_job->s_fence->parent);
                drm_sched_resubmit_jobs_ext(&ring->sched, 1);

ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);

@@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(

/* got the hw fence, signal finished fence */

                atomic_dec(ring->sched.score);
+               dma_fence_put(s_job->s_fence->parent);
                dma_fence_get(&s_job->s_fence->finished);
                dma_fence_signal(&s_job->s_fence->finished);
                dma_fence_put(&s_job->s_fence->finished);
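
The inline objection above (the put in drm_sched_resubmit_jobs_ext releases the previous parent fence, and run_job hands back a new one) can be illustrated with a small userspace toy model. This is only a sketch of that argument, not kernel code: toy_fence, toy_job, toy_resubmit and toy_resubmit_n are hypothetical names, and the model assumes that every run_job call returns a freshly allocated fence holding one reference.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a refcounted fence; NOT the kernel's struct dma_fence. */
struct toy_fence {
	int refcount;
};

/* Number of toy fences currently allocated, used to observe leaks. */
static int toy_live_fences;

/* Assumption: models run_job() returning a brand-new fence with one ref. */
static struct toy_fence *toy_fence_new(void)
{
	struct toy_fence *f = malloc(sizeof(*f));

	if (!f)
		abort();
	f->refcount = 1;
	toy_live_fences++;
	return f;
}

/* Drop one reference; free the fence when the count reaches zero. */
static void toy_fence_put(struct toy_fence *f)
{
	if (!f)
		return;
	assert(f->refcount > 0); /* would fire on a refcount underflow */
	if (--f->refcount == 0) {
		toy_live_fences--;
		free(f);
	}
}

/* Hypothetical job that owns one reference on its current parent fence. */
struct toy_job {
	struct toy_fence *parent;
};

/* Models one resubmit: put the OLD parent, then adopt the NEW fence. */
static void toy_resubmit(struct toy_job *job)
{
	toy_fence_put(job->parent);    /* releases the previous parent */
	job->parent = toy_fence_new(); /* new fence, one ref held by the job */
}

/*
 * Resubmit a job n times and return the refcount of its final parent
 * fence, then release that fence so nothing is left allocated.
 */
int toy_resubmit_n(int n)
{
	struct toy_job job = { .parent = toy_fence_new() };
	int i, refs;

	for (i = 0; i < n; i++)
		toy_resubmit(&job);

	refs = job.parent->refcount;
	toy_fence_put(job.parent);
	return refs;
}
```

Under these assumptions, toy_resubmit_n(2) leaves the job with the same single reference as toy_resubmit_n(1), and no put ever hits a fence whose count is already zero. Whether the real driver matches the model depends on run_job actually returning a new fence each time, which is exactly the point in question in this thread.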
