Hi Andrey,

In amd-staging-dkms-4.18's merged list, I can't find 'drm/sched: Refactor ring 
mirror list handling', nor 'drm/sched: Rework HW fence processing'.
There are still many Call Traces in the new osdb, triggered in 
dma_fence_set_error. Do you have links to these patches?
Thanks.

BR,
Wentao


-----Original Message-----
From: Grodzovsky, Andrey <andrey.grodzov...@amd.com> 
Sent: Saturday, December 22, 2018 12:57 AM
To: Lou, Wentao <wentao....@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: dma_fence finished signaled by unexpected 
callback

I believe this issue would be resolved by my pending patch set, currently in 
review, specifically 'drm/sched: Refactor ring mirror list handling': already 
in the first TO handler it goes over all the rings, including the second 
timed-out ring, and removes all callbacks, including the bad job's callback. 
If by this time the bad job has signaled for some reason, it will already have 
been removed from the mirror list during drm_sched_process_job (take a look at 
'drm/sched: Rework HW fence processing.'), and hence will not be rerun in 
drm_sched_job_recovery (drm_sched_resubmit_jobs under the new name).

Andrey


On 12/21/2018 03:25 AM, wentalou wrote:
> When two rings hit a timeout at the same time, each triggered job_timedout
> separately, and each job_timedout called gpu_recover; one gpu_recover then
> blocked on the other's mutex_lock.
> The bad job's callback should be removed by dma_fence_remove_callback, but
> that call sits inside the mutex_lock, so dma_fence_remove_callback could
> not run immediately.
> The callback drm_sched_process_job was therefore triggered unexpectedly
> and set DMA_FENCE_FLAG_SIGNALED_BIT.
> After the other recovery's mutex_unlock, the signaled bad job went through
> job_run inside drm_sched_job_recovery.
> job_run then hit a WARN_ON and Call Trace when calling
> kcl_dma_fence_set_error on the signaled bad job.
>
> Change-Id: I6366add13f020476882b2b8b03330a58d072dd1a
> Signed-off-by: Wentao Lou <wentao....@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 0a17fb1..fc1d3a0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -225,8 +225,11 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
>   
>   	trace_amdgpu_sched_run_job(job);
>   
> -	if (job->vram_lost_counter != atomic_read(&ring->adev->vram_lost_counter))
> +	if (job->vram_lost_counter != atomic_read(&ring->adev->vram_lost_counter)) {
> +		/* flags might be signaled by unexpected callback, clear it */
> +		test_and_clear_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &finished->flags);
>  		dma_fence_set_error(finished, -ECANCELED);/* skip IB as well if VRAM lost */
> +	}
>   
>  	if (finished->error < 0) {
>  		DRM_INFO("Skip scheduling IBs!\n");

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
