They are still not pushed - I am fixing review comments. Andrey
On 12/26/2018 01:38 AM, Lou, Wentao wrote: > Hi Andrey, > > In amd-staging-dkms-4.18's merged list, I can't find 'drm/sched: Refactor > ring mirror list handling', neither 'drm/sched: Rework HW fence processing'. > Now there was still much Call-Trace in new osdb triggered in > dma_fence_set_error. Do you have link for these patches? > Thanks. > > BR, > Wentao > > > -----Original Message----- > From: Grodzovsky, Andrey <andrey.grodzov...@amd.com> > Sent: Saturday, December 22, 2018 12:57 AM > To: Lou, Wentao <wentao....@amd.com>; amd-gfx@lists.freedesktop.org > Subject: Re: [PATCH] drm/amdgpu: dma_fence finished signaled by unexpected > callback > > I believe this issue would be resolved by my pending in review patch set, > specifically 'drm/sched: Refactor ring mirror list handling.' since already > on the first TO handler it will go over all the rings including the second > timed out ring and will remove all call backs including the bad job cb. In > case by this time this bad job will signal for some reason it will be removed > from the mirror list already during drm_sched_process_job (take a look at > 'drm/sched: Rework HW fence > processing.') and hence will not be rerun in drm_sched_job_recovery > (drm_sched_resubmit_jobs under the new name). > > Andrey > > > On 12/21/2018 03:25 AM, wentalou wrote: >> When 2 rings met timeout at same time, triggered job_timedout separately. >> Each job_timedout called gpu_recover, but one of gpu_recover locked by >> another's mutex_lock. >> Bad jod’s callback should be removed by dma_fence_remove_callback but locked >> inside mutex_lock. >> So dma_fence_remove_callback could not be called immediately. >> Then callback drm_sched_process_job triggered unexpectedly, and signaled >> DMA_FENCE_FLAG_SIGNALED_BIT. >> After another's mutex_unlock, signaled bad job went through job_run inside >> drm_sched_job_recovery. >> job_run would have WARN_ON and Call-Trace, when calling >> kcl_dma_fence_set_error for signaled bad job. >> >> Change-Id: I6366add13f020476882b2b8b03330a58d072dd1a >> Signed-off-by: Wentao Lou <wentao....@amd.com> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 5 ++++- >> 1 file changed, 4 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >> index 0a17fb1..fc1d3a0 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >> @@ -225,8 +225,11 @@ static struct dma_fence *amdgpu_job_run(struct >> drm_sched_job *sched_job) >> >> trace_amdgpu_sched_run_job(job); >> >> - if (job->vram_lost_counter != >> atomic_read(&ring->adev->vram_lost_counter)) >> + if (job->vram_lost_counter != >> atomic_read(&ring->adev->vram_lost_counter)) { >> + /* flags might be signaled by unexpected callback, clear it */ >> + test_and_clear_bit(DMA_FENCE_FLAG_SIGNALED_BIT, >> &finished->flags); >> dma_fence_set_error(finished, -ECANCELED);/* skip IB as well if >> VRAM lost */ >> + } >> >> if (finished->error < 0) { >> DRM_INFO("Skip scheduling IBs!\n"); _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx