Hi Christian

Thanks for the correction earlier. I have written up my explanation below; would you
help to check it again? Thanks in advance.

The key point here is: if a job's fence has already been signaled, then calling
dma_fence_set_error() on that fence will trigger a kernel warning call trace:

static inline void dma_fence_set_error(struct dma_fence *fence,
                                       int error)
{
        WARN_ON(test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)); /* <-- here is the warning source */
        WARN_ON(error >= 0 || error < -MAX_ERRNO);

        fence->error = error;
}
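
Concretely, the warning is hit by this call order (a minimal sketch, only to
illustrate the sequence, not actual driver code):

        dma_fence_signal(&s_fence->finished);                 /* sets DMA_FENCE_FLAG_SIGNALED_BIT */
        ...
        dma_fence_set_error(&s_fence->finished, -ECANCELED);  /* WARN_ON fires: fence already signaled */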
 
 
Then, let's look at a guilty job's processing flow in TDR:

amdgpu_device_gpu_recover -> drm_sched_job_recovery -> for each job in ring_mirror_list:
        1) dma_fence_set_error(&s_fence->finished, -ECANCELED) if this job is guilty
        2) amdgpu_job_run: the guilty job is skipped and not submitted to the ring
        3) drm_sched_process_job(guilty_job) -> drm_sched_fence_finished ->
           dma_fence_signal(&fence->finished) -> drm_sched_job_finish_cb ->
           schedule_work(&job->finish_work)

Later, finish_work's callback, drm_sched_job_finish, is run by the work queue, and the
guilty job is finally deleted from ring_mirror_list. (A condensed sketch of this loop
follows below.)
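
For reference, a condensed sketch of that loop in drm_sched_job_recovery (simplified
from memory; locking and error handling omitted):

        list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
                struct drm_sched_fence *s_fence = s_job->s_fence;

                /* 1) mark the guilty job's finished fence with -ECANCELED */
                if (found_guilty && s_fence->scheduled.context == guilty_context)
                        dma_fence_set_error(&s_fence->finished, -ECANCELED);

                /* 2) ops->run_job() -> amdgpu_job_run(), which skips the IBs of a guilty job */
                fence = sched->ops->run_job(s_job);

                /* 3) a skipped job returns no HW fence, so the finished fence is
                 *    signaled directly, which ends up scheduling job->finish_work */
                if (!fence)
                        drm_sched_process_job(NULL, &s_fence->cb);
        }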
        But sometimes amdgpu_device_gpu_recover is called again immediately (from a
host FLR interrupt or from KFD). At that moment finish_work has not yet been run by
the work queue and drm_sched_job_finish has not been called for the guilty job, so the
guilty job is still on the ring_mirror_list. But remember, the guilty job's fence was
already signaled (by drm_sched_fence_finished) in the previous pass, so when the TDR
process then calls dma_fence_set_error on this guilty job again, it causes the kernel
warning call trace.
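
Put as a simplified timeline of the race (illustration only):

        TDR #1: dma_fence_set_error(finished, -ECANCELED)   -> OK, fence not signaled yet
        TDR #1: dma_fence_signal(finished)                   -> DMA_FENCE_FLAG_SIGNALED_BIT set
                schedule_work(&job->finish_work)             -> queued, but not executed yet
        TDR #2 (host FLR / KFD): guilty job is still on ring_mirror_list
        TDR #2: dma_fence_set_error(finished, -ECANCELED)    -> WARN_ON(SIGNALED_BIT) fires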
 
So the root cause is that each job's finish_work is not scheduled quickly enough in
the GPU TDR scenario.
        Three ideas:
                1) Instead of using the system global work queue, we could create a
dedicated work queue to handle the job-finish work, but that still can't guarantee
that finish_work has executed before the next GPU TDR.
                2) Wait for all the guilty jobs' finish_work to finish executing
before returning from drm_sched_job_recovery, but I think that wastes too much time.
                3) Delete the guilty job from the ring_mirror_list directly after it
is processed in amdgpu_job_run(). amdgpu_job_run is the final entry point for a job to
be submitted to the ring; if something is wrong with the job (finished->error < 0),
then it should never be submitted again, and so it should be deleted from the recovery
list.
I choose 3). (A rough sketch of ideas 1) and 2) follows below, for comparison.)
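
For comparison, ideas 1) and 2) would look roughly like the following (a hypothetical
sketch only; finish_wq is a made-up name and none of this is being proposed):

        /* idea 1): a dedicated workqueue instead of the system one (hypothetical) */
        struct workqueue_struct *finish_wq = alloc_workqueue("drm_sched_finish", WQ_UNBOUND, 0);
        ...
        queue_work(finish_wq, &job->finish_work);       /* instead of schedule_work() */

        /* idea 2): flush every guilty job's finish_work before leaving recovery;
         * racy as written (finish_work removes and frees the job) and it blocks on
         * every work item, which is where the wasted time comes from */
        list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node)
                if (s_job->s_fence->finished.error < 0)
                        flush_work(&s_job->finish_work);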


I am going to make a new patch, as shown below, instead of the original one:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index e0af44f..aaf697f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -208,6 +208,7 @@ static struct dma_fence *amdgpu_job_dependency(struct drm_sched_job *sched_job,
 static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
 {
        struct amdgpu_ring *ring = to_amdgpu_ring(sched_job->sched);
+       struct drm_gpu_scheduler *sched = sched_job->sched;
        struct dma_fence *fence = NULL, *finished;
        struct amdgpu_job *job;
        int r;
@@ -223,7 +224,10 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
                dma_fence_set_error(finished, -ECANCELED);/* skip IB as well if VRAM lost */

        if (finished->error < 0) {
-               DRM_INFO("Skip scheduling IBs!\n");
+               DRM_INFO("Skip scheduling IBs and delete job from recovery list!\n");
+               spin_lock(&sched->job_list_lock);
+               list_del_init(&sched_job->node);
+               spin_unlock(&sched->job_list_lock);
        } else {
                r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs, job,
                                       &fence);
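
Note that list_del_init() is used here (rather than list_del()), so the node points to
itself afterwards; when drm_sched_job_finish later runs for this job and removes it
from the list again, that second deletion no longer touches the ring_mirror_list (as
you noted, deleting an already-detached node is effectively a no-op). The deletion is
also done under sched->job_list_lock, the same lock drm_sched_job_finish and
drm_sched_job_recovery take for the mirror list.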


Thanks,
Trigger

-----Original Message-----
From: Christian König <ckoenig.leichtzumer...@gmail.com> 
Sent: Friday, November 09, 2018 8:18 PM
To: Huang, Trigger <trigger.hu...@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR

On 09.11.18 at 13:10, Trigger Huang wrote:
> A bad job is the one that triggered TDR. In the recovery process, its fence
> will be signaled and, as a result, the work behind it will be scheduled to
> delete it from the mirror list. But if the TDR process is invoked
> before that work executes, then calling dma_fence_set_error on its
> fence in the TDR process will lead to a kernel warning trace:
>
> [  143.033605] WARNING: CPU: 2 PID: 53 at ./include/linux/dma-fence.h:437 amddrm_sched_job_recovery+0x1af/0x1c0 [amd_sched]
> kernel: [  143.033606] Modules linked in: amdgpu(OE) amdchash(OE) amdttm(OE) amd_sched(OE) amdkcl(OE) amd_iommu_v2 drm_kms_helper drm i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 snd_hda_codec_generic crypto_simd glue_helper cryptd snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq joydev snd_seq_device snd_timer snd soundcore binfmt_misc input_leds mac_hid serio_raw nfsd auth_rpcgss nfs_acl lockd grace sunrpc sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 8139too floppy psmouse 8139cp mii i2c_piix4 pata_acpi
> [  143.033649] CPU: 2 PID: 53 Comm: kworker/2:1 Tainted: G           OE    4.15.0-20-generic #21-Ubuntu
> [  143.033650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [  143.033653] Workqueue: events drm_sched_job_timedout [amd_sched]
> [  143.033656] RIP: 0010:amddrm_sched_job_recovery+0x1af/0x1c0 [amd_sched]
> [  143.033657] RSP: 0018:ffffa9f880fe7d48 EFLAGS: 00010202
> [  143.033659] RAX: 0000000000000007 RBX: ffff9b98f2b24c00 RCX: ffff9b98efef4f08
> [  143.033660] RDX: ffff9b98f2b27400 RSI: ffff9b98f2b24c50 RDI: ffff9b98efef4f18
> [  143.033660] RBP: ffffa9f880fe7d98 R08: 0000000000000001 R09: 00000000000002b6
> [  143.033661] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9b98efef3430
> [  143.033662] R13: ffff9b98efef4d80 R14: ffff9b98efef4e98 R15: ffff9b98eaf91c00
> [  143.033663] FS:  0000000000000000(0000) GS:ffff9b98ffd00000(0000) knlGS:0000000000000000
> [  143.033664] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  143.033665] CR2: 00007fc49c96d470 CR3: 000000001400a005 CR4: 00000000003606e0
> [  143.033669] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  143.033669] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  143.033670] Call Trace:
> [  143.033744]  amdgpu_device_gpu_recover+0x144/0x820 [amdgpu]
> [  143.033788]  amdgpu_job_timedout+0x9b/0xa0 [amdgpu]
> [  143.033791]  drm_sched_job_timedout+0xcc/0x150 [amd_sched]
> [  143.033795]  process_one_work+0x1de/0x410
> [  143.033797]  worker_thread+0x32/0x410
> [  143.033799]  kthread+0x121/0x140
> [  143.033801]  ? process_one_work+0x410/0x410
> [  143.033803]  ? kthread_create_worker_on_cpu+0x70/0x70
> [  143.033806]  ret_from_fork+0x35/0x40
>
> So just delete the bad job from the mirror list directly after signaling
> its fence.
>
> Signed-off-by: Trigger Huang <trigger.hu...@amd.com>
> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 18 +++++++++++++++++-
>   1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 18ebbb0..b2832fb 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -228,7 +228,10 @@ static void drm_sched_job_finish(struct work_struct *work)
>   
>       spin_lock(&sched->job_list_lock);
>       /* remove job from ring_mirror_list */
> -     list_del(&s_job->node);
> +     /* This node may already have been deleted in job recovery */
> +     if (s_job->node.next != &(s_job->node))
> +             list_del_init(&s_job->node);
> +

Well that doesn't make any sense at all. If the list_head is already empty 
(pointing to itself) list_del is a no-op.

>       /* queue TDR for next job */
>       drm_sched_start_timeout(sched);
>       spin_unlock(&sched->job_list_lock);
> @@ -394,6 +397,19 @@ void drm_sched_job_recovery(struct drm_gpu_scheduler *sched)
>                       drm_sched_process_job(NULL, &s_fence->cb);
>               }
>               spin_lock(&sched->job_list_lock);
> +             /**
> +              * Normally the 'bad' job will be deleted by its finish_work
> +              * after signaling its fence, but if a GPU recovery is
> +              * invoked before that work has executed (for example,
> +              * KFD/host triggered a GPU reset while the current one is
> +              * ongoing), then the 'bad' job may be processed again,
> +              * which is definitely not necessary, and will also cause a lot
> +              * of warning call traces when 'dma_fence_set_error' is called
> +              * on this job because it has already been signaled
> +              */
> +             if ((s_fence->finished.error < 0)
> +                     && (s_job->node.next != &(s_job->node)))
> +                     list_del_init(&s_job->node);

That strongly looks incorrect as well.

Can you describe more specifically what you are trying to do here?

Thanks,
Christian.

>       }
>       drm_sched_start_timeout(sched);
>       spin_unlock(&sched->job_list_lock);
