Am 09.11.18 um 13:10 schrieb Trigger Huang:
A bad job is the one that triggered the TDR. During the recovery process, its fence
will be signaled, and as a result the finish work will be scheduled
to delete it from the mirror list. But if the TDR process is invoked
before that work gets to run, then the dma_fence_set_error() call on its
fence in the TDR process will trigger a kernel warning trace:

[  143.033605] WARNING: CPU: 2 PID: 53 at ./include/linux/dma-fence.h:437 
amddrm_sched_job_recovery+0x1af/0x1c0 [amd_sched]
kernel: [  143.033606] Modules linked in: amdgpu(OE) amdchash(OE) amdttm(OE) 
amd_sched(OE) amdkcl(OE) amd_iommu_v2 drm_kms_helper drm i2c_algo_bit 
fb_sys_fops syscopyarea sysfillrect sysimgblt kvm_intel kvm irqbypass 
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 
snd_hda_codec_generic crypto_simd glue_helper cryptd snd_hda_intel 
snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event 
snd_rawmidi snd_seq joydev snd_seq_device snd_timer snd soundcore binfmt_misc 
input_leds mac_hid serio_raw nfsd auth_rpcgss nfs_acl lockd grace sunrpc 
sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 8139too 
floppy psmouse 8139cp mii i2c_piix4 pata_acpi
[  143.033649] CPU: 2 PID: 53 Comm: kworker/2:1 Tainted: G           OE    
4.15.0-20-generic #21-Ubuntu
[  143.033650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  143.033653] Workqueue: events drm_sched_job_timedout [amd_sched]
[  143.033656] RIP: 0010:amddrm_sched_job_recovery+0x1af/0x1c0 [amd_sched]
[  143.033657] RSP: 0018:ffffa9f880fe7d48 EFLAGS: 00010202
[  143.033659] RAX: 0000000000000007 RBX: ffff9b98f2b24c00 RCX: ffff9b98efef4f08
[  143.033660] RDX: ffff9b98f2b27400 RSI: ffff9b98f2b24c50 RDI: ffff9b98efef4f18
[  143.033660] RBP: ffffa9f880fe7d98 R08: 0000000000000001 R09: 00000000000002b6
[  143.033661] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9b98efef3430
[  143.033662] R13: ffff9b98efef4d80 R14: ffff9b98efef4e98 R15: ffff9b98eaf91c00
[  143.033663] FS:  0000000000000000(0000) GS:ffff9b98ffd00000(0000) 
knlGS:0000000000000000
[  143.033664] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  143.033665] CR2: 00007fc49c96d470 CR3: 000000001400a005 CR4: 00000000003606e0
[  143.033669] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  143.033669] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  143.033670] Call Trace:
[  143.033744]  amdgpu_device_gpu_recover+0x144/0x820 [amdgpu]
[  143.033788]  amdgpu_job_timedout+0x9b/0xa0 [amdgpu]
[  143.033791]  drm_sched_job_timedout+0xcc/0x150 [amd_sched]
[  143.033795]  process_one_work+0x1de/0x410
[  143.033797]  worker_thread+0x32/0x410
[  143.033799]  kthread+0x121/0x140
[  143.033801]  ? process_one_work+0x410/0x410
[  143.033803]  ? kthread_create_worker_on_cpu+0x70/0x70
[  143.033806]  ret_from_fork+0x35/0x40

So just delete the bad job from the mirror list directly after signaling its fence.

Signed-off-by: Trigger Huang <[email protected]>
---
  drivers/gpu/drm/scheduler/sched_main.c | 18 +++++++++++++++++-
  1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 18ebbb0..b2832fb 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -228,7 +228,10 @@ static void drm_sched_job_finish(struct work_struct *work)
spin_lock(&sched->job_list_lock);
        /* remove job from ring_mirror_list */
-       list_del(&s_job->node);
+       /* This node may already have been deleted in job recovery */
+       if (s_job->node.next != &(s_job->node))
+               list_del_init(&s_job->node);
+

Well, that doesn't make any sense at all. If the list_head is already empty (pointing to itself), list_del_init is a no-op anyway, so the extra check is superfluous.

        /* queue TDR for next job */
        drm_sched_start_timeout(sched);
        spin_unlock(&sched->job_list_lock);
@@ -394,6 +397,19 @@ void drm_sched_job_recovery(struct drm_gpu_scheduler *sched)
                        drm_sched_process_job(NULL, &s_fence->cb);
                }
                spin_lock(&sched->job_list_lock);
+               /*
+                * Normally the 'bad' job is deleted by its finish_work
+                * after its fence is signaled, but if a GPU recovery is
+                * invoked before that work has executed (for example,
+                * KFD/host triggered a GPU reset while the current one
+                * is still on-going), then the 'bad' job may be processed
+                * again. That is unnecessary, and it also triggers warning
+                * call traces when dma_fence_set_error() is called on a
+                * fence that has already been signaled.
+                */
+               if ((s_fence->finished.error < 0)
+                       && (s_job->node.next != &(s_job->node)))
+                       list_del_init(&s_job->node);

That looks strongly incorrect as well.

Can you describe more specifically what you are trying to do here?

Thanks,
Christian.

        }
        drm_sched_start_timeout(sched);
        spin_unlock(&sched->job_list_lock);

_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx