RE: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR

2018-11-14 Thread Huang, Trigger
Hi guys,

I have modified the patch as V2 in the other thread; would you help to review it?

-Original Message-
From: amd-gfx  On Behalf Of Koenig, 
Christian
Sent: Monday, November 12, 2018 6:12 PM
To: Liu, Monk ; Huang, Trigger ; 
amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR

Hi guys,

yeah that is the same problem I was trying to fix with "drm/sched: fix timeout handling v2".

But this case is even more complicated because here it is the hypervisor or KFD 
which triggers the second reset.

Going to take another look at this today, Christian.

Am 12.11.18 um 10:19 schrieb Liu, Monk:
> Hi Christian
>
> Trigger's patch intends to fix the double signaling on dma_fence in the
> following scenario:
>
> 1) gpu_recovery(job = some job) is invoked by the guest TDR ==> the hang
> job's fence is set to -ECANCELED and fake signaled.
> 2) gpu_recovery(job = NULL) is invoked again by the hypervisor or KFD, but
> the job from the step above has not finished its "drm_sched_job_finish"
> routine yet, which is kicked off by queue_work(). Because the second
> gpu_recovery() is called before "drm_sched_job_finish", the job has not
> been taken off the mirror_list, so dma_fence_set_error is called on that
> job a second time and gives you the warning message.
>
> /Monk
>
> -Original Message-
> From: amd-gfx  On Behalf Of 
> Huang, Trigger
> Sent: Monday, November 12, 2018 4:36 PM
> To: Koenig, Christian ; 
> amd-gfx@lists.freedesktop.org
> Subject: RE: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR
>
> Hi Christian
>
> Thanks for the correction earlier. My explanation is below; would you help
> to check it again? Thanks in advance.
>
> The key thing here is: if a job's fence is already signaled, then calling
> dma_fence_set_error on that fence will lead to a kernel warning call trace:
>
> static inline void dma_fence_set_error(struct dma_fence *fence,
>                                        int error)
> {
>         WARN_ON(test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags));   /* <-- warning source */
>         WARN_ON(error >= 0 || error < -MAX_ERRNO);
>
>         fence->error = error;
> }
>   
>   
> Then, let's look at a guilty job's processing flow in TDR:
>   amdgpu_device_gpu_recover -> drm_sched_job_recovery -> for each job in
> ring_mirror_list:
>       1) dma_fence_set_error(&s_fence->finished, -ECANCELED) if this job
> is guilty
>       2) amdgpu_job_run: the guilty job will be skipped and not submitted
> to the ring
>       3) drm_sched_process_job(guilty_job) -> drm_sched_fence_finished ->
> dma_fence_signal(&s_fence->finished) -> drm_sched_job_finish_cb ->
> schedule_work(&job->finish_work);
>       Later, finish_work's callback function, drm_sched_job_finish, will
> be called by the work queue, and the guilty job will finally be deleted
> from ring_mirror_list.
>   But sometimes amdgpu_device_gpu_recover will be called again immediately
> (from a host FLR interrupt or from KFD), and at that moment finish_work has
> not been scheduled by the work queue and drm_sched_job_finish has not been
> called for the guilty job yet, so the guilty job is still in the
> ring_mirror_list. But remember, the guilty job's fence was already signaled
> (by drm_sched_fence_finished); following the TDR process,
> dma_fence_set_error will be called again on this guilty job, which causes
> the kernel warning call trace.
>
> So the root cause is that the finish_work for each job is not scheduled
> quickly enough in the GPU TDR scenario.
>   Three ideas:
>       1) Instead of using the system global work queue, maybe we can
> create a dedicated work queue to manage the job-finish work, but that
> still cannot guarantee finish_work's execution in the GPU TDR scenario.
>       2) Wait for all the guilty jobs' finish_work to finish execution
> before returning from drm_sched_job_recovery, but I think that will waste
> too much time.
>       3) Delete the guilty job from the ring_mirror_list directly after
> the job is processed in amdgpu_job_run(). amdgpu_job_run is the final
> entry point for a job to be submitted to the ring; if something is wrong
> with this job (finished->error < 0), then it should never be submitted
> again, and so it should be deleted from the recovery list.
> I choose 3).
>
>
> I am going to make a new patch as below instead of the original one
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index e0

Re: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR

2018-11-12 Thread Koenig, Christian
Hi guys,

yeah that is the same problem I was trying to fix with "drm/sched: fix
timeout handling v2".

But this case is even more complicated because here it is the hypervisor 
or KFD which triggers the second reset.

Going to take another look at this today,
Christian.

Am 12.11.18 um 10:19 schrieb Liu, Monk:
> Hi Christian
>
> Trigger's patch intends to fix the double signaling on dma_fence in the
> following scenario:
>
> 1) gpu_recovery(job = some job) is invoked by the guest TDR ==> the hang
> job's fence is set to -ECANCELED and fake signaled.
> 2) gpu_recovery(job = NULL) is invoked again by the hypervisor or KFD, but
> the job from the step above has not finished its "drm_sched_job_finish"
> routine yet, which is kicked off by queue_work(). Because the second
> gpu_recovery() is called before "drm_sched_job_finish", the job has not
> been taken off the mirror_list, so dma_fence_set_error is called on that
> job a second time and gives you the warning message.
>
> /Monk
>
> -Original Message-
> From: amd-gfx  On Behalf Of Huang, 
> Trigger
> Sent: Monday, November 12, 2018 4:36 PM
> To: Koenig, Christian ; 
> amd-gfx@lists.freedesktop.org
> Subject: RE: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR
>
> Hi Christian
>
> Thanks for the correction earlier. My explanation is below; would you help
> to check it again? Thanks in advance.
>
> The key thing here is: if a job's fence is already signaled, then calling
> dma_fence_set_error on that fence will lead to a kernel warning call trace:
>
> static inline void dma_fence_set_error(struct dma_fence *fence,
>                                        int error)
> {
>         WARN_ON(test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags));   /* <-- warning source */
>         WARN_ON(error >= 0 || error < -MAX_ERRNO);
>
>         fence->error = error;
> }
>   
>   
> Then, let's look at a guilty job's processing flow in TDR:
>   amdgpu_device_gpu_recover -> drm_sched_job_recovery -> for each job in
> ring_mirror_list:
>       1) dma_fence_set_error(&s_fence->finished, -ECANCELED) if this job
> is guilty
>       2) amdgpu_job_run: the guilty job will be skipped and not submitted
> to the ring
>       3) drm_sched_process_job(guilty_job) -> drm_sched_fence_finished ->
> dma_fence_signal(&s_fence->finished) -> drm_sched_job_finish_cb ->
> schedule_work(&job->finish_work);
>       Later, finish_work's callback function, drm_sched_job_finish, will
> be called by the work queue, and the guilty job will finally be deleted
> from ring_mirror_list.
>   But sometimes amdgpu_device_gpu_recover will be called again immediately
> (from a host FLR interrupt or from KFD), and at that moment finish_work has
> not been scheduled by the work queue and drm_sched_job_finish has not been
> called for the guilty job yet, so the guilty job is still in the
> ring_mirror_list. But remember, the guilty job's fence was already signaled
> (by drm_sched_fence_finished); following the TDR process,
> dma_fence_set_error will be called again on this guilty job, which causes
> the kernel warning call trace.
>
> So the root cause is that the finish_work for each job is not scheduled
> quickly enough in the GPU TDR scenario.
>   Three ideas:
>       1) Instead of using the system global work queue, maybe we can
> create a dedicated work queue to manage the job-finish work, but that
> still cannot guarantee finish_work's execution in the GPU TDR scenario.
>       2) Wait for all the guilty jobs' finish_work to finish execution
> before returning from drm_sched_job_recovery, but I think that will waste
> too much time.
>       3) Delete the guilty job from the ring_mirror_list directly after
> the job is processed in amdgpu_job_run(). amdgpu_job_run is the final
> entry point for a job to be submitted to the ring; if something is wrong
> with this job (finished->error < 0), then it should never be submitted
> again, and so it should be deleted from the recovery list.
> I choose 3).
>
>
> I am going to make a new patch as below instead of the original one
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index e0af44f..aaf697f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -208,6 +208,7 @@ static struct dma_fence *amdgpu_job_dependency(struct drm_sched_job *sched_job,
>  static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
>  {

RE: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR

2018-11-12 Thread Liu, Monk
Hi Christian

Trigger's patch intends to fix the double signaling on dma_fence in the
following scenario:

1) gpu_recovery(job = some job) is invoked by the guest TDR ==> the hang job's
fence is set to -ECANCELED and fake signaled.
2) gpu_recovery(job = NULL) is invoked again by the hypervisor or KFD, but the
job from the step above has not finished its "drm_sched_job_finish" routine
yet, which is kicked off by queue_work(). Because the second gpu_recovery() is
called before "drm_sched_job_finish", the job has not been taken off the
mirror_list, so dma_fence_set_error is called on that job a second time and
gives you the warning message.
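
To make the ordering concrete, here is a simplified sketch of what each
gpu_recovery() pass does to the guilty job. This is not the actual scheduler
code; job_is_guilty() is a hypothetical placeholder for the guilty-context
check done in drm_sched_job_recovery():

/* Simplified sketch of one recovery pass (hypothetical helper, simplified
 * loop; not the real drm_sched_job_recovery() implementation).
 */
static void recovery_pass_sketch(struct drm_gpu_scheduler *sched)
{
	struct drm_sched_job *s_job;

	list_for_each_entry(s_job, &sched->ring_mirror_list, node) {
		struct drm_sched_fence *s_fence = s_job->s_fence;

		if (job_is_guilty(s_job))	/* hypothetical check */
			dma_fence_set_error(&s_fence->finished, -ECANCELED);

		/* run_job() then skips the guilty IB, and
		 * drm_sched_process_job() signals s_fence->finished and
		 * queues the job's finish_work.
		 */
	}
}

/*
 * Pass 1 (guest TDR): sets -ECANCELED, signals the fence and queues
 * finish_work, which would eventually take the job off ring_mirror_list.
 * Pass 2 (host FLR / KFD, before finish_work has run): the job is still on
 * ring_mirror_list, so dma_fence_set_error() runs again on an already
 * signaled fence and the WARN_ON fires.
 */

Nothing between the two passes guarantees that finish_work has run, which is
exactly the window Trigger's patch tries to close.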

/Monk

-Original Message-
From: amd-gfx  On Behalf Of Huang, 
Trigger
Sent: Monday, November 12, 2018 4:36 PM
To: Koenig, Christian ; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR

Hi Christian

Thanks for the correction earlier. My explanation is below; would you help to
check it again? Thanks in advance.

The key thing here is: if a job's fence is already signaled, then calling
dma_fence_set_error on that fence will lead to a kernel warning call trace:

static inline void dma_fence_set_error(struct dma_fence *fence,
				       int error)
{
	WARN_ON(test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags));   /* <-- warning source */
	WARN_ON(error >= 0 || error < -MAX_ERRNO);

	fence->error = error;
}
 
 
Then, let's look at a guilty job's processing flow in TDR:

amdgpu_device_gpu_recover -> drm_sched_job_recovery -> for each job in
ring_mirror_list:
	1) dma_fence_set_error(&s_fence->finished, -ECANCELED) if this job is guilty
	2) amdgpu_job_run: the guilty job will be skipped and not submitted to the ring
	3) drm_sched_process_job(guilty_job) -> drm_sched_fence_finished ->
dma_fence_signal(&s_fence->finished) -> drm_sched_job_finish_cb ->
schedule_work(&job->finish_work);
	Later, finish_work's callback function, drm_sched_job_finish, will be
called by the work queue, and the guilty job will finally be deleted from
ring_mirror_list.
But sometimes amdgpu_device_gpu_recover will be called again immediately (from
a host FLR interrupt or from KFD), and at that moment finish_work has not been
scheduled by the work queue and drm_sched_job_finish has not been called for
the guilty job yet, so the guilty job is still in the ring_mirror_list. But
remember, the guilty job's fence was already signaled (by
drm_sched_fence_finished); following the TDR process, dma_fence_set_error will
be called again on this guilty job, which causes the kernel warning call trace.

So the root cause is that the finish_work for each job is not scheduled
quickly enough in the GPU TDR scenario.
Three ideas:
	1) Instead of using the system global work queue, maybe we can create a
dedicated work queue to manage the job-finish work, but that still cannot
guarantee finish_work's execution in the GPU TDR scenario.
	2) Wait for all the guilty jobs' finish_work to finish execution before
returning from drm_sched_job_recovery, but I think that will waste too much
time.
	3) Delete the guilty job from the ring_mirror_list directly after the
job is processed in amdgpu_job_run(). amdgpu_job_run is the final entry point
for a job to be submitted to the ring; if something is wrong with this job
(finished->error < 0), then it should never be submitted again, and so it
should be deleted from the recovery list.
I choose 3).


I am going to make a new patch as below instead of the original one

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index e0af44f..aaf697f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -208,6 +208,7 @@ static struct dma_fence *amdgpu_job_dependency(struct drm_sched_job *sched_job,
 static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
 {
 	struct amdgpu_ring *ring = to_amdgpu_ring(sched_job->sched);
+	struct drm_gpu_scheduler *sched = sched_job->sched;
 	struct dma_fence *fence = NULL, *finished;
 	struct amdgpu_job *job;
 	int r;
@@ -223,7 +224,10 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
 	dma_fence_set_error(finished, -ECANCELED);/* skip IB as well if VRAM lost */
 
 	if (finished->error < 0) {
-		DRM_INFO("Skip scheduling IBs!\n");
+		DRM_INFO("Skip scheduling IBs and delete job from recovery

RE: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR

2018-11-12 Thread Huang, Trigger
Hi Christian

Thanks for the correction earlier. My explanation is below; would you help to
check it again? Thanks in advance.

The key thing here is: if a job's fence is already signaled, then calling
dma_fence_set_error on that fence will lead to a kernel warning call trace:

static inline void dma_fence_set_error(struct dma_fence *fence,
				       int error)
{
	WARN_ON(test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags));   /* <-- warning source */
	WARN_ON(error >= 0 || error < -MAX_ERRNO);

	fence->error = error;
}
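
In other words, the problematic ordering boils down to two calls on the same
fence. Below is a minimal sketch of that ordering (not code taken from the
driver; the wrapper function name is made up):

/* Minimal sketch of the ordering that trips the first WARN_ON above. */
static void double_recovery_ordering(struct drm_sched_fence *s_fence)
{
	dma_fence_signal(&s_fence->finished);		/* 1st pass: guest TDR      */
	dma_fence_set_error(&s_fence->finished,		/* 2nd pass: host FLR / KFD */
			    -ECANCELED);		/* fence already signaled,  */
							/* so the WARN_ON fires     */
}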
 
 
Then, let's look at a guilty job's processing flow in TDR:

amdgpu_device_gpu_recover -> drm_sched_job_recovery -> for each job in
ring_mirror_list:
	1) dma_fence_set_error(&s_fence->finished, -ECANCELED) if this job is guilty
	2) amdgpu_job_run: the guilty job will be skipped and not submitted to the ring
	3) drm_sched_process_job(guilty_job) -> drm_sched_fence_finished ->
dma_fence_signal(&s_fence->finished) -> drm_sched_job_finish_cb ->
schedule_work(&job->finish_work);
	Later, finish_work's callback function, drm_sched_job_finish, will be
called by the work queue, and the guilty job will finally be deleted from
ring_mirror_list.
But sometimes amdgpu_device_gpu_recover will be called again immediately (from
a host FLR interrupt or from KFD), and at that moment finish_work has not been
scheduled by the work queue and drm_sched_job_finish has not been called for
the guilty job yet, so the guilty job is still in the ring_mirror_list. But
remember, the guilty job's fence was already signaled (by
drm_sched_fence_finished); following the TDR process, dma_fence_set_error will
be called again on this guilty job, which causes the kernel warning call trace.

So the root cause is that the finish_work for each job is not scheduled
quickly enough in the GPU TDR scenario.
Three ideas:
	1) Instead of using the system global work queue, maybe we can create a
dedicated work queue to manage the job-finish work, but that still cannot
guarantee finish_work's execution in the GPU TDR scenario.
	2) Wait for all the guilty jobs' finish_work to finish execution before
returning from drm_sched_job_recovery, but I think that will waste too much
time.
	3) Delete the guilty job from the ring_mirror_list directly after the
job is processed in amdgpu_job_run(). amdgpu_job_run is the final entry point
for a job to be submitted to the ring; if something is wrong with this job
(finished->error < 0), then it should never be submitted again, and so it
should be deleted from the recovery list.
I choose 3).


I am going to make a new patch as below instead of the original one

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index e0af44f..aaf697f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -208,6 +208,7 @@ static struct dma_fence *amdgpu_job_dependency(struct drm_sched_job *sched_job,
 static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
 {
 	struct amdgpu_ring *ring = to_amdgpu_ring(sched_job->sched);
+	struct drm_gpu_scheduler *sched = sched_job->sched;
 	struct dma_fence *fence = NULL, *finished;
 	struct amdgpu_job *job;
 	int r;
@@ -223,7 +224,10 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
 	dma_fence_set_error(finished, -ECANCELED);/* skip IB as well if VRAM lost */
 
 	if (finished->error < 0) {
-		DRM_INFO("Skip scheduling IBs!\n");
+		DRM_INFO("Skip scheduling IBs and delete job from recovery list!\n");
+		spin_lock(&sched->job_list_lock);
+		list_del_init(&sched_job->node);
+		spin_unlock(&sched->job_list_lock);
 	} else {
 		r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs, job,
 				       &fence);


Thanks,
Trigger

-Original Message-
From: Christian König  
Sent: Friday, November 09, 2018 8:18 PM
To: Huang, Trigger ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR

Am 09.11.18 um 13:10 schrieb Trigger Huang:
> A bad job is the one that triggered TDR. In the recovery process, its fence
> will be signaled, and as a result the work behind it will be scheduled to
> delete it from the mirror list, but if the TDR process is invoked
> before that work's execution, then calling dma_fence_set_error on its
> fence in TDR proc

Re: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR

2018-11-09 Thread Christian König

Am 09.11.18 um 13:10 schrieb Trigger Huang:

A bad job is the one that triggered TDR. In the recovery process, its fence
will be signaled, and as a result the work behind it will be scheduled
to delete it from the mirror list. But if the TDR process is invoked
before that work executes, then calling dma_fence_set_error on its
fence in the TDR process will lead to a kernel warning trace:

[  143.033605] WARNING: CPU: 2 PID: 53 at ./include/linux/dma-fence.h:437 
amddrm_sched_job_recovery+0x1af/0x1c0 [amd_sched]
kernel: [  143.033606] Modules linked in: amdgpu(OE) amdchash(OE) amdttm(OE) 
amd_sched(OE) amdkcl(OE) amd_iommu_v2 drm_kms_helper drm i2c_algo_bit 
fb_sys_fops syscopyarea sysfillrect sysimgblt kvm_intel kvm irqbypass 
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 
snd_hda_codec_generic crypto_simd glue_helper cryptd snd_hda_intel 
snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event 
snd_rawmidi snd_seq joydev snd_seq_device snd_timer snd soundcore binfmt_misc 
input_leds mac_hid serio_raw nfsd auth_rpcgss nfs_acl lockd grace sunrpc 
sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 8139too 
floppy psmouse 8139cp mii i2c_piix4 pata_acpi
[  143.033649] CPU: 2 PID: 53 Comm: kworker/2:1 Tainted: G   OE
4.15.0-20-generic #21-Ubuntu
[  143.033650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  143.033653] Workqueue: events drm_sched_job_timedout [amd_sched]
[  143.033656] RIP: 0010:amddrm_sched_job_recovery+0x1af/0x1c0 [amd_sched]
[  143.033657] RSP: 0018:a9f880fe7d48 EFLAGS: 00010202
[  143.033659] RAX: 0007 RBX: 9b98f2b24c00 RCX: 9b98efef4f08
[  143.033660] RDX: 9b98f2b27400 RSI: 9b98f2b24c50 RDI: 9b98efef4f18
[  143.033660] RBP: a9f880fe7d98 R08: 0001 R09: 02b6
[  143.033661] R10:  R11:  R12: 9b98efef3430
[  143.033662] R13: 9b98efef4d80 R14: 9b98efef4e98 R15: 9b98eaf91c00
[  143.033663] FS:  () GS:9b98ffd0() 
knlGS:
[  143.033664] CS:  0010 DS:  ES:  CR0: 80050033
[  143.033665] CR2: 7fc49c96d470 CR3: 1400a005 CR4: 003606e0
[  143.033669] DR0:  DR1:  DR2: 
[  143.033669] DR3:  DR6: fffe0ff0 DR7: 0400
[  143.033670] Call Trace:
[  143.033744]  amdgpu_device_gpu_recover+0x144/0x820 [amdgpu]
[  143.033788]  amdgpu_job_timedout+0x9b/0xa0 [amdgpu]
[  143.033791]  drm_sched_job_timedout+0xcc/0x150 [amd_sched]
[  143.033795]  process_one_work+0x1de/0x410
[  143.033797]  worker_thread+0x32/0x410
[  143.033799]  kthread+0x121/0x140
[  143.033801]  ? process_one_work+0x410/0x410
[  143.033803]  ? kthread_create_worker_on_cpu+0x70/0x70
[  143.033806]  ret_from_fork+0x35/0x40

So just delete the bad job from the mirror list directly after signaling its fence.

Signed-off-by: Trigger Huang 
---
  drivers/gpu/drm/scheduler/sched_main.c | 18 +++++++++++++++++-
  1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 18ebbb0..b2832fb 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -228,7 +228,10 @@ static void drm_sched_job_finish(struct work_struct *work)
 
 	spin_lock(&sched->job_list_lock);
 	/* remove job from ring_mirror_list */
-	list_del(&s_job->node);
+	/* This node may have already been deleted in job recovery */
+	if (s_job->node.next != &(s_job->node))
+		list_del_init(&s_job->node);
+


Well that doesn't make any sense at all. If the list_head is already 
empty (pointing to itself) list_del is a no-op.
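
For readers following along, here is a minimal userspace sketch (simplified
re-definitions for illustration, not the kernel's <linux/list.h>) of the list
semantics being discussed: after list_del_init() a node points back to itself,
so the open-coded test in the patch is simply the inverse of list_empty().

#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the kernel list API, for illustration only. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *n) { n->next = n; n->prev = n; }

static void list_add(struct list_head *n, struct list_head *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

static void list_del_init(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	INIT_LIST_HEAD(n);		/* deleted node points back to itself */
}

static bool list_empty(const struct list_head *h) { return h->next == h; }

int main(void)
{
	struct list_head mirror_list, job;

	INIT_LIST_HEAD(&mirror_list);
	INIT_LIST_HEAD(&job);
	list_add(&job, &mirror_list);

	list_del_init(&job);		/* e.g. done by the recovery path */

	/* The patch's open-coded check is the inverse of list_empty(). */
	assert((job.next != &job) == !list_empty(&job));
	assert(list_empty(&job));	/* node points to itself after deletion */
	return 0;
}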



 	/* queue TDR for next job */
 	drm_sched_start_timeout(sched);
 	spin_unlock(&sched->job_list_lock);
@@ -394,6 +397,19 @@ void drm_sched_job_recovery(struct drm_gpu_scheduler *sched)
 			drm_sched_process_job(NULL, &s_fence->cb);
 		}
 		spin_lock(&sched->job_list_lock);
+		/**
+		 * Normally the 'bad' job will be deleted by its finish_work
+		 * after its fence is signaled, but sometimes, if a GPU
+		 * recovery is invoked before that work has finished executing
+		 * (for example, KFD/host triggered a GPU reset while the
+		 * current one is on-going), then the 'bad' job may be
+		 * processed again, which is definitely not necessary and will
+		 * also cause a lot of warning call traces when
+		 * 'dma_fence_set_error' is applied to this job, because it
+		 * has already been signaled.
+		 */
+		if ((s_fence->finished.error < 0)
+		    && (s_job->node.next != &(s_job->node)))
+			list_del_init(&s_job->node);


That strongly