Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

2022-09-20 Thread Christian König

Hi Victor,

Yes, that behavior is mandatory. We can't rely on all engines always 
sending valid fence numbers.


Regards,
Christian.

On 20.09.22 at 12:22, Zhao, Victor wrote:

[AMD Official Use Only - General]

Hi @Koenig, Christian,

About the change 'drm/amdgpu: sanitize fence numbers', do you know if this vce 
issue still exists? Can we change fence process back?

Hi @Grodzovsky, Andrey,

So it looks like closing the irq is possibly the most appropriate fix for this
issue for now? Please help review this part.


Thanks,
Victor



-Original Message-
From: Grodzovsky, Andrey 
Sent: Monday, September 19, 2022 11:04 PM
To: Zhao, Victor ; Koenig, Christian 
; amd-gfx@lists.freedesktop.org
Cc: Deng, Emily 
Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

I don't know if the issue still exists, but it's worth checking with Christian, 
who wrote this patch.

Andrey


On 2022-09-16 23:31, Zhao, Victor wrote:

[AMD Official Use Only - General]

Hi Andrey,

Yes, moving the irq disable can fix the issue. The change in amdgpu_fence_process 
is just meant to make sure the driver can correct itself from an overflow 
situation. I didn't know about the previous issue there.

Do you know if the issue still exists? Or is it on VCE only?


Thanks,
Victor



-Original Message-
From: Grodzovsky, Andrey 
Sent: Friday, September 16, 2022 9:50 PM
To: Koenig, Christian ; Zhao, Victor
; amd-gfx@lists.freedesktop.org
Cc: Deng, Emily 
Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow


On 2022-09-16 01:18, Christian König wrote:

On 15.09.22 at 22:37, Andrey Grodzovsky wrote:

On 2022-09-15 15:26, Christian König wrote:

On 15.09.22 at 20:29, Andrey Grodzovsky wrote:

On 2022-09-15 06:09, Zhao, Victor wrote:

[AMD Official Use Only - General]

Hi Christian,

The test sequence is executing a compute engine hang while
running a lot of containers submitting gfx jobs. We have advanced
tdr mode and mode2 reset enabled on the driver.
When a compute hang job timeout happens, the 2 jobs on the gfx
pending list may be signaled after drm_sched_stop. So they will
not be removed from the pending list but have the
DMA_FENCE_FLAG_SIGNALED_BIT set.
At the amdgpu_device_recheck_guilty_jobs step, the first job will
be rerun and removed from the pending list.
At the resubmit step, the second job (with the signaled bit) will be
resubmitted. Since it still has the signaled bit, drm_sched_job_done
will be called directly. This decreases hw_rq_count, which
allows more jobs to be emitted, but it does not clean the fence_drv rcu ptr.
This results in an overflow in the fence_drv. Since we use
num_fences_mask in amdgpu_fence_process, when the overflow happens
the signal of some jobs will be skipped, which results in an
infinite wait for the fence_drv rcu ptr.

So closing the irq before sched_stop could avoid signaling jobs after
drm_sched_stop. And signaling jobs one by one in fence_process
instead of using a mask will handle the overflow situation.

Another fix could be to skip submitting jobs which have already signaled
during the resubmit stage, which may look cleaner.

Please help give some advice.
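
For illustration, here is a minimal, self-contained sketch of the overflow
described above. This is not the amdgpu code; the 4-slot ring and the helper
names are made up, and the real driver sizes its slot array per
num_fences_mask. It only shows how a masked slot ring loses entries once more
fences are emitted than it has slots before processing runs:

/* fence_ring_overflow.c - build with: cc fence_ring_overflow.c */
#include <stdio.h>

#define NUM_FENCES      4                 /* power of two, like the real mask */
#define NUM_FENCES_MASK (NUM_FENCES - 1)

static int slots[NUM_FENCES];             /* 0 = empty, otherwise the stored seq */

static void emit_fence(unsigned int seq)
{
	/* Each emitted fence lands in its masked slot, clobbering what was there. */
	slots[seq & NUM_FENCES_MASK] = seq;
}

static void process_fences(unsigned int last_seq, unsigned int hw_seq)
{
	/* Walk from the last processed seq up to what the hardware reports. */
	while (last_seq != hw_seq) {
		unsigned int idx = ++last_seq & NUM_FENCES_MASK;

		if (slots[idx] == (int)last_seq) {
			printf("signaled fence %u\n", last_seq);
			slots[idx] = 0;
		} else {
			printf("fence %u lost, slot %u now holds %d\n",
			       last_seq, idx, slots[idx]);
		}
	}
}

int main(void)
{
	unsigned int seq;

	/* Emit 6 fences before any processing runs: seq 1 and 2 get clobbered. */
	for (seq = 1; seq <= 6; seq++)
		emit_fence(seq);

	process_fences(0, 6);
	return 0;
}

With 4 slots and 6 emissions before processing, fences 1 and 2 are never
signaled; in the scenario described above the corresponding rcu slot is
likewise never cleaned, which is what the next emit ends up waiting on forever.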

How about the code below instead? The real problem is that we
reuse a dma fence twice, which is not according to dma fence
design, so maybe this can help?


diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 8adeb7469f1e..033f0ae16784 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring
*ring, struct dma_fence **f, struct amd
      if (job && job->job_run_counter) {
      /* reinit seq for resubmitted jobs */
      fence->seqno = seq;
+
+   /* For resubmitted job clear the signaled bit */
+   clear_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT,
&fence->flags);
+

Upstream will pretty much kill you for that.

Re-setting a fence from a signaled to an unsignaled state is a
massive no-go.

Christian.

Is it worse than doing fence->seqno = seq; ? This is already a huge
hack, no?

No, it's equally bad. I don't think we can do either.

Christian.

And all those ugly hacks are there because we reuse a dma_fence (the hw_fence 
embedded into the job), and correct me if I am wrong, but I don't think a 
dma_fence is ever supposed to be reused.

So maybe, like Victor suggested, we should move the irq close and flush before 
sched_stop - in my opinion this should solve the issue, but Victor - why do you 
then still need the change in amdgpu_fence_process? You will not have the 
overflow situation because, by moving irq_disable before the stop, any job that 
signaled will be removed from the scheduler pending list anyway. Also note that 
this change reverts 'drm/amdgpu: sanitize fence numbers' and could reintroduce 
that bug.

Andrey



Andrey



      /* TO be inline with external fence creation and
other drivers */
    

RE: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

2022-09-20 Thread Zhao, Victor
[AMD Official Use Only - General]

Hi @Koenig, Christian,

About the change 'drm/amdgpu: sanitize fence numbers', do you know if this vce 
issue still exists? Can we change fence process back?

Hi @Grodzovsky, Andrey,

So it looks like closing the irq is possibly the most appropriate fix for this
issue for now? Please help review this part.


Thanks,
Victor



-Original Message-
From: Grodzovsky, Andrey  
Sent: Monday, September 19, 2022 11:04 PM
To: Zhao, Victor ; Koenig, Christian 
; amd-gfx@lists.freedesktop.org
Cc: Deng, Emily 
Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

I don't know if the issue still exists, but it's worth checking with Christian, 
who wrote this patch.

Andrey


On 2022-09-16 23:31, Zhao, Victor wrote:
> [AMD Official Use Only - General]
>
> Hi Andrey,
>
> Yes, moving the irq disable can fix the issue. The change in 
> amdgpu_fence_process is just meant to make sure the driver can correct itself 
> from an overflow situation. I didn't know about the previous issue there.
>
> Do you know if the issue still exists? Or is it on VCE only?
>
>
> Thanks,
> Victor
>
>
>
> -Original Message-
> From: Grodzovsky, Andrey 
> Sent: Friday, September 16, 2022 9:50 PM
> To: Koenig, Christian ; Zhao, Victor 
> ; amd-gfx@lists.freedesktop.org
> Cc: Deng, Emily 
> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>
>
> On 2022-09-16 01:18, Christian König wrote:
>> On 15.09.22 at 22:37, Andrey Grodzovsky wrote:
>>> On 2022-09-15 15:26, Christian König wrote:
>>>> On 15.09.22 at 20:29, Andrey Grodzovsky wrote:
>>>>> On 2022-09-15 06:09, Zhao, Victor wrote:
>>>>>> [AMD Official Use Only - General]
>>>>>>
>>>>>> Hi Christian,
>>>>>>
>>>>>> The test sequence is executing a compute engine hang while running 
>>>>>> a lot of containers submitting gfx jobs. We have advanced tdr mode 
>>>>>> and mode2 reset enabled on the driver.
>>>>>> When a compute hang job timeout happens, the 2 jobs on the gfx 
>>>>>> pending list may be signaled after drm_sched_stop. So they will not 
>>>>>> be removed from the pending list but have the 
>>>>>> DMA_FENCE_FLAG_SIGNALED_BIT set.
>>>>>> At the amdgpu_device_recheck_guilty_jobs step, the first job will 
>>>>>> be rerun and removed from the pending list.
>>>>>> At the resubmit step, the second job (with the signaled bit) will be 
>>>>>> resubmitted. Since it still has the signaled bit, drm_sched_job_done 
>>>>>> will be called directly. This decreases hw_rq_count, which allows 
>>>>>> more jobs to be emitted, but it does not clean the fence_drv rcu ptr.
>>>>>> This results in an overflow in the fence_drv. Since we use 
>>>>>> num_fences_mask in amdgpu_fence_process, when the overflow happens 
>>>>>> the signal of some jobs will be skipped, which results in an 
>>>>>> infinite wait for the fence_drv rcu ptr.
>>>>>>
>>>>>> So closing the irq before sched_stop could avoid signaling jobs after 
>>>>>> drm_sched_stop. And signaling jobs one by one in fence_process 
>>>>>> instead of using a mask will handle the overflow situation.
>>>>>>
>>>>>> Another fix could be to skip submitting jobs which have already 
>>>>>> signaled during the resubmit stage, which may look cleaner.
>>>>>>
>>>>>> Please help give some advice.
>>>>>
>>>>> How about the code below instead? The real problem is that we 
>>>>> reuse a dma fence twice, which is not according to dma fence 
>>>>> design, so maybe this can help?
>>>>>
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> index 8adeb7469f1e..033f0ae16784 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> @@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring 
>>>>> *ring, struct dma_fence **f, struct amd
>>>>>      if (job && job->job_run_counter) {
>>>>>      /* reinit seq for resubmitted jobs */
>>>>>      fence->seqno = seq;
>>>>> +
>>>>> +   /* For resubmitted job clear the signaled bit */
>>>>> +   clear_bit(DMA

Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

2022-09-19 Thread Andrey Grodzovsky
I don't know if the issue still exists, but it's worth checking with Christian, 
who wrote this patch.


Andrey


On 2022-09-16 23:31, Zhao, Victor wrote:

[AMD Official Use Only - General]

Hi Andrey,

Yes, moving the irq disable can fix the issue. The change in amdgpu_fence_process 
is just meant to make sure the driver can correct itself from an overflow 
situation. I didn't know about the previous issue there.

Do you know if the issue still exists? Or is it on VCE only?


Thanks,
Victor



-Original Message-
From: Grodzovsky, Andrey 
Sent: Friday, September 16, 2022 9:50 PM
To: Koenig, Christian ; Zhao, Victor 
; amd-gfx@lists.freedesktop.org
Cc: Deng, Emily 
Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow


On 2022-09-16 01:18, Christian König wrote:

On 15.09.22 at 22:37, Andrey Grodzovsky wrote:

On 2022-09-15 15:26, Christian König wrote:

On 15.09.22 at 20:29, Andrey Grodzovsky wrote:

On 2022-09-15 06:09, Zhao, Victor wrote:

[AMD Official Use Only - General]

Hi Christian,

The test sequence is executing a compute engine hang while running
a lot of containers submitting gfx jobs. We have advanced tdr mode
and mode2 reset enabled on the driver.
When a compute hang job timeout happens, the 2 jobs on the gfx
pending list may be signaled after drm_sched_stop. So they will not
be removed from the pending list but have the
DMA_FENCE_FLAG_SIGNALED_BIT set.
At the amdgpu_device_recheck_guilty_jobs step, the first job will
be rerun and removed from the pending list.
At the resubmit step, the second job (with the signaled bit) will be
resubmitted. Since it still has the signaled bit, drm_sched_job_done
will be called directly. This decreases hw_rq_count, which
allows more jobs to be emitted, but it does not clean the fence_drv rcu ptr.
This results in an overflow in the fence_drv. Since we use
num_fences_mask in amdgpu_fence_process, when the overflow happens
the signal of some jobs will be skipped, which results in an infinite
wait for the fence_drv rcu ptr.

So closing the irq before sched_stop could avoid signaling jobs after
drm_sched_stop. And signaling jobs one by one in fence_process instead
of using a mask will handle the overflow situation.

Another fix could be to skip submitting jobs which have already signaled
during the resubmit stage, which may look cleaner.

Please help give some advice.


How about the code below instead? The real problem is that we
reuse a dma fence twice, which is not according to dma fence design,
so maybe this can help?


diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 8adeb7469f1e..033f0ae16784 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring
*ring, struct dma_fence **f, struct amd
     if (job && job->job_run_counter) {
     /* reinit seq for resubmitted jobs */
     fence->seqno = seq;
+
+   /* For resubmitted job clear the signaled bit */
+   clear_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT,
&fence->flags);
+

Upstream will pretty much kill you for that.

Re-setting a fence from a signaled to an unsignaled state is a
massive no-go.

Christian.


Is it worse than doing fence->seqno = seq; ? This is already a huge
hack, no?

No, it's equally bad. I don't think we can do either.

Christian.


And all those ugly hacks are there because we reuse a dma_fence (the hw_fence 
embedded into the job), and correct me if I am wrong, but I don't think a 
dma_fence is ever supposed to be reused.

So maybe, like Victor suggested, we should move the irq close and flush before 
sched_stop - in my opinion this should solve the issue, but Victor - why do you 
then still need the change in amdgpu_fence_process? You will not have the 
overflow situation because, by moving irq_disable before the stop, any job that 
signaled will be removed from the scheduler pending list anyway. Also note that 
this change reverts 'drm/amdgpu: sanitize fence numbers' and could reintroduce 
that bug.

Andrey



Andrey



     /* TO be inline with external fence creation and
other drivers */
     dma_fence_get(fence);
     } else {


Andrey




Thanks,
Victor



-Original Message-
From: Koenig, Christian 
Sent: Thursday, September 15, 2022 2:32 PM
To: Zhao, Victor ;
amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey

Cc: Deng, Emily 
Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by
overflow



On 15.09.22 at 06:02, Zhao, Victor wrote:

[AMD Official Use Only - General]

Ping.

Hi @Koenig, Christian and @Grodzovsky, Andrey,

We found some reset related issues during stress test on the
sequence. Please help give some comments.


Thanks,
Victor



-Original Message-
From: Victor Zhao 
Sent: Wednesday, September 14, 2022 6:10 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deng, Emily ; Grodzovsky, Andrey
; Zhao, Victor 
Subject: [PATCH 1/2] drm/amdgpu: fix deadlo

RE: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

2022-09-16 Thread Zhao, Victor
[AMD Official Use Only - General]

Hi Andrey,

Yes, moving the irq disable can fix the issue. The change in amdgpu_fence_process 
is just meant to make sure the driver can correct itself from an overflow 
situation. I didn't know about the previous issue there.

Do you know if the issue still exists? Or is it on VCE only?


Thanks,
Victor



-Original Message-
From: Grodzovsky, Andrey  
Sent: Friday, September 16, 2022 9:50 PM
To: Koenig, Christian ; Zhao, Victor 
; amd-gfx@lists.freedesktop.org
Cc: Deng, Emily 
Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow


On 2022-09-16 01:18, Christian König wrote:
> On 15.09.22 at 22:37, Andrey Grodzovsky wrote:
>>
>> On 2022-09-15 15:26, Christian König wrote:
>>> On 15.09.22 at 20:29, Andrey Grodzovsky wrote:
>>>>
>>>> On 2022-09-15 06:09, Zhao, Victor wrote:
>>>>> [AMD Official Use Only - General]
>>>>>
>>>>> Hi Christian,
>>>>>
>>>>> The test sequence is executing a compute engine hang while running 
>>>>> a lot of containers submitting gfx jobs. We have advanced tdr mode 
>>>>> and mode2 reset enabled on the driver.
>>>>> When a compute hang job timeout happens, the 2 jobs on the gfx 
>>>>> pending list may be signaled after drm_sched_stop. So they will not 
>>>>> be removed from the pending list but have the 
>>>>> DMA_FENCE_FLAG_SIGNALED_BIT set.
>>>>> At the amdgpu_device_recheck_guilty_jobs step, the first job will 
>>>>> be rerun and removed from the pending list.
>>>>> At the resubmit step, the second job (with the signaled bit) will be 
>>>>> resubmitted. Since it still has the signaled bit, drm_sched_job_done 
>>>>> will be called directly. This decreases hw_rq_count, which allows 
>>>>> more jobs to be emitted, but it does not clean the fence_drv rcu ptr.
>>>>> This results in an overflow in the fence_drv. Since we use 
>>>>> num_fences_mask in amdgpu_fence_process, when the overflow happens 
>>>>> the signal of some jobs will be skipped, which results in an infinite 
>>>>> wait for the fence_drv rcu ptr.
>>>>>
>>>>> So closing the irq before sched_stop could avoid signaling jobs after 
>>>>> drm_sched_stop. And signaling jobs one by one in fence_process instead 
>>>>> of using a mask will handle the overflow situation.
>>>>>
>>>>> Another fix could be to skip submitting jobs which have already 
>>>>> signaled during the resubmit stage, which may look cleaner.
>>>>>
>>>>> Please help give some advice.
>>>>
>>>>
>>>> How about the code below instead? The real problem is that we 
>>>> reuse a dma fence twice, which is not according to dma fence design, 
>>>> so maybe this can help?
>>>>
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> index 8adeb7469f1e..033f0ae16784 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> @@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring 
>>>> *ring, struct dma_fence **f, struct amd
>>>>     if (job && job->job_run_counter) {
>>>>     /* reinit seq for resubmitted jobs */
>>>>     fence->seqno = seq;
>>>> +
>>>> +   /* For resubmitted job clear the signaled bit */
>>>> +   clear_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT,
>>>> &fence->flags);
>>>> +
>>>
>>> Upstream will pretty much kill you for that.
>>>
>>> Re-setting a fence from a signaled to an unsignaled state is a 
>>> massive no-go.
>>>
>>> Christian.
>>
>>
>> Is it worse than doing fence->seqno = seq; ? This is already a huge 
>> hack, no?
>
> No, it's equally bad. I don't think we can do either.
>
> Christian.


And all those ugly hacks are there because we reuse a dma_fence (the hw_fence 
embedded into the job), and correct me if I am wrong, but I don't think a 
dma_fence is ever supposed to be reused.

So maybe, like Victor suggested, we should move the irq close and flush before 
sched_stop - in my opinion this should solve the issue, but Victor - why do you 
then still need the change in amdgpu_fence_process? You will not have the 
overflow situation because, by moving irq_disable before the stop, any job that 
signaled will be removed from th

Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

2022-09-16 Thread Christian König




On 16.09.22 at 15:50, Andrey Grodzovsky wrote:


On 2022-09-16 01:18, Christian König wrote:

On 15.09.22 at 22:37, Andrey Grodzovsky wrote:


On 2022-09-15 15:26, Christian König wrote:

On 15.09.22 at 20:29, Andrey Grodzovsky wrote:


On 2022-09-15 06:09, Zhao, Victor wrote:

[AMD Official Use Only - General]

Hi Christian,

The test sequence is executing a compute engine hang while 
running a lot of containers submitting gfx jobs. We have advanced 
tdr mode and mode2 reset enabled on the driver.
When a compute hang job timeout happens, the 2 jobs on the gfx 
pending list may be signaled after drm_sched_stop. So they will 
not be removed from the pending list but have the 
DMA_FENCE_FLAG_SIGNALED_BIT set.
At the amdgpu_device_recheck_guilty_jobs step, the first job will 
be rerun and removed from the pending list.
At the resubmit step, the second job (with the signaled bit) will be 
resubmitted. Since it still has the signaled bit, drm_sched_job_done 
will be called directly. This decreases hw_rq_count, which 
allows more jobs to be emitted, but it does not clean the fence_drv rcu ptr.
This results in an overflow in the fence_drv. Since we use 
num_fences_mask in amdgpu_fence_process, when the overflow happens 
the signal of some jobs will be skipped, which results in an 
infinite wait for the fence_drv rcu ptr.


So closing the irq before sched_stop could avoid signaling jobs after 
drm_sched_stop. And signaling jobs one by one in fence_process 
instead of using a mask will handle the overflow situation.


Another fix could be to skip submitting jobs which have already 
signaled during the resubmit stage, which may look cleaner.


Please help give some advice.



How about the code below instead? The real problem is that we 
reuse a dma fence twice, which is not according to dma fence 
design, so maybe this can help?



diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

index 8adeb7469f1e..033f0ae16784 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring 
*ring, struct dma_fence **f, struct amd

    if (job && job->job_run_counter) {
    /* reinit seq for resubmitted jobs */
    fence->seqno = seq;
+
+   /* For resubmitted job clear the signaled bit */
+   clear_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT, 
&fence->flags);

+


Upstream will pretty much kill you for that.

Re-setting a fence from a signaled to an unsignaled state is a 
massive no-go.


Christian.



Is it worse than doing fence->seqno = seq; ? This is already a huge 
hack, no?


No, it's equally bad. I don't think we can do either.

Christian.



And all those ugly hacks are there because we reuse a dma_fence 
(the hw_fence embedded into the job), and correct me if I am wrong, 
but I don't think a dma_fence is ever supposed to be reused.


Exactly that, yes.

Christian.



So maybe, like Victor suggested, we should move the irq close and flush 
before sched_stop - in my opinion this should solve the issue, but 
Victor - why do you then still need the change in amdgpu_fence_process? 
You will not have the overflow situation because, by moving irq_disable 
before the stop, any job that signaled will be removed from the scheduler 
pending list anyway. Also note that this change reverts 'drm/amdgpu: 
sanitize fence numbers' and could reintroduce that bug.


Andrey






Andrey






    /* TO be inline with external fence creation and 
other drivers */

    dma_fence_get(fence);
    } else {


Andrey





Thanks,
Victor



-Original Message-
From: Koenig, Christian 
Sent: Thursday, September 15, 2022 2:32 PM
To: Zhao, Victor ; 
amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey 


Cc: Deng, Emily 
Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow



On 15.09.22 at 06:02, Zhao, Victor wrote:

[AMD Official Use Only - General]

Ping.

Hi @Koenig, Christian and @Grodzovsky, Andrey,

We found some reset related issues during stress test on the 
sequence. Please help give some comments.



Thanks,
Victor



-Original Message-
From: Victor Zhao 
Sent: Wednesday, September 14, 2022 6:10 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deng, Emily ; Grodzovsky, Andrey
; Zhao, Victor 
Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

[background]
For a gpu recovery caused by a hang on one ring (e.g. compute), 
jobs from another ring (e.g. gfx) may continue signaling during 
the drm_sched_stop stage. The signaled bit will not be cleared.


At the resubmit stage after recovery, a job with the hw fence 
signaled bit set will call job done directly instead of going 
through fence process.
This makes hw_rq_count decrease, but the rcu fence pointer is not 
cleared yet.


Then an overflow happens in the fence driver slots and some jobs 
may be skipped, leaving the rcu pointer not cleared, which makes 
an infinite wait for th

Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

2022-09-16 Thread Andrey Grodzovsky



On 2022-09-16 01:18, Christian König wrote:

On 15.09.22 at 22:37, Andrey Grodzovsky wrote:


On 2022-09-15 15:26, Christian König wrote:

On 15.09.22 at 20:29, Andrey Grodzovsky wrote:


On 2022-09-15 06:09, Zhao, Victor wrote:

[AMD Official Use Only - General]

Hi Christian,

The test sequence is executing a compute engine hang while running 
a lot of containers submitting gfx jobs. We have advanced tdr mode 
and mode2 reset enabled on the driver.
When a compute hang job timeout happens, the 2 jobs on the gfx 
pending list may be signaled after drm_sched_stop. So they will not 
be removed from the pending list but have the 
DMA_FENCE_FLAG_SIGNALED_BIT set.
At the amdgpu_device_recheck_guilty_jobs step, the first job will 
be rerun and removed from the pending list.
At the resubmit step, the second job (with the signaled bit) will be 
resubmitted. Since it still has the signaled bit, drm_sched_job_done 
will be called directly. This decreases hw_rq_count, which 
allows more jobs to be emitted, but it does not clean the fence_drv rcu ptr.
This results in an overflow in the fence_drv. Since we use 
num_fences_mask in amdgpu_fence_process, when the overflow happens 
the signal of some jobs will be skipped, which results in an infinite 
wait for the fence_drv rcu ptr.


So closing the irq before sched_stop could avoid signaling jobs after 
drm_sched_stop. And signaling jobs one by one in fence_process 
instead of using a mask will handle the overflow situation.


Another fix could be to skip submitting jobs which have already 
signaled during the resubmit stage, which may look cleaner.


Please help give some advice.



How about the code below instead? The real problem is that we 
reuse a dma fence twice, which is not according to dma fence design, 
so maybe this can help?



diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

index 8adeb7469f1e..033f0ae16784 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring 
*ring, struct dma_fence **f, struct amd

    if (job && job->job_run_counter) {
    /* reinit seq for resubmitted jobs */
    fence->seqno = seq;
+
+   /* For resubmitted job clear the signaled bit */
+   clear_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT, 
&fence->flags);

+


Upstream will pretty much kill you for that.

Re-setting a fence from a signaled to an unsignaled state is a 
massive no-go.


Christian.



Is it worse than doing fence->seqno = seq; ? This is already a huge 
hack, no?


No, it's equally bad. I don't think we can do either.

Christian.



And all those ugly hacks are there because we reuse a dma_fence (the hw_fence 
embedded into the job), and correct me if I am wrong, but I don't think a 
dma_fence is ever supposed to be reused.

So maybe, like Victor suggested, we should move the irq close and flush before 
sched_stop - in my opinion this should solve the issue, but Victor - why 
do you then still need the change in amdgpu_fence_process? You will not 
have the overflow situation because, by moving irq_disable before the stop, 
any job that signaled will be removed from the scheduler pending list 
anyway. Also note that this change reverts 'drm/amdgpu: sanitize fence 
numbers' and could reintroduce that bug.


Andrey






Andrey






    /* TO be inline with external fence creation and 
other drivers */

    dma_fence_get(fence);
    } else {


Andrey





Thanks,
Victor



-Original Message-
From: Koenig, Christian 
Sent: Thursday, September 15, 2022 2:32 PM
To: Zhao, Victor ; 
amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey 


Cc: Deng, Emily 
Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow



On 15.09.22 at 06:02, Zhao, Victor wrote:

[AMD Official Use Only - General]

Ping.

Hi @Koenig, Christian and @Grodzovsky, Andrey,

We found some reset related issues during stress test on the 
sequence. Please help give some comments.



Thanks,
Victor



-Original Message-
From: Victor Zhao 
Sent: Wednesday, September 14, 2022 6:10 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deng, Emily ; Grodzovsky, Andrey
; Zhao, Victor 
Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

[background]
For a gpu recovery caused by a hang on one ring (e.g. compute), 
jobs from another ring (e.g. gfx) may continue signaling during 
the drm_sched_stop stage. The signaled bit will not be cleared.


At the resubmit stage after recovery, a job with the hw fence 
signaled bit set will call job done directly instead of going 
through fence process.
This makes hw_rq_count decrease, but the rcu fence pointer is not 
cleared yet.


Then an overflow happens in the fence driver slots and some jobs 
may be skipped, leaving the rcu pointer not cleared, which makes 
an infinite wait for the slot on the next fence emitted.


This infinite wait causes a job timeout on the emit

Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

2022-09-15 Thread Christian König

On 15.09.22 at 22:37, Andrey Grodzovsky wrote:


On 2022-09-15 15:26, Christian König wrote:

On 15.09.22 at 20:29, Andrey Grodzovsky wrote:


On 2022-09-15 06:09, Zhao, Victor wrote:

[AMD Official Use Only - General]

Hi Christian,

The test sequence is executing a compute engine hang while running 
a lot of containers submitting gfx jobs. We have advanced tdr mode 
and mode2 reset enabled on the driver.
When a compute hang job timeout happens, the 2 jobs on the gfx 
pending list may be signaled after drm_sched_stop. So they will not 
be removed from the pending list but have the 
DMA_FENCE_FLAG_SIGNALED_BIT set.
At the amdgpu_device_recheck_guilty_jobs step, the first job will 
be rerun and removed from the pending list.
At the resubmit step, the second job (with the signaled bit) will be 
resubmitted. Since it still has the signaled bit, drm_sched_job_done 
will be called directly. This decreases hw_rq_count, which allows 
more jobs to be emitted, but it does not clean the fence_drv rcu ptr.
This results in an overflow in the fence_drv. Since we use 
num_fences_mask in amdgpu_fence_process, when the overflow happens, the 
signal of some jobs will be skipped, which results in an infinite wait 
for the fence_drv rcu ptr.


So closing the irq before sched_stop could avoid signaling jobs after 
drm_sched_stop. And signaling jobs one by one in fence_process instead 
of using a mask will handle the overflow situation.


Another fix could be to skip submitting jobs which have already signaled 
during the resubmit stage, which may look cleaner.


Please help give some advice.



How about the code below instead? The real problem is that we 
reuse a dma fence twice, which is not according to dma fence design, 
so maybe this can help?



diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

index 8adeb7469f1e..033f0ae16784 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, 
struct dma_fence **f, struct amd

    if (job && job->job_run_counter) {
    /* reinit seq for resubmitted jobs */
    fence->seqno = seq;
+
+   /* For resubmitted job clear the signaled bit */
+   clear_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT, &fence->flags);
+


Upstream will pretty much kill you for that.

Re-setting a fence from a signaled to an unsignaled state is a 
massive no-go.


Christian.



Is it worse than doing fence->seqno = seq; ? This is already a huge 
hack, no?


No, it's equally bad. I don't think we can do either.

Christian.



Andrey






    /* TO be inline with external fence creation and 
other drivers */

    dma_fence_get(fence);
    } else {


Andrey





Thanks,
Victor



-Original Message-
From: Koenig, Christian 
Sent: Thursday, September 15, 2022 2:32 PM
To: Zhao, Victor ; 
amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey 


Cc: Deng, Emily 
Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow



On 15.09.22 at 06:02, Zhao, Victor wrote:

[AMD Official Use Only - General]

Ping.

Hi @Koenig, Christian and @Grodzovsky, Andrey,

We found some reset related issues during stress test on the 
sequence. Please help give some comments.



Thanks,
Victor



-Original Message-
From: Victor Zhao 
Sent: Wednesday, September 14, 2022 6:10 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deng, Emily ; Grodzovsky, Andrey
; Zhao, Victor 
Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

[background]
For a gpu recovery caused by a hang on one ring (e.g. compute), 
jobs from another ring (e.g. gfx) may continue signaling during 
the drm_sched_stop stage. The signaled bit will not be cleared.


At the resubmit stage after recovery, a job with the hw fence 
signaled bit set will call job done directly instead of going 
through fence process.
This makes hw_rq_count decrease, but the rcu fence pointer is not 
cleared yet.


Then an overflow happens in the fence driver slots and some jobs 
may be skipped, leaving the rcu pointer not cleared, which makes 
an infinite wait for the slot on the next fence emitted.


This infinite wait causes a job timeout on the emitting job. And 
the driver will get stuck at its sched stop step because 
kthread_park cannot be done.


[how]
1. move amdgpu_fence_driver_isr_toggle earlier to close the interrupt
   before drm sched stop
2. handle all fences in fence process to avoid skips when overflow happens

Signed-off-by: Victor Zhao 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +---
   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  6 +-

   2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 943c9e750575..c0cfae52f12b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4610,8 +4610,6 @

Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

2022-09-15 Thread Andrey Grodzovsky



On 2022-09-15 15:26, Christian König wrote:

On 15.09.22 at 20:29, Andrey Grodzovsky wrote:


On 2022-09-15 06:09, Zhao, Victor wrote:

[AMD Official Use Only - General]

Hi Christian,

The test sequence is executing a compute engine hang while running a 
lot of containers submitting gfx jobs. We have advanced tdr mode and 
mode2 reset enabled on the driver.
When a compute hang job timeout happens, the 2 jobs on the gfx 
pending list may be signaled after drm_sched_stop. So they will not 
be removed from the pending list but have the 
DMA_FENCE_FLAG_SIGNALED_BIT set.
At the amdgpu_device_recheck_guilty_jobs step, the first job will be 
rerun and removed from the pending list.
At the resubmit step, the second job (with the signaled bit) will be 
resubmitted. Since it still has the signaled bit, drm_sched_job_done 
will be called directly. This decreases hw_rq_count, which allows 
more jobs to be emitted, but it does not clean the fence_drv rcu ptr.
This results in an overflow in the fence_drv. Since we use 
num_fences_mask in amdgpu_fence_process, when the overflow happens, the 
signal of some jobs will be skipped, which results in an infinite wait 
for the fence_drv rcu ptr.


So closing the irq before sched_stop could avoid signaling jobs after 
drm_sched_stop. And signaling jobs one by one in fence_process instead 
of using a mask will handle the overflow situation.


Another fix could be to skip submitting jobs which have already signaled 
during the resubmit stage, which may look cleaner.


Please help give some advice.



How about the code below instead? The real problem is that we 
reuse a dma fence twice, which is not according to dma fence design, 
so maybe this can help?



diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

index 8adeb7469f1e..033f0ae16784 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, 
struct dma_fence **f, struct amd

    if (job && job->job_run_counter) {
    /* reinit seq for resubmitted jobs */
    fence->seqno = seq;
+
+   /* For resubmitted job clear the signaled bit */
+   clear_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT, &fence->flags);
+


Upstream will pretty much kill you for that.

Re-setting a fence from a signaled to an unsignaled state is a massive 
no-go.


Christian.



Is it worse than doing fence->seqno = seq; ? This is already a huge 
hack, no?


Andrey






    /* TO be inline with external fence creation and 
other drivers */

    dma_fence_get(fence);
    } else {


Andrey





Thanks,
Victor



-Original Message-
From: Koenig, Christian 
Sent: Thursday, September 15, 2022 2:32 PM
To: Zhao, Victor ; 
amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey 


Cc: Deng, Emily 
Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow



On 15.09.22 at 06:02, Zhao, Victor wrote:

[AMD Official Use Only - General]

Ping.

Hi @Koenig, Christian and @Grodzovsky, Andrey,

We found some reset related issues during stress test on the 
sequence. Please help give some comments.



Thanks,
Victor



-Original Message-
From: Victor Zhao 
Sent: Wednesday, September 14, 2022 6:10 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deng, Emily ; Grodzovsky, Andrey
; Zhao, Victor 
Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

[background]
For a gpu recovery caused by a hang on one ring (e.g. compute), 
jobs from another ring (e.g. gfx) may continue signaling during 
the drm_sched_stop stage. The signaled bit will not be cleared.


At the resubmit stage after recovery, a job with the hw fence 
signaled bit set will call job done directly instead of going 
through fence process.
This makes hw_rq_count decrease, but the rcu fence pointer is not 
cleared yet.


Then an overflow happens in the fence driver slots and some jobs 
may be skipped, leaving the rcu pointer not cleared, which makes an 
infinite wait for the slot on the next fence emitted.


This infinite wait causes a job timeout on the emitting job. And 
the driver will get stuck at its sched stop step because kthread_park 
cannot be done.


[how]
1. move amdgpu_fence_driver_isr_toggle earlier to close the interrupt
   before drm sched stop
2. handle all fences in fence process to avoid skips when overflow happens

Signed-off-by: Victor Zhao 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +---
   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  6 +-

   2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 943c9e750575..c0cfae52f12b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct 
amdgpu_device *adev,

   amdgpu_virt

Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

2022-09-15 Thread Christian König

On 15.09.22 at 20:29, Andrey Grodzovsky wrote:


On 2022-09-15 06:09, Zhao, Victor wrote:

[AMD Official Use Only - General]

Hi Christian,

The test sequence is executing a compute engine hang while running a 
lot of containers submitting gfx jobs. We have advanced tdr mode and 
mode2 reset enabled on the driver.
When a compute hang job timeout happens, the 2 jobs on the gfx 
pending list may be signaled after drm_sched_stop. So they will not be 
removed from the pending list but have the DMA_FENCE_FLAG_SIGNALED_BIT set.
At the amdgpu_device_recheck_guilty_jobs step, the first job will be 
rerun and removed from the pending list.
At the resubmit step, the second job (with the signaled bit) will be 
resubmitted. Since it still has the signaled bit, drm_sched_job_done will 
be called directly. This decreases hw_rq_count, which allows more 
jobs to be emitted, but it does not clean the fence_drv rcu ptr.
This results in an overflow in the fence_drv. Since we use 
num_fences_mask in amdgpu_fence_process, when the overflow happens, the 
signal of some jobs will be skipped, which results in an infinite wait 
for the fence_drv rcu ptr.


So closing the irq before sched_stop could avoid signaling jobs after 
drm_sched_stop. And signaling jobs one by one in fence_process instead of 
using a mask will handle the overflow situation.


Another fix could be to skip submitting jobs which have already signaled 
during the resubmit stage, which may look cleaner.


Please help give some advice.



How about the code below instead? The real problem is that we reuse 
a dma fence twice, which is not according to dma fence design, so maybe 
this can help?



diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

index 8adeb7469f1e..033f0ae16784 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, 
struct dma_fence **f, struct amd

    if (job && job->job_run_counter) {
    /* reinit seq for resubmitted jobs */
    fence->seqno = seq;
+
+   /* For resubmitted job clear the signaled bit */
+   clear_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT, &fence->flags);
+


Upstream will pretty much kill you for that.

Re-setting a fence from a signaled to an unsignaled state is a massive 
no-go.


Christian.
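
As a userspace analogy for why this is a one-way street (pthreads, not
dma_fence code; a deliberately simplified model, since dma_fence also ties
callbacks and a timestamp to the signaled state):

/* one_shot.c - a one-shot event that gets "reused" the way the hack above would */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool signaled;

static void *waiter(void *name)
{
	pthread_mutex_lock(&lock);
	while (!signaled)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
	printf("%s saw the event signaled\n", (const char *)name);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, waiter, "waiter A");

	/* Signal once; waiter A consumes the signaled state. */
	pthread_mutex_lock(&lock);
	signaled = true;
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);
	pthread_join(a, NULL);

	/* "Reuse" the event by clearing the signaled state again. */
	pthread_mutex_lock(&lock);
	signaled = false;
	pthread_mutex_unlock(&lock);

	/* A later observer now sees as unsignaled something that was already
	 * observed as done, and blocks until someone signals a second time. */
	pthread_create(&b, NULL, waiter, "waiter B");
	sleep(1);
	printf("waiter B is still waiting on state that was already observed as done\n");
	return 0;
}

This only models one of the hazards; the dma-fence rules additionally expect
the signaled flag, the timestamp and the callback list to move forward exactly
once per fence.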



    /* TO be inline with external fence creation and other 
drivers */

    dma_fence_get(fence);
    } else {


Andrey





Thanks,
Victor



-Original Message-
From: Koenig, Christian 
Sent: Thursday, September 15, 2022 2:32 PM
To: Zhao, Victor ; 
amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey 


Cc: Deng, Emily 
Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow



On 15.09.22 at 06:02, Zhao, Victor wrote:

[AMD Official Use Only - General]

Ping.

Hi @Koenig, Christian and @Grodzovsky, Andrey,

We found some reset related issues during stress test on the 
sequence. Please help give some comments.



Thanks,
Victor



-Original Message-
From: Victor Zhao 
Sent: Wednesday, September 14, 2022 6:10 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deng, Emily ; Grodzovsky, Andrey
; Zhao, Victor 
Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

[background]
For a gpu recovery caused by a hang on one ring (e.g. compute), 
jobs from another ring (e.g. gfx) may continue signaling during 
the drm_sched_stop stage. The signaled bit will not be cleared.


At the resubmit stage after recovery, a job with the hw fence 
signaled bit set will call job done directly instead of going 
through fence process.
This makes hw_rq_count decrease, but the rcu fence pointer is not 
cleared yet.


Then an overflow happens in the fence driver slots and some jobs 
may be skipped, leaving the rcu pointer not cleared, which makes 
an infinite wait for the slot on the next fence emitted.


This infinite wait causes a job timeout on the emitting job. And 
the driver will get stuck at its sched stop step because kthread_park 
cannot be done.


[how]
1. move amdgpu_fence_driver_isr_toggle earlier to close the interrupt
   before drm sched stop
2. handle all fences in fence process to avoid skips when overflow happens

Signed-off-by: Victor Zhao 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +---
   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  6 +-
   2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 943c9e750575..c0cfae52f12b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 		amdgpu_virt_fini_data_exchange(adev);
 	}
 
-	amdgpu_fence_driver_isr_toggle(adev, true);
-
 	/* block all schedulers and reset given job's ring */
 	for (i = 0;

Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

2022-09-15 Thread Andrey Grodzovsky

Had a typo - see below

On 2022-09-15 14:29, Andrey Grodzovsky wrote:


On 2022-09-15 06:09, Zhao, Victor wrote:

[AMD Official Use Only - General]

Hi Christian,

The test sequence is executing a compute engine hang while running a 
lot of containers submitting gfx jobs. We have advanced tdr mode and 
mode2 reset enabled on the driver.
When a compute hang job timeout happens, the 2 jobs on the gfx 
pending list may be signaled after drm_sched_stop. So they will not be 
removed from the pending list but have the DMA_FENCE_FLAG_SIGNALED_BIT set.
At the amdgpu_device_recheck_guilty_jobs step, the first job will be 
rerun and removed from the pending list.
At the resubmit step, the second job (with the signaled bit) will be 
resubmitted. Since it still has the signaled bit, drm_sched_job_done will 
be called directly. This decreases hw_rq_count, which allows more 
jobs to be emitted, but it does not clean the fence_drv rcu ptr.
This results in an overflow in the fence_drv. Since we use 
num_fences_mask in amdgpu_fence_process, when the overflow happens, the 
signal of some jobs will be skipped, which results in an infinite wait 
for the fence_drv rcu ptr.


So closing the irq before sched_stop could avoid signaling jobs after 
drm_sched_stop. And signaling jobs one by one in fence_process instead of 
using a mask will handle the overflow situation.


Another fix could be to skip submitting jobs which have already signaled 
during the resubmit stage, which may look cleaner.


Please help give some advice.



How about the code below instead? The real problem is that we reuse 
a dma fence twice, which is not according to dma fence design, so maybe 
this can help?



diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

index 8adeb7469f1e..033f0ae16784 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, 
struct dma_fence **f, struct amd

    if (job && job->job_run_counter) {
    /* reinit seq for resubmitted jobs */
    fence->seqno = seq;
+
+   /* For resubmitted job clear the signaled bit */
+   clear_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags);
+
    /* TO be inline with external fence creation and other 
drivers */

    dma_fence_get(fence);
    } else {


Andrey





Thanks,
Victor



-Original Message-
From: Koenig, Christian 
Sent: Thursday, September 15, 2022 2:32 PM
To: Zhao, Victor ; 
amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey 


Cc: Deng, Emily 
Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow



On 15.09.22 at 06:02, Zhao, Victor wrote:

[AMD Official Use Only - General]

Ping.

Hi @Koenig, Christian and @Grodzovsky, Andrey,

We found some reset related issues during stress test on the 
sequence. Please help give some comments.



Thanks,
Victor



-Original Message-
From: Victor Zhao 
Sent: Wednesday, September 14, 2022 6:10 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deng, Emily ; Grodzovsky, Andrey
; Zhao, Victor 
Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

[background]
For a gpu recovery caused by a hang on one ring (e.g. compute), 
jobs from another ring (e.g. gfx) may continue signaling during 
the drm_sched_stop stage. The signaled bit will not be cleared.


At the resubmit stage after recovery, a job with the hw fence 
signaled bit set will call job done directly instead of going 
through fence process.
This makes hw_rq_count decrease, but the rcu fence pointer is not 
cleared yet.


Then an overflow happens in the fence driver slots and some jobs may 
be skipped, leaving the rcu pointer not cleared, which makes an 
infinite wait for the slot on the next fence emitted.


This infinite wait causes a job timeout on the emitting job. And 
the driver will get stuck at its sched stop step because kthread_park 
cannot be done.


[how]
1. move amdgpu_fence_driver_isr_toggle earlier to close the interrupt
   before drm sched stop
2. handle all fences in fence process to avoid skips when overflow happens
Signed-off-by: Victor Zhao 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +---
   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  6 +-
   2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 943c9e750575..c0cfae52f12b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 		amdgpu_virt_fini_data_exchange(adev);
 	}
 
-	amdgpu_fence_driver_isr_toggle(adev, true);
-
 	/* block all schedulers and reset given job's ring */
 	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 		struct amdgpu_ring *ring = adev->rings[i];
@@ -5214,6 +5212,8 @@ int amdgpu_device_gpu_re

Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

2022-09-15 Thread Andrey Grodzovsky



On 2022-09-15 06:09, Zhao, Victor wrote:

[AMD Official Use Only - General]

Hi Christian,

The test sequence is executing a compute engine hang while running a lot of 
containers submitting gfx jobs. We have advanced tdr mode and mode2 reset 
enabled on the driver.
When a compute hang job timeout happens, the 2 jobs on the gfx pending list 
may be signaled after drm_sched_stop. So they will not be removed from the 
pending list but have the DMA_FENCE_FLAG_SIGNALED_BIT set.
At the amdgpu_device_recheck_guilty_jobs step, the first job will be rerun and 
removed from the pending list.
At the resubmit step, the second job (with the signaled bit) will be 
resubmitted. Since it still has the signaled bit, drm_sched_job_done will be 
called directly. This decreases hw_rq_count, which allows more jobs to be 
emitted, but it does not clean the fence_drv rcu ptr.
This results in an overflow in the fence_drv. Since we use num_fences_mask 
in amdgpu_fence_process, when the overflow happens, the signal of some jobs 
will be skipped, which results in an infinite wait for the fence_drv rcu ptr.

So closing the irq before sched_stop could avoid signaling jobs after 
drm_sched_stop. And signaling jobs one by one in fence_process instead of 
using a mask will handle the overflow situation.

Another fix could be to skip submitting jobs which have already signaled 
during the resubmit stage, which may look cleaner.

Please help give some advice.
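
As a rough sketch of that second alternative (hypothetical names, not driver
code; in the driver the check would presumably be dma_fence_is_signaled() on
the job's hw fence before it is pushed to the ring again):

/* resubmit_skip.c - toy model of skipping already-signaled jobs at resubmit */
#include <stdio.h>
#include <stdbool.h>

struct job {
	unsigned int id;
	bool hw_fence_signaled;   /* stands in for DMA_FENCE_FLAG_SIGNALED_BIT */
};

static void job_resubmit(struct job *job)
{
	printf("resubmitting job %u to the ring\n", job->id);
}

static void job_done(struct job *job)
{
	printf("job %u already signaled, completing without a new emit\n", job->id);
}

int main(void)
{
	/* Pending list after the hang: job 2 signaled between stop and resubmit. */
	struct job pending[] = {
		{ .id = 1, .hw_fence_signaled = false },
		{ .id = 2, .hw_fence_signaled = true  },
		{ .id = 3, .hw_fence_signaled = false },
	};
	unsigned int i;

	for (i = 0; i < sizeof(pending) / sizeof(pending[0]); i++) {
		if (pending[i].hw_fence_signaled)
			job_done(&pending[i]);     /* no new seq, no fence slot reused */
		else
			job_resubmit(&pending[i]);
	}
	return 0;
}

The idea is simply that a job whose hw fence already carries the signaled bit
is not fed back through fence emission, so its fence is not reused.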



How about the code below instead? The real problem is that we reuse a 
dma fence twice, which is not according to dma fence design, so maybe 
this can help?



diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

index 8adeb7469f1e..033f0ae16784 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, 
struct dma_fence **f, struct amd

    if (job && job->job_run_counter) {
    /* reinit seq for resubmitted jobs */
    fence->seqno = seq;
+
+   /* For resubmitted job clear the signaled bit */
+   clear_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT, &fence->flags);
+
    /* TO be inline with external fence creation and other 
drivers */

    dma_fence_get(fence);
    } else {


Andrey





Thanks,
Victor



-Original Message-
From: Koenig, Christian 
Sent: Thursday, September 15, 2022 2:32 PM
To: Zhao, Victor ; amd-gfx@lists.freedesktop.org; Grodzovsky, 
Andrey 
Cc: Deng, Emily 
Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow



On 15.09.22 at 06:02, Zhao, Victor wrote:

[AMD Official Use Only - General]

Ping.

Hi @Koenig, Christian and @Grodzovsky, Andrey,

We found some reset related issues during stress test on the sequence. Please 
help give some comments.


Thanks,
Victor



-Original Message-
From: Victor Zhao 
Sent: Wednesday, September 14, 2022 6:10 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deng, Emily ; Grodzovsky, Andrey
; Zhao, Victor 
Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

[background]
For a gpu recovery caused by a hang on one ring (e.g. compute), jobs from 
another ring (e.g. gfx) may continue signaling during the drm_sched_stop 
stage. The signaled bit will not be cleared.

At the resubmit stage after recovery, a job with the hw fence signaled bit set 
will call job done directly instead of going through fence process.
This makes hw_rq_count decrease, but the rcu fence pointer is not cleared yet.

Then an overflow happens in the fence driver slots and some jobs may be 
skipped, leaving the rcu pointer not cleared, which makes an infinite wait for 
the slot on the next fence emitted.

This infinite wait causes a job timeout on the emitting job. And the driver 
will get stuck at its sched stop step because kthread_park cannot be done.

[how]
1. move amdgpu_fence_driver_isr_toggle earlier to close the interrupt
   before drm sched stop
2. handle all fences in fence process to avoid skips when overflow happens

Signed-off-by: Victor Zhao 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +---
   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  6 +-
   2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 943c9e750575..c0cfae52f12b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 		amdgpu_virt_fini_data_exchange(adev);
 	}
 
-	amdgpu_fence_driver_isr_toggle(adev, true);
-
 	/* block all schedulers and reset given job's ring */
 	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 		struct amdgpu_ring *ring = adev->rings[i];
@@ -5214,6 +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
  amdgpu_de

Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

2022-09-15 Thread Christian König

Hi Victor,

The advanced tdr mode is not a feature we want to enable in production 
and keep around for much longer.


So would that issue happen without it as well? If not, it is rather 
questionable whether we should look into fixing this in the first place.


Regards,
Christian.

On 15.09.22 at 12:09, Zhao, Victor wrote:

[AMD Official Use Only - General]

Hi Christian,

The test sequence is executing a compute engine hang while running a lot of 
containers submitting gfx jobs. We have advanced tdr mode and mode2 reset 
enabled on the driver.
When a compute hang job timeout happens, the 2 jobs on the gfx pending list 
may be signaled after drm_sched_stop. So they will not be removed from the 
pending list but have the DMA_FENCE_FLAG_SIGNALED_BIT set.
At the amdgpu_device_recheck_guilty_jobs step, the first job will be rerun and 
removed from the pending list.
At the resubmit step, the second job (with the signaled bit) will be 
resubmitted. Since it still has the signaled bit, drm_sched_job_done will be 
called directly. This decreases hw_rq_count, which allows more jobs to be 
emitted, but it does not clean the fence_drv rcu ptr.
This results in an overflow in the fence_drv. Since we use num_fences_mask 
in amdgpu_fence_process, when the overflow happens, the signal of some jobs 
will be skipped, which results in an infinite wait for the fence_drv rcu ptr.

So closing the irq before sched_stop could avoid signaling jobs after 
drm_sched_stop. And signaling jobs one by one in fence_process instead of 
using a mask will handle the overflow situation.

Another fix could be to skip submitting jobs which have already signaled 
during the resubmit stage, which may look cleaner.

Please help give some advice.


Thanks,
Victor



-Original Message-
From: Koenig, Christian 
Sent: Thursday, September 15, 2022 2:32 PM
To: Zhao, Victor ; amd-gfx@lists.freedesktop.org; Grodzovsky, 
Andrey 
Cc: Deng, Emily 
Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow



On 15.09.22 at 06:02, Zhao, Victor wrote:

[AMD Official Use Only - General]

Ping.

Hi @Koenig, Christian and @Grodzovsky, Andrey,

We found some reset related issues during stress test on the sequence. Please 
help give some comments.


Thanks,
Victor



-Original Message-
From: Victor Zhao 
Sent: Wednesday, September 14, 2022 6:10 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deng, Emily ; Grodzovsky, Andrey
; Zhao, Victor 
Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

[background]
For a gpu recovery caused by a hang on one ring (e.g. compute), jobs from 
another ring (e.g. gfx) may continue signaling during the drm_sched_stop 
stage. The signaled bit will not be cleared.

At the resubmit stage after recovery, a job with the hw fence signaled bit set 
will call job done directly instead of going through fence process.
This makes hw_rq_count decrease, but the rcu fence pointer is not cleared yet.

Then an overflow happens in the fence driver slots and some jobs may be 
skipped, leaving the rcu pointer not cleared, which makes an infinite wait for 
the slot on the next fence emitted.

This infinite wait causes a job timeout on the emitting job. And the driver 
will get stuck at its sched stop step because kthread_park cannot be done.

[how]
1. move amdgpu_fence_driver_isr_toggle earlier to close the interrupt
   before drm sched stop
2. handle all fences in fence process to avoid skips when overflow happens

Signed-off-by: Victor Zhao 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +---
   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  6 +-
   2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 943c9e750575..c0cfae52f12b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device 
*adev,
amdgpu_virt_fini_data_exchange(adev);
}
   
-	amdgpu_fence_driver_isr_toggle(adev, true);

-
/* block all schedulers and reset given job's ring */
for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
struct amdgpu_ring *ring = adev->rings[i]; @@ -5214,6 +5212,8 
@@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
  amdgpu_device_ip_need_full_reset(tmp_adev))
amdgpu_ras_suspend(tmp_adev);
   
+		amdgpu_fence_driver_isr_toggle(tmp_adev, true);
+
for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
struct amdgpu_ring *ring = tmp_adev->rings[i];
   
@@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,

atomic_inc(&tmp_adev->gpu_reset_counter);
}
   
-	if (need_emergency_restart)
+   if (need_emergency_restart) {
+   list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
+   amdgpu_fence_driver_isr_toggle(tmp_adev, false);
+ 

RE: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

2022-09-15 Thread Zhao, Victor
[AMD Official Use Only - General]

Hi Christian,

The test sequence executes a compute engine hang while running a lot of 
containers submitting gfx jobs. We have advanced tdr mode and mode2 reset 
enabled in the driver.
When a compute hang job timeout happens, the 2 jobs on the gfx pending list 
may be signaled after drm_sched_stop, so they will not be removed from the 
pending list but will have the DMA_FENCE_FLAG_SIGNALED_BIT set.
At the amdgpu_device_recheck_guilty_jobs step, the first job will be rerun and 
removed from the pending list.
At the resubmit step, the second job (with the signaled bit) will be resubmitted. 
Since it still has the signaled bit, drm_sched_job_done will be called directly. 
This decreases hw_rq_count, which allows more jobs to be emitted, but does not 
clear the fence_drv rcu ptr.
This results in an overflow in the fence_drv. Since we use num_fences_mask in 
amdgpu_fence_process, when the overflow happens the signaling of some jobs will 
be skipped, which results in an infinite wait for the fence_drv rcu ptr.

So closing the irq before sched_stop avoids jobs being signaled after 
drm_sched_stop. And signaling jobs one by one in fence_process, instead of 
using a mask, handles the overflow situation.

Another fix could be to skip submitting jobs which have already signaled during 
the resubmit stage, which may look cleaner.

Please help give some advice.


Thanks,
Victor



-Original Message-
From: Koenig, Christian  
Sent: Thursday, September 15, 2022 2:32 PM
To: Zhao, Victor ; amd-gfx@lists.freedesktop.org; 
Grodzovsky, Andrey 
Cc: Deng, Emily 
Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow



Am 15.09.22 um 06:02 schrieb Zhao, Victor:
> [AMD Official Use Only - General]
>
> Ping.
>
> Hi @Koenig, Christian and @Grodzovsky, Andrey,
>
> We found some reset-related issues during stress testing of this sequence. 
> Please help give some comments.
>
>
> Thanks,
> Victor
>
>
>
> -Original Message-
> From: Victor Zhao 
> Sent: Wednesday, September 14, 2022 6:10 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deng, Emily ; Grodzovsky, Andrey 
> ; Zhao, Victor 
> Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>
> [background]
> For a gpu recovery caused by a hang on one ring (e.g. compute), jobs from 
> another ring (e.g. gfx) may continue signaling during the drm_sched_stop 
> stage. The signaled bit will not be cleared.
>
> At the resubmit stage after recovery, a job with the hw fence signaled bit 
> set will call job done directly instead of going through fence process.
> This decreases hw_rq_count, but the rcu fence pointer is not cleared yet.
>
> Then overflow happens in the fence driver slots and some jobs may be 
> skipped, leaving the rcu pointer uncleared, which causes an infinite wait 
> for the slot on the next fence emitted.
>
> This infinite wait causes a job timeout on the emitting job, and the driver 
> gets stuck at its sched stop step because kthread_park cannot be done.
>
> [how]
> 1. move amdgpu_fence_driver_isr_toggle earlier to close the interrupt 
>    before drm sched stop
> 2. handle all fences in fence process to avoid skipping fences when 
>    overflow happens
>
> Signed-off-by: Victor Zhao 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +---  
> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  6 +-
>   2 files changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 943c9e750575..c0cfae52f12b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device 
> *adev,
>   amdgpu_virt_fini_data_exchange(adev);
>   }
>   
> - amdgpu_fence_driver_isr_toggle(adev, true);
> -
>   /* block all schedulers and reset given job's ring */
>   for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>   struct amdgpu_ring *ring = adev->rings[i]; @@ -5214,6 +5212,8 
> @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> amdgpu_device_ip_need_full_reset(tmp_adev))
>   amdgpu_ras_suspend(tmp_adev);
>   
> + amdgpu_fence_driver_isr_toggle(tmp_adev, true);
> +
>   for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>   struct amdgpu_ring *ring = tmp_adev->rings[i];
>   
> @@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>   atomic_inc(&tmp_adev->gpu_reset_counter);
>   }
>   
> - if (need_emergency_restart)
> + if (need_emergency_restart) {
> + list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
> +  

Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

2022-09-14 Thread Christian König




Am 15.09.22 um 06:02 schrieb Zhao, Victor:

[AMD Official Use Only - General]

Ping.

Hi @Koenig, Christian and @Grodzovsky, Andrey,

We found some reset-related issues during stress testing of this sequence. Please 
help give some comments.


Thanks,
Victor



-Original Message-
From: Victor Zhao 
Sent: Wednesday, September 14, 2022 6:10 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deng, Emily ; Grodzovsky, Andrey ; 
Zhao, Victor 
Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

[background]
For a gpu recovery caused by a hang on one ring (e.g. compute), jobs from 
another ring (e.g. gfx) may continue signaling during the drm_sched_stop stage. 
The signaled bit will not be cleared.

At the resubmit stage after recovery, a job with the hw fence signaled bit set 
will call job done directly instead of going through fence process.
This decreases hw_rq_count, but the rcu fence pointer is not cleared yet.

Then overflow happens in the fence driver slots and some jobs may be skipped, 
leaving the rcu pointer uncleared, which causes an infinite wait for the slot 
on the next fence emitted.

This infinite wait causes a job timeout on the emitting job, and the driver 
gets stuck at its sched stop step because kthread_park cannot be done.

[how]
1. move amdgpu_fence_driver_isr_toggle earlier to close the interrupt before 
   drm sched stop
2. handle all fences in fence process to avoid skipping fences when overflow 
   happens

Signed-off-by: Victor Zhao 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +---  
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  6 +-
  2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 943c9e750575..c0cfae52f12b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device 
*adev,
amdgpu_virt_fini_data_exchange(adev);
}
  
-	amdgpu_fence_driver_isr_toggle(adev, true);
-
/* block all schedulers and reset given job's ring */
for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
struct amdgpu_ring *ring = adev->rings[i]; @@ -5214,6 +5212,8 
@@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
  amdgpu_device_ip_need_full_reset(tmp_adev))
amdgpu_ras_suspend(tmp_adev);
  
+		amdgpu_fence_driver_isr_toggle(tmp_adev, true);
+
for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
struct amdgpu_ring *ring = tmp_adev->rings[i];
  
@@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,

atomic_inc(&tmp_adev->gpu_reset_counter);
}
  
-	if (need_emergency_restart)
+   if (need_emergency_restart) {
+   list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
+   amdgpu_fence_driver_isr_toggle(tmp_adev, false);
+   }
goto skip_sched_resume;
+   }
  
  	/*

 * Must check guilty signal here since after this point all old @@ 
-5240,6 +5244,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
if (job && dma_fence_is_signaled(&job->hw_fence)) {
job_signaled = true;
dev_info(adev->dev, "Guilty job already signaled, skipping HW 
reset");
+   list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
+   amdgpu_fence_driver_isr_toggle(tmp_adev, false);
+   }
goto skip_hw_reset;
}
  
@@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,

if (r && r == -EAGAIN) {
set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context->flags);
adev->asic_reset_res = 0;
+   amdgpu_fence_driver_isr_toggle(adev, true);
goto retry;
}
}
@@ -5711,6 +5719,8 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev 
*pdev)
set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
  
+	amdgpu_fence_driver_isr_toggle(adev, true);
+
adev->no_hw_access = true;
r = amdgpu_device_pre_asic_reset(adev, &reset_context);
adev->no_hw_access = false;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 8adeb7469f1e..65a877e1a7fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct amdgpu_ring *ring)
if (unlikely(seq == last_seq))
return false;
  
-	last_seq &= drv->num_fences_mask;
-   seq &= drv->num_fences_mask;
-
do {
struct dma_fence *fence, **ptr;
  
  		++last_seq;

-   last_seq &= 

RE: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

2022-09-14 Thread Zhao, Victor
[AMD Official Use Only - General]

Ping.

Hi @Koenig, Christian and @Grodzovsky, Andrey,

We found some reset-related issues during stress testing of this sequence. Please 
help give some comments.


Thanks,
Victor



-Original Message-
From: Victor Zhao  
Sent: Wednesday, September 14, 2022 6:10 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deng, Emily ; Grodzovsky, Andrey 
; Zhao, Victor 
Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

[background]
For a gpu recovery caused by a hang on one ring (e.g. compute), jobs from 
another ring (e.g. gfx) may continue signaling during the drm_sched_stop stage. 
The signaled bit will not be cleared.

At the resubmit stage after recovery, a job with the hw fence signaled bit set 
will call job done directly instead of going through fence process.
This decreases hw_rq_count, but the rcu fence pointer is not cleared yet.

Then overflow happens in the fence driver slots and some jobs may be skipped, 
leaving the rcu pointer uncleared, which causes an infinite wait for the slot 
on the next fence emitted.

This infinite wait causes a job timeout on the emitting job, and the driver 
gets stuck at its sched stop step because kthread_park cannot be done.

[how]
1. move amdgpu_fence_driver_isr_toggle earlier to close the interrupt before 
   drm sched stop
2. handle all fences in fence process to avoid skipping fences when overflow 
   happens

Signed-off-by: Victor Zhao 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +---  
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  6 +-
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 943c9e750575..c0cfae52f12b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device 
*adev,
amdgpu_virt_fini_data_exchange(adev);
}
 
-   amdgpu_fence_driver_isr_toggle(adev, true);
-
/* block all schedulers and reset given job's ring */
for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
struct amdgpu_ring *ring = adev->rings[i]; @@ -5214,6 +5212,8 
@@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
  amdgpu_device_ip_need_full_reset(tmp_adev))
amdgpu_ras_suspend(tmp_adev);
 
+   amdgpu_fence_driver_isr_toggle(tmp_adev, true);
+
for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
struct amdgpu_ring *ring = tmp_adev->rings[i];
 
@@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
atomic_inc(&tmp_adev->gpu_reset_counter);
}
 
-   if (need_emergency_restart)
+   if (need_emergency_restart) {
+   list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
+   amdgpu_fence_driver_isr_toggle(tmp_adev, false);
+   }
goto skip_sched_resume;
+   }
 
/*
 * Must check guilty signal here since after this point all old @@ 
-5240,6 +5244,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
if (job && dma_fence_is_signaled(&job->hw_fence)) {
job_signaled = true;
dev_info(adev->dev, "Guilty job already signaled, skipping HW 
reset");
+   list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
+   amdgpu_fence_driver_isr_toggle(tmp_adev, false);
+   }
goto skip_hw_reset;
}
 
@@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
if (r && r == -EAGAIN) {
set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context->flags);
adev->asic_reset_res = 0;
+   amdgpu_fence_driver_isr_toggle(adev, true);
goto retry;
}
}
@@ -5711,6 +5719,8 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev 
*pdev)
set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
 
+   amdgpu_fence_driver_isr_toggle(adev, true);
+
adev->no_hw_access = true;
r = amdgpu_device_pre_asic_reset(adev, &reset_context);
adev->no_hw_access = false;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 8adeb7469f1e..65a877e1a7fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct amdgpu_ring *ring)
if (unlikely(seq == last_seq))
return false;
 
-   last_seq &= drv->num_fences_mask;
-   seq &= drv->num_fences_mask;
-
do {
struct dma_fence *fence, **ptr;
 
++last_seq;
-   last_seq &= drv->num_fenc