RE: [PATCH] drm/amdgpu: Increase tlb flush timeout for sriov

2021-05-19 Thread Liu, Cheng Zhe
[AMD Official Use Only]

We support 12 VF at most. In worst case, the first 11 all IDLE fail and do FLR, 
it will need 11 * 500ms to switch to the 12nd VF,
so I set 12 * 500ms  for the timeout.

-Original Message-
From: Christian König  
Sent: Wednesday, May 19, 2021 6:08 PM
To: Liu, Cheng Zhe ; amd-gfx@lists.freedesktop.org
Cc: Xiao, Jack ; Xu, Feifei ; Wang, 
Kevin(Yang) ; Tuikov, Luben ; 
Deucher, Alexander ; Koenig, Christian 
; Zhang, Hawking 
Subject: Re: [PATCH] drm/amdgpu: Increase tlb flush timeout for sriov

Am 19.05.21 um 11:32 schrieb Chengzhe Liu:
> When there is 12 VF, we need to increase the timeout

NAK, 6 seconds is way to long to wait polling on a fence.

Why should an invalidation take that long? The engine are per VF just to avoid 
exactly that problem.

Christian.

>
> Signed-off-by: Chengzhe Liu 
> ---
>   drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 6 +-
>   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 6 +-
>   2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> index f02dc904e4cf..a5f005c5d0ec 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> @@ -404,6 +404,7 @@ static int gmc_v10_0_flush_gpu_tlb_pasid(struct 
> amdgpu_device *adev,
>   uint32_t seq;
>   uint16_t queried_pasid;
>   bool ret;
> + uint32_t sriov_usec_timeout = 600;  /* wait for 12 * 500ms for 
> +SRIOV */
>   struct amdgpu_ring *ring = >gfx.kiq.ring;
>   struct amdgpu_kiq *kiq = >gfx.kiq;
>   
> @@ -422,7 +423,10 @@ static int gmc_v10_0_flush_gpu_tlb_pasid(struct 
> amdgpu_device *adev,
>   
>   amdgpu_ring_commit(ring);
>   spin_unlock(>gfx.kiq.ring_lock);
> - r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);
> + if (amdgpu_sriov_vf(adev))
> + r = amdgpu_fence_wait_polling(ring, seq, 
> sriov_usec_timeout);
> + else
> + r = amdgpu_fence_wait_polling(ring, seq, 
> adev->usec_timeout);
>   if (r < 1) {
>   dev_err(adev->dev, "wait for kiq fence error: %ld.\n", 
> r);
>   return -ETIME;
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index ceb3968d8326..e4a18d8f75c2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -857,6 +857,7 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct 
> amdgpu_device *adev,
>   uint32_t seq;
>   uint16_t queried_pasid;
>   bool ret;
> + uint32_t sriov_usec_timeout = 600;  /* wait for 12 * 500ms for 
> +SRIOV */
>   struct amdgpu_ring *ring = >gfx.kiq.ring;
>   struct amdgpu_kiq *kiq = >gfx.kiq;
>   
> @@ -896,7 +897,10 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct 
> amdgpu_device *adev,
>   
>   amdgpu_ring_commit(ring);
>   spin_unlock(>gfx.kiq.ring_lock);
> - r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);
> + if (amdgpu_sriov_vf(adev))
> + r = amdgpu_fence_wait_polling(ring, seq, 
> sriov_usec_timeout);
> + else
> + r = amdgpu_fence_wait_polling(ring, seq, 
> adev->usec_timeout);
>   if (r < 1) {
>   dev_err(adev->dev, "wait for kiq fence error: %ld.\n", 
> r);
>   up_read(>reset_sem);
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/amdgpu: Return -EINVAL when whole gpu reset happened

2020-12-10 Thread Liu, Cheng Zhe
[AMD Public Use]

Hi Christian, 

I discussed with @Liu, Monk and @Chen, Horace, it hasn't decided yet.  I'll add 
you and Alex to the mailing list. 

-Original Message-
From: Christian König  
Sent: Thursday, December 10, 2020 5:29 PM
To: Liu, Cheng Zhe ; Koenig, Christian 
; amd-gfx@lists.freedesktop.org
Cc: Xiao, Jack ; Xu, Feifei ; Wang, 
Kevin(Yang) ; Tuikov, Luben ; 
Deucher, Alexander ; Yuan, Xiaojie 
; Zhang, Hawking 
Subject: Re: [PATCH] drm/amdgpu: Return -EINVAL when whole gpu reset happened

Hi Cheng,

well with whom you discussed that? Such discussions should happen on the 
mailing list and at minimum include Alex and me.

Anyway as discussed before this is not acceptable. We already discussed this 
years ago and the behavior is fixed and not changeable.

Regards,
Christian.

Am 10.12.20 um 03:25 schrieb Liu, Cheng Zhe:
> [AMD Public Use]
>
> Yeah, we discussed this issue again. We think it's better UMD make some 
> changes instead of changing in KMD. If FLR happened, it's OK for UMD create 
> new context and continue to submit jobs.
> However, if BACO or mode 1 reset happens, of course UMD could also submit 
> jobs, but these jobs can't use any resource create before the reset, 
> including page table.
> Because all the contents in VRAM has lost after BACO or mode 1 reset, which 
> including APP's buffer.
>
> -Original Message-
> From: Koenig, Christian 
> Sent: Wednesday, December 9, 2020 6:06 PM
> To: Liu, Cheng Zhe ; 
> amd-gfx@lists.freedesktop.org
> Cc: Tuikov, Luben ; Deucher, Alexander 
> ; Xiao, Jack ; Zhang, 
> Hawking ; Xu, Feifei ; Wang, 
> Kevin(Yang) ; Yuan, Xiaojie 
> 
> Subject: Re: [PATCH] drm/amdgpu: Return -EINVAL when whole gpu reset 
> happened
>
> Am 09.12.20 um 10:46 schrieb Liu ChengZhe:
>> If CS init return -ECANCELED, UMD will free and create new context.
>> Job in this new context could conitnue exexcuting. In the case of 
>> BACO or mode 1, we can't allow this happpen. Because VRAM has lost 
>> after whole gpu reset, the job can't guarantee to succeed.
> NAK, this is intentional.
>
> When ECANCELED is returned UMD should create new context after a GPU reset to 
> get back into an usable state and continue to submit jobs.
>
> Regards,
> Christian.
>
>> Signed-off-by: Liu ChengZhe 
>> ---
>>drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 9 +++--
>>1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> index 85e48c29a57c..2a98f58134ed 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> @@ -120,6 +120,7 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser 
>> *p, union drm_amdgpu_cs
>>  uint64_t *chunk_array;
>>  unsigned size, num_ibs = 0;
>>  uint32_t uf_offset = 0;
>> +uint32_t vramlost_count = 0;
>>  int i;
>>  int ret;
>>
>> @@ -140,7 +141,11 @@ static int amdgpu_cs_parser_init(struct 
>> amdgpu_cs_parser *p, union drm_amdgpu_cs
>>
>>  /* skip guilty context job */
>>  if (atomic_read(>ctx->guilty) == 1) {
>> -ret = -ECANCELED;
>> +vramlost_count = atomic_read(>adev->vram_lost_counter);
>> +if (p->ctx->vram_lost_counter != vramlost_count)
>> +ret = -EINVAL;
>> +else
>> +ret = -ECANCELED;
>>  goto free_chunk;
>>  }
>>
>> @@ -246,7 +251,7 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser 
>> *p, union drm_amdgpu_cs
>>  goto free_all_kdata;
>>
>>  if (p->ctx->vram_lost_counter != p->job->vram_lost_counter) {
>> -ret = -ECANCELED;
>> +ret = -EINVAL;
>>  goto free_all_kdata;
>>  }
>>
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flist
> s.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfxdata=04%7C01%7CCh
> engZhe.Liu%40amd.com%7C56e4519b79184cdab3e508d89cedf8ba%7C3dd8961fe488
> 4e608e11a82d994e183d%7C0%7C0%7C637431893767203117%7CUnknown%7CTWFpbGZs
> b3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D
> %7C3000sdata=sQIc%2FhxC%2BotX49yD7Vhw5FfxL96Nz4IPAkcXMUVZS6c%3D
> mp;reserved=0
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/amdgpu: Return -EINVAL when whole gpu reset happened

2020-12-09 Thread Liu, Cheng Zhe
[AMD Public Use]

Yeah, we discussed this issue again. We think it's better UMD make some changes 
instead of changing in KMD. If FLR happened, it's OK for UMD create new context 
and continue to submit jobs.
However, if BACO or mode 1 reset happens, of course UMD could also submit jobs, 
but these jobs can't use any resource create before the reset, including page 
table.
Because all the contents in VRAM has lost after BACO or mode 1 reset, which 
including APP's buffer.

-Original Message-
From: Koenig, Christian  
Sent: Wednesday, December 9, 2020 6:06 PM
To: Liu, Cheng Zhe ; amd-gfx@lists.freedesktop.org
Cc: Tuikov, Luben ; Deucher, Alexander 
; Xiao, Jack ; Zhang, Hawking 
; Xu, Feifei ; Wang, Kevin(Yang) 
; Yuan, Xiaojie 
Subject: Re: [PATCH] drm/amdgpu: Return -EINVAL when whole gpu reset happened

Am 09.12.20 um 10:46 schrieb Liu ChengZhe:
> If CS init return -ECANCELED, UMD will free and create new context.
> Job in this new context could conitnue exexcuting. In the case of BACO 
> or mode 1, we can't allow this happpen. Because VRAM has lost after 
> whole gpu reset, the job can't guarantee to succeed.

NAK, this is intentional.

When ECANCELED is returned UMD should create new context after a GPU reset to 
get back into an usable state and continue to submit jobs.

Regards,
Christian.

>
> Signed-off-by: Liu ChengZhe 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 9 +++--
>   1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 85e48c29a57c..2a98f58134ed 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -120,6 +120,7 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser 
> *p, union drm_amdgpu_cs
>   uint64_t *chunk_array;
>   unsigned size, num_ibs = 0;
>   uint32_t uf_offset = 0;
> + uint32_t vramlost_count = 0;
>   int i;
>   int ret;
>   
> @@ -140,7 +141,11 @@ static int amdgpu_cs_parser_init(struct 
> amdgpu_cs_parser *p, union drm_amdgpu_cs
>   
>   /* skip guilty context job */
>   if (atomic_read(>ctx->guilty) == 1) {
> - ret = -ECANCELED;
> + vramlost_count = atomic_read(>adev->vram_lost_counter);
> + if (p->ctx->vram_lost_counter != vramlost_count)
> + ret = -EINVAL;
> + else
> + ret = -ECANCELED;
>   goto free_chunk;
>   }
>   
> @@ -246,7 +251,7 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser 
> *p, union drm_amdgpu_cs
>   goto free_all_kdata;
>   
>   if (p->ctx->vram_lost_counter != p->job->vram_lost_counter) {
> - ret = -ECANCELED;
> + ret = -EINVAL;
>   goto free_all_kdata;
>   }
>   
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/amdgpu: fix PSP autoload twice in FLR

2020-07-29 Thread Liu, Cheng Zhe
[AMD Official Use Only - Internal Distribution Only]

Thanks for the modification!

-Original Message-
From: Tuikov, Luben  
Sent: Wednesday, July 29, 2020 2:49 AM
To: Liu, Cheng Zhe ; amd-gfx@lists.freedesktop.org
Cc: Koenig, Christian ; Deucher, Alexander 
; Xiao, Jack ; Zhang, Hawking 
; Xu, Feifei ; Wang, Kevin(Yang) 
; Yuan, Xiaojie 
Subject: Re: [PATCH] drm/amdgpu: fix PSP autoload twice in FLR

On 2020-07-28 2:04 p.m., Luben Tuikov wrote:
> Thanks for removing the braces.
> 
> On 2020-07-27 10:29 p.m., Liu ChengZhe wrote:
>> the block->status.hw = false assignment will overwrite PSP's previous
> ^^
> You want to start a sentence here. Capitalize "The".
> Also don't use future tense in commit descriptions (and commit titles).
> Simply use present tense. Using future tense makes it confusing if 
> this is what the code used to do before this change or if the code is 
> doing this right now as someone is reading the commit in the future with "git 
> log".
> 
>> hw status, which will cause PSP execute resume operation after hw init.
> 
> I've found it best to describe what's being done as if telling a story.
> Break it down into "tell what's happening" and "tell what what's fixed 
> and how it affects the rest of the system". Something like this:
> 
>   Assigning false to block->status.hw overwrites PSP's previous
>   hardware status, which causes the PSP to resume operation after
>   hardware init.
> 
>   Remove this assignment and let the PSP start when it is told to.
> 
> Check if the above rendition of your change is correct, and use it if so.

Double checking now, since "resume" is an op, you should capitalize it.

"... which causes the PSP to execute Resume operation right  after hardware 
init.

 Remove this assignment and let the PSP execute Resume operation when it  is 
told to do so."

Or something to that effect.

Regards,
Luben

> 
> Regards,
> Luben
> 
>>
>> v2: (R)remove the braces(.)
> 
> 
> 
>>
>> Signed-off-by: Liu ChengZhe 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 62ecac97fbd2..5d9affa1d35a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -2574,6 +2574,9 @@ static int amdgpu_device_ip_reinit_early_sriov(struct 
>> amdgpu_device *adev)
>>  AMD_IP_BLOCK_TYPE_IH,
>>  };
>>  
>> +for (i = 0; i < adev->num_ip_blocks; i++)
>> +adev->ip_blocks[i].status.hw = false;
>> +
>>  for (i = 0; i < ARRAY_SIZE(ip_order); i++) {
>>  int j;
>>  struct amdgpu_ip_block *block;
>> @@ -2581,7 +2584,6 @@ static int amdgpu_device_ip_reinit_early_sriov(struct 
>> amdgpu_device *adev)
>>  for (j = 0; j < adev->num_ip_blocks; j++) {
>>  block = >ip_blocks[j];
>>  
>> -block->status.hw = false;
>>  if (block->version->type != ip_order[i] ||
>>  !block->status.valid)
>>  continue;
>>
> 
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx