RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

2019-12-18 Thread Liu, Monk
>>> I would like to check why we need a special sequence for SR-IOV in this
>>> pre_reset. If possible, making it the same as bare-metal mode sounds like
>>> a better solution.

Because calling this function before VF FLR would lead to register access
through KIQ, which will not complete because KIQ/GFX has already hung by that
time.

>>> I don't remember any register access in the amdkfd_pre_reset call; please
>>> let me know if this assumption is wrong.

Please check "void pm_uninit(struct packet_manager *pm)", which is invoked
inside amdkfd_pre_reset():

It calls uninitialize() in kfd_kernel_queue.c,

then goes down the path of "kq->mqd_mgr->destroy_mqd(...)",

and finally calls "static int kgd_hqd_destroy(...)" in
amdgpu_amdkfd_gfx_v10.c:


{
	struct amdgpu_device *adev = get_amdgpu_device(kgd);
	enum hqd_dequeue_request_type type;
	unsigned long end_jiffies;
	uint32_t temp;
	struct v10_compute_mqd *m = get_mqd(mqd);

#if 0
	unsigned long flags;
	int retry;
#endif

	/* this introduces register access via KIQ */
	acquire_queue(kgd, pipe_id, queue_id);

	if (m->cp_hqd_vmid == 0)
		/* this introduces register access via KIQ */
		WREG32_FIELD15(GC, 0, RLC_CP_SCHEDULERS, scheduler1, 0);

	switch (reset_type) {
	case KFD_PREEMPT_TYPE_WAVEFRONT_DRAIN:
		type = DRAIN_PIPE;
		break;
	case KFD_PREEMPT_TYPE_WAVEFRONT_RESET:
		type = RESET_WAVES;
		break;
	default:
		type = DRAIN_PIPE;
		break;
	}

	/* [...] */

	/* this introduces register access via KIQ */
	WREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_DEQUEUE_REQUEST), type);

	end_jiffies = (utimeout * HZ / 1000) + jiffies;
	while (true) {
		/* this introduces register access via KIQ */
		temp = RREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_ACTIVE));
		if (!(temp & CP_HQD_ACTIVE__ACTIVE_MASK))
			break;
		if (time_after(jiffies, end_jiffies)) {
			pr_err("cp queue preemption time out.\n");
			release_queue(kgd);
			return -ETIME;
		}
		usleep_range(500, 1000);
	}

	release_queue(kgd);
	return 0;
}

If we use the bare-metal sequence, none of the register accesses marked above
will work, because KIQ/GFX has already died by that time, which means
amdkfd_pre_reset() does not actually work as expected.
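
To illustrate why the dequeue polling loop cannot succeed once the access path
is dead, here is a minimal user-space sketch, not driver code. All names
(sketch_queue, sketch_read_active, sketch_wait_dequeue, SKETCH_*) are
hypothetical, and a hung KIQ is modelled, purely as an assumption, by a read
that always reports the queue as still active.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_ACTIVE_MASK 0x1u /* hypothetical stand-in for CP_HQD_ACTIVE__ACTIVE_MASK */
#define SKETCH_ETIME       62   /* hypothetical stand-in for the kernel's ETIME */

/* Hypothetical model of one hardware queue and the path used to reach it. */
struct sketch_queue {
    bool     kiq_alive;     /* false once KIQ/GFX has hung */
    uint32_t hqd_active;    /* value the "active" register would hold */
    unsigned polls_to_idle; /* reads needed before a healthy queue goes idle */
};

/*
 * Read the "active" register. Assumption: when the access path is dead the
 * read never yields a fresh value, so the queue keeps looking active.
 */
static uint32_t sketch_read_active(struct sketch_queue *q)
{
    if (!q->kiq_alive)
        return SKETCH_ACTIVE_MASK;
    if (q->polls_to_idle > 0)
        q->polls_to_idle--;
    if (q->polls_to_idle == 0)
        q->hqd_active = 0;
    return q->hqd_active;
}

/* Bounded poll with the same shape as the while (true) loop in the excerpt. */
static int sketch_wait_dequeue(struct sketch_queue *q, unsigned max_polls)
{
    for (unsigned i = 0; i < max_polls; i++) {
        if (!(sketch_read_active(q) & SKETCH_ACTIVE_MASK))
            return 0;
    }
    return -SKETCH_ETIME;
}
```

With kiq_alive set to false, sketch_wait_dequeue() exhausts its polls and
returns -SKETCH_ETIME, which corresponds to the "cp queue preemption time out"
branch in the excerpt.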

_
Monk Liu|GPU Virtualization Team |AMD

Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

2019-12-18 Thread Liu, Shaoyun
I don't remember any register access in the amdkfd_pre_reset call; please let
me know if this assumption is wrong.
This function uses the HIQ to access the CP. If the CP has already hung, we
might not get a response from the HW and will hit a timeout. I think KFD
should handle this internally. Felix already has some comments on that.
I would like to check why we need a special sequence for SR-IOV in this
pre_reset. If possible, making it the same as bare-metal mode sounds like a
better solution.

Regards
Shaoyun.liu


RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

2019-12-18 Thread Liu, Monk
Oh, by the way

>>> Do we know the root cause why this function would ruin MEC?

Only calling this function right after VF FLR ruins the MEC and leads to the
subsequent KIQ ring test failure. On bare metal it is called before the GPU
reset, which is why we don't have this issue on bare metal.

But the reason we cannot call it before VF FLR in the SR-IOV case was already
stated in this thread.

Thanks
_
Monk Liu|GPU Virtualization Team |AMD


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

2019-12-18 Thread Liu, Monk
Hi Shaoyun

>>> Do we know the root cause why this function would ruin MEC? From the
>>> logic, I think this function should be called before FLR since we need to
>>> disable the user queue submission first.
Right now I don't know which specific step leads to the KIQ ring test failure.
I totally agree with you that this function should be called before VF FLR,
but we cannot do that, and the reason is described in the patch description:

> if we do pre_reset() before VF FLR, it would go KIQ way to do register 
> access and stuck there, because KIQ probably won't work by that time 
> (e.g. you already made GFX hang)


>>> I remembered the function should use HIQ to communicate with HW and
>>> shouldn't use KIQ to access HW registers; has this been changed?
This function uses WREG32/RREG32 for register access, like all other functions
in the KMD, and WREG32/RREG32 route the register access through KIQ when we
are in dynamic SR-IOV mode (meaning we are an SR-IOV VF and not in full
exclusive mode).
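
The routing rule described above can be sketched as a small decision function.
This is only an illustration in user-space C: the struct, the enum and the
function name are hypothetical, and the assumption that the non-KIQ path is
direct MMIO is mine, not something stated in this mail.

```c
#include <stdbool.h>

/* Hypothetical device state; the real driver tracks this differently. */
struct sketch_dev {
    bool is_sriov_vf;    /* running as an SR-IOV virtual function */
    bool full_exclusive; /* VF currently holds full exclusive access */
};

/* Assumed set of access paths: direct MMIO, or a request handed to KIQ. */
enum sketch_path { SKETCH_PATH_MMIO, SKETCH_PATH_KIQ };

/* Pick the path a WREG32/RREG32-style access would take in this model. */
static enum sketch_path sketch_reg_path(const struct sketch_dev *dev)
{
    /* "Dynamic" mode: an SR-IOV VF that is not in full exclusive mode. */
    if (dev->is_sriov_vf && !dev->full_exclusive)
        return SKETCH_PATH_KIQ;
    return SKETCH_PATH_MMIO;
}
```

In this model a VF outside full exclusive mode always ends up on the KIQ path,
which is the case that cannot make progress once KIQ has stopped responding.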

You can see that if you call this function before EVENT_5 (event 5 triggers
VF FLR), it runs in dynamic mode and KIQ handles the register access, which
is not an option since ME/MEC has probably already hung (if we are testing
quark on gfx/compute rings).

Do you have a good suggestion?

thanks
_
Monk Liu|GPU Virtualization Team |AMD




Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

2019-12-17 Thread Felix Kuehling

I agree. Removing the call to pre-reset probably breaks GPU reset for KFD.

We call the KFD suspend function in pre-reset, which uses the HIQ to 
stop any user mode queues still running. If that is not possible because 
the HIQ is hanging, it should fail with a timeout. There may be 
something we can do if we know that the HIQ is hanging, so we only 
update the KFD-internal queue state without actually sending anything to 
the HIQ.
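
One possible shape of that idea, as a rough user-space sketch rather than KFD
code: if the HIQ is known to be hanging, skip the hardware request and only
update the internal state. Every name here (sketch_kfd, sketch_hiq_unmap,
sketch_kfd_suspend, SKETCH_ETIME) is hypothetical, and how the driver would
learn that the HIQ is hanging is left open.

```c
#include <stdbool.h>

#define SKETCH_ETIME 62 /* hypothetical stand-in for the kernel's ETIME */

/* Hypothetical KFD-side bookkeeping. */
struct sketch_kfd {
    bool     hiq_hung;      /* assumed to be known by some other means */
    bool     queues_active; /* internal view of the user mode queues */
    unsigned hiq_packets;   /* how many requests were sent to the HIQ */
};

/* Pretend to send an unmap request; it times out when the HIQ is hanging. */
static int sketch_hiq_unmap(struct sketch_kfd *kfd)
{
    kfd->hiq_packets++;
    return kfd->hiq_hung ? -SKETCH_ETIME : 0;
}

/* Suspend: talk to the HIQ only when it is believed to be responsive. */
static int sketch_kfd_suspend(struct sketch_kfd *kfd)
{
    if (!kfd->hiq_hung) {
        int r = sketch_hiq_unmap(kfd);
        if (r)
            return r;
    }
    /* In both cases the internal state now says the queues are stopped. */
    kfd->queues_active = false;
    return 0;
}
```

In this sketch a hanging HIQ never receives a packet, yet the internal state
still records the queues as stopped, so later reset steps would not treat them
as running.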


Regards,
  Felix



Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

2019-12-17 Thread shaoyunl

I think the amdkfd side depends on this call to stop the user queues. Without
this call, the user queues can submit to HW during the reset, which could
cause a hang again ...
Do we know the root cause of why this function would ruin MEC? From the logic,
I think this function should be called before FLR, since we need to disable
user queue submission first.
I remember the function should use HIQ to communicate with HW and shouldn't
use KIQ to access HW registers; has this been changed?


Regards
shaoyun.liu




RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

2019-12-17 Thread Deng, Emily
[AMD Official Use Only - Internal Distribution Only]

Reviewed-by: Emily Deng 

>-Original Message-
>From: amd-gfx  On Behalf Of Monk Liu
>Sent: Tuesday, December 17, 2019 6:20 PM
>To: amd-gfx@lists.freedesktop.org
>Cc: Liu, Monk 
>Subject: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV
>
>issues:
>MEC is ruined by the amdkfd_pre_reset after VF FLR done
>
>fix:
>amdkfd_pre_reset() would ruin MEC after hypervisor finished the VF FLR, the
>correct sequence is do amdkfd_pre_reset before VF FLR but there is a limitation
>to block this sequence:
>if we do pre_reset() before VF FLR, it would go KIQ way to do register access
>and stuck there, because KIQ probably won't work by that time (e.g. you
>already made GFX hang)
>
>so the best way right now is to simply remove it.
>
>Signed-off-by: Monk Liu 
>---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 --
> 1 file changed, 2 deletions(-)
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>index 605cef6..ae962b9 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>@@ -3672,8 +3672,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
> 	if (r)
> 		return r;
>
>-	amdgpu_amdkfd_pre_reset(adev);
>-
> 	/* Resume IP prior to SMC */
> 	r = amdgpu_device_ip_reinit_early_sriov(adev);
> 	if (r)
>--
>2.7.4
>