RE: [PATCH 1/2] drm/amdgpu: fix reload KMD hang on KIQ

2020-08-04 Thread Deng, Emily
[AMD Official Use Only - Internal Distribution Only]

>-----Original Message-----
>From: amd-gfx On Behalf Of Liu, Monk
>Sent: Tuesday, August 4, 2020 2:31 PM
>To: amd-gfx@lists.freedesktop.org
>Subject: RE: [PATCH 1/2] drm/amdgpu: fix reload KMD hang on KIQ
>
>[AMD Official Use Only - Internal Distribution Only]
>
>Ping ... this is a severe bug fix
>
>_
>Monk Liu|GPU Virtualization Team |AMD
>
>
>-----Original Message-----
>From: amd-gfx On Behalf Of Liu, Monk
>Sent: Monday, August 3, 2020 9:55 AM
>To: Kuehling, Felix ; amd-gfx@lists.freedesktop.org
>Subject: RE: [PATCH 1/2] drm/amdgpu: fix reload KMD hang on KIQ
>
>[AMD Official Use Only - Internal Distribution Only]
>
>>>In gfx_v10_0_sw_fini the KIQ ring gets freed. Wouldn't that be the
>>>right place to stop the KIQ
>
>For SRIOV, the KIQ (CPC) is never stopped in SW_FINI: the CPC DISABLE step is
>skipped, because SRIOV relies on the KIQ to do the world switch.
>
>But this is a really weird bug: the same approach does not make the KIQ (CP)
>hang on GFX9; only GFX10 needs this patch.
>
>So far I have no other good explanation, and no better fix than this one.
>
>_
>Monk Liu|GPU Virtualization Team |AMD
>
>
>-----Original Message-----
>From: Kuehling, Felix 
>Sent: Friday, July 31, 2020 9:57 PM
>To: Liu, Monk ; amd-gfx@lists.freedesktop.org
>Subject: Re: [PATCH 1/2] drm/amdgpu: fix reload KMD hang on KIQ
>
>In gfx_v10_0_sw_fini the KIQ ring gets freed. Wouldn't that be the right place
>to stop the KIQ? Otherwise KIQ will hang as soon as someone allocates the
>memory that was previously used for the KIQ ring buffer and overwrites it with
>something that's not a valid PM4 packet.
>
>Regards,
>  Felix
>
>Am 2020-07-31 um 3:51 a.m. schrieb Monk Liu:
>> KIQ will hang if we try the steps below:
>> modprobe amdgpu
>> rmmod amdgpu
>> modprobe amdgpu sched_hw_submission=4
>>
>> The cause is that the KIQ stays alive even after we unload the KMD,
>> so when the KMD is reloaded the KIQ crashes once its registers are
>> programmed with values that differ from the previous configuration
>> (settings such as the HQD address and ring size change easily if we
>> alter sched_hw_submission).
>>
>> The fix is to deactivate the KIQ first, before touching any of its
>> registers.
>>
>> Signed-off-by: Monk Liu 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> index db9f1e8..f571e25 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> @@ -6433,6 +6433,9 @@ static int gfx_v10_0_kiq_init_register(struct amdgpu_ring *ring)
>>  struct v10_compute_mqd *mqd = ring->mqd_ptr;
>>  int j;
>>
>> +/* deactivate the queue */
>> +WREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE, 0);
>> +
Could we move the following block to here?
	if (RREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE) & 1) {
		WREG32_SOC15(GC, 0, mmCP_HQD_DEQUEUE_REQUEST, 1);
		for (j = 0; j < adev->usec_timeout; j++) {
			if (!(RREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE) & 1))
				break;
			udelay(1);
		}
	}
>>  /* disable wptr polling */
>>  WREG32_FIELD15(GC, 0, CP_PQ_WPTR_POLL_CNTL, EN, 0);
>>
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
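
Editor's sketch of the dequeue-then-poll sequence suggested in the reply above: request a dequeue only if the queue is active, then poll the ACTIVE bit for a bounded number of iterations. This is a software model under stated assumptions, not amdgpu code. `struct fake_hqd`, `fake_read_active()` and `fake_hqd_dequeue()` are hypothetical names, the "hardware" is simulated by a countdown, and the timeout return value is an addition for illustration (the quoted loop simply falls through).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the CP_HQD_ACTIVE / CP_HQD_DEQUEUE_REQUEST pair.
 * This models the registers in plain memory; it is not the amdgpu interface. */
struct fake_hqd {
	uint32_t active;          /* bit 0 is set while the queue is active */
	uint32_t dequeue_request; /* non-zero once a dequeue has been requested */
	int polls_until_idle;     /* model knob: reads before ACTIVE clears, <0 = never */
};

/* Model of a register read: once a dequeue is requested, the simulated
 * hardware clears ACTIVE after polls_until_idle further reads. */
static uint32_t fake_read_active(struct fake_hqd *q)
{
	if (q->dequeue_request && (q->active & 1) && q->polls_until_idle >= 0) {
		if (q->polls_until_idle == 0)
			q->active &= ~1u;
		else
			q->polls_until_idle--;
	}
	return q->active;
}

/* Request a dequeue if the queue is active, then poll with a bounded timeout.
 * Returns 0 if the queue is idle on exit, -1 if the timeout expired. */
static int fake_hqd_dequeue(struct fake_hqd *q, int usec_timeout)
{
	int j;

	if (!(fake_read_active(q) & 1))
		return 0; /* already idle, nothing to request */

	q->dequeue_request = 1;
	for (j = 0; j < usec_timeout; j++) {
		if (!(fake_read_active(q) & 1))
			return 0;
		/* a real driver would delay about 1 us between polls here */
	}
	return -1;
}
```

A caller would treat a -1 result as "queue did not go idle" and decide separately whether it is still safe to reprogram the queue registers.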


RE: [PATCH 1/2] drm/amdgpu: fix reload KMD hang on KIQ

2020-08-04 Thread Liu, Monk
[AMD Official Use Only - Internal Distribution Only]

Ping ... this is a severe bug fix

_
Monk Liu|GPU Virtualization Team |AMD


-----Original Message-----
From: amd-gfx On Behalf Of Liu, Monk
Sent: Monday, August 3, 2020 9:55 AM
To: Kuehling, Felix ; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH 1/2] drm/amdgpu: fix reload KMD hang on KIQ

[AMD Official Use Only - Internal Distribution Only]

>>In gfx_v10_0_sw_fini the KIQ ring gets freed. Wouldn't that be the
>>right place to stop the KIQ

For SRIOV, the KIQ (CPC) is never stopped in SW_FINI: the CPC DISABLE step is
skipped, because SRIOV relies on the KIQ to do the world switch.

But this is a really weird bug: the same approach does not make the KIQ (CP)
hang on GFX9; only GFX10 needs this patch.

So far I have no other good explanation, and no better fix than this one.

_
Monk Liu|GPU Virtualization Team |AMD


-----Original Message-----
From: Kuehling, Felix 
Sent: Friday, July 31, 2020 9:57 PM
To: Liu, Monk ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/2] drm/amdgpu: fix reload KMD hang on KIQ

In gfx_v10_0_sw_fini the KIQ ring gets freed. Wouldn't that be the right place 
to stop the KIQ? Otherwise KIQ will hang as soon as someone allocates the 
memory that was previously used for the KIQ ring buffer and overwrites it with 
something that's not a valid PM4 packet.

Regards,
  Felix

Am 2020-07-31 um 3:51 a.m. schrieb Monk Liu:
> KIQ will hang if we try the steps below:
> modprobe amdgpu
> rmmod amdgpu
> modprobe amdgpu sched_hw_submission=4
>
> The cause is that the KIQ stays alive even after we unload the KMD,
> so when the KMD is reloaded the KIQ crashes once its registers are
> programmed with values that differ from the previous configuration
> (settings such as the HQD address and ring size change easily if we
> alter sched_hw_submission).
>
> The fix is to deactivate the KIQ first, before touching any of its
> registers.
>
> Signed-off-by: Monk Liu 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> index db9f1e8..f571e25 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> @@ -6433,6 +6433,9 @@ static int gfx_v10_0_kiq_init_register(struct amdgpu_ring *ring)
>  struct v10_compute_mqd *mqd = ring->mqd_ptr;
>  int j;
>
> +/* deactivate the queue */
> +WREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE, 0);
> +
>  /* disable wptr polling */
>  WREG32_FIELD15(GC, 0, CP_PQ_WPTR_POLL_CNTL, EN, 0);
>


RE: [PATCH 1/2] drm/amdgpu: fix reload KMD hang on KIQ

2020-08-02 Thread Liu, Monk
[AMD Official Use Only - Internal Distribution Only]

>>In gfx_v10_0_sw_fini the KIQ ring gets freed. Wouldn't that be the right 
>>place to stop the KIQ

For SRIOV, the KIQ (CPC) is never stopped in SW_FINI: the CPC DISABLE step is
skipped, because SRIOV relies on the KIQ to do the world switch.

But this is a really weird bug: the same approach does not make the KIQ (CP)
hang on GFX9; only GFX10 needs this patch.

So far I have no other good explanation, and no better fix than this one.

_
Monk Liu|GPU Virtualization Team |AMD


-----Original Message-----
From: Kuehling, Felix 
Sent: Friday, July 31, 2020 9:57 PM
To: Liu, Monk ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/2] drm/amdgpu: fix reload KMD hang on KIQ

In gfx_v10_0_sw_fini the KIQ ring gets freed. Wouldn't that be the right place 
to stop the KIQ? Otherwise KIQ will hang as soon as someone allocates the 
memory that was previously used for the KIQ ring buffer and overwrites it with 
something that's not a valid PM4 packet.

Regards,
  Felix

Am 2020-07-31 um 3:51 a.m. schrieb Monk Liu:
> KIQ will hang if we try the steps below:
> modprobe amdgpu
> rmmod amdgpu
> modprobe amdgpu sched_hw_submission=4
>
> The cause is that the KIQ stays alive even after we unload the KMD,
> so when the KMD is reloaded the KIQ crashes once its registers are
> programmed with values that differ from the previous configuration
> (settings such as the HQD address and ring size change easily if we
> alter sched_hw_submission).
>
> The fix is to deactivate the KIQ first, before touching any of its
> registers.
>
> Signed-off-by: Monk Liu 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> index db9f1e8..f571e25 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> @@ -6433,6 +6433,9 @@ static int gfx_v10_0_kiq_init_register(struct amdgpu_ring *ring)
>  struct v10_compute_mqd *mqd = ring->mqd_ptr;
>  int j;
>
> +/* deactivate the queue */
> +WREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE, 0);
> +
>  /* disable wptr polling */
>  WREG32_FIELD15(GC, 0, CP_PQ_WPTR_POLL_CNTL, EN, 0);
>


Re: [PATCH 1/2] drm/amdgpu: fix reload KMD hang on KIQ

2020-07-31 Thread Felix Kuehling
In gfx_v10_0_sw_fini the KIQ ring gets freed. Wouldn't that be the right
place to stop the KIQ? Otherwise KIQ will hang as soon as someone
allocates the memory that was previously used for the KIQ ring buffer
and overwrites it with something that's not a valid PM4 packet.

Regards,
  Felix

Am 2020-07-31 um 3:51 a.m. schrieb Monk Liu:
> KIQ will hang if we try the steps below:
> modprobe amdgpu
> rmmod amdgpu
> modprobe amdgpu sched_hw_submission=4
>
> The cause is that the KIQ stays alive even after we unload
> the KMD, so when the KMD is reloaded the KIQ crashes once
> its registers are programmed with values that differ from
> the previous configuration (settings such as the HQD
> address and ring size change easily if we alter
> sched_hw_submission).
>
> The fix is to deactivate the KIQ first, before touching
> any of its registers.
>
> Signed-off-by: Monk Liu 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> index db9f1e8..f571e25 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> @@ -6433,6 +6433,9 @@ static int gfx_v10_0_kiq_init_register(struct amdgpu_ring *ring)
>   struct v10_compute_mqd *mqd = ring->mqd_ptr;
>   int j;
>  
> + /* deactivate the queue */
> + WREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE, 0);
> +
>   /* disable wptr polling */
>   WREG32_FIELD15(GC, 0, CP_PQ_WPTR_POLL_CNTL, EN, 0);
>  
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
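
Editor's sketch of the ordering argument in the patch description: a queue that survives driver unload is still active at the next load, so reprogramming its ring registers with a different configuration is unsafe unless it is deactivated first. This is a toy model, not amdgpu code. `struct toy_queue`, `toy_program_ring()` and `toy_queue_init()` are hypothetical names, and the rule "changing the ring configuration while active means a hang" is an assumption that only mirrors the reported symptom, not documented hardware behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a hardware queue that stays alive across a driver unload.
 * All names are hypothetical; nothing here is the amdgpu register interface. */
struct toy_queue {
	uint32_t active;    /* bit 0: queue is running */
	uint64_t ring_base; /* currently programmed ring address */
	uint32_t ring_size; /* currently programmed ring size */
	int hung;           /* set when the model sees an unsafe reprogram */
};

/* Program the ring registers. The model flags a hang when the configuration
 * changes while the queue is still active (assumed rule, see lead-in). */
static void toy_program_ring(struct toy_queue *q, uint64_t base, uint32_t size)
{
	if ((q->active & 1) && (base != q->ring_base || size != q->ring_size))
		q->hung = 1;
	q->ring_base = base;
	q->ring_size = size;
}

/* Driver (re)load path: optionally deactivate first, then program the ring,
 * then activate the queue again. */
static void toy_queue_init(struct toy_queue *q, uint64_t base, uint32_t size,
			   int deactivate_first)
{
	if (deactivate_first)
		q->active = 0;
	toy_program_ring(q, base, size);
	q->active = 1;
}
```

Calling `toy_queue_init()` twice with different ring sizes stands in for the `modprobe` / `rmmod` / `modprobe sched_hw_submission=4` sequence: the second call sets `hung` only when `deactivate_first` is 0.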