Re: [PATCH v2] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

2023-02-02 Thread Guilherme G. Piccoli
On 02/02/2023 08:58, Christian König wrote:
> [...]
>> +if (!ring->no_scheduler && ring->sched.ops)
>>  drm_sched_fini(&ring->sched);
> 
> I think we should drop the check for no_scheduler here and just call
> drm_sched_fini() when the scheduler instance was initialized before.
> 
> Background is that I've found a couple of places where we potentially
> set no_scheduler to true after the scheduler was already initialized.
> 
> This then probably leads to a memory leak or worse.
> 
> Regards,
> Christian.

Thanks Christian, nice finding! And thanks for the reviews Guchun / Luben =)

I just submitted a V3 [0], but didn't add the review tags from Guchun /
Luben, since the patch changed.
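
For reference, a minimal sketch of what the V3 change presumably boils down
to, assuming it follows your suggestion of keying the teardown solely off
whether the scheduler instance was initialized (the actual V3 is at [0]):

	/*
	 * Sketch only - not the literal V3 hunk. The idea is to rely on
	 * sched.ops, which is set once drm_sched_init() ran for this ring,
	 * instead of the no_scheduler flag, which may change after
	 * initialization.
	 */
	if (ring->sched.ops)
		drm_sched_fini(&ring->sched);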

Cheers,


Guilherme


[0]
https://lore.kernel.org/dri-devel/20230202134856.1382169-1-gpicc...@igalia.com/


Re: [PATCH v2] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

2023-02-02 Thread Christian König

Am 01.02.23 um 17:48 schrieb Guilherme G. Piccoli:

Currently amdgpu calls drm_sched_fini() from the fence driver sw fini
routine - that function is expected to be called only after the
respective init function - drm_sched_init() - has executed successfully.

It happens that we faced a driver probe failure on the Steam Deck
recently, and drm_sched_fini() was called even though its counterpart
had not been called beforehand, causing the following oops:

amdgpu: probe of :04:00.0 failed with error -110
BUG: kernel NULL pointer dereference, address: 0090
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338
Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
[...]
Call Trace:
  
  amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
  amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
  devm_drm_dev_init_release+0x49/0x70
  [...]

To prevent that, check if the drm_sched was properly initialized for a
given ring before calling its fini counterpart.

Note that ideally we'd use sched.ready for that; this field is set as the
last step of drm_sched_init(). But amdgpu seems to "override" the meaning
of this field - in the above oops, for example, a GFX ring caused the
crash, and its sched.ready field had been set to true in the ring init
routine, regardless of the state of the DRM scheduler. Hence, we ended up
using sched.ops, as per Christian's suggestion [0].

[0] 
https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb...@amd.com/

Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)")
Suggested-by: Christian König 
Cc: Guchun Chen 
Cc: Luben Tuikov 
Cc: Mario Limonciello 
Signed-off-by: Guilherme G. Piccoli 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++-
  1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 00444203220d..3b962cb680a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
if (!ring || !ring->fence_drv.initialized)
continue;
  
-		if (!ring->no_scheduler)
+   /*
+* Notice we check for sched.ops since there's some
+* override on the meaning of sched.ready by amdgpu.
+* The natural check would be sched.ready, which is
+* set as drm_sched_init() finishes...
+*/
+   if (!ring->no_scheduler && ring->sched.ops)
drm_sched_fini(&ring->sched);


I think we should drop the check for no_scheduler here and just call
drm_sched_fini() when the scheduler instance was initialized before.

Background is that I've found a couple of places where we potentially
set no_scheduler to true after the scheduler was already initialized.

This then probably leads to a memory leak or worse.

Regards,
Christian.

  
  		for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)




RE: [PATCH v2] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

2023-02-01 Thread Chen, Guchun
Reviewed-by: Guchun Chen 

Regards,
Guchun

-Original Message-
From: Guilherme G. Piccoli  
Sent: Thursday, February 2, 2023 12:48 AM
To: amd-...@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org; Deucher, Alexander 
; Koenig, Christian ; Pan, 
Xinhui ; ker...@gpiccoli.net; kernel-...@igalia.com; 
Guilherme G. Piccoli ; Chen, Guchun ; 
Tuikov, Luben ; Limonciello, Mario 

Subject: [PATCH v2] drm/amdgpu/fence: Fix oops due to non-matching drm_sched 
init/fini

Currently amdgpu calls drm_sched_fini() from the fence driver sw fini routine -
that function is expected to be called only after the respective init function
- drm_sched_init() - has executed successfully.

It happens that we faced a driver probe failure on the Steam Deck recently, and
drm_sched_fini() was called even though its counterpart had not been called
beforehand, causing the following oops:

amdgpu: probe of :04:00.0 failed with error -110
BUG: kernel NULL pointer dereference, address: 0090
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338 
Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
[...]
Call Trace:
 
 amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
 amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
 amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
 devm_drm_dev_init_release+0x49/0x70
 [...]

To prevent that, check if the drm_sched was properly initialized for a given
ring before calling its fini counterpart.

Note that ideally we'd use sched.ready for that; this field is set as the
last step of drm_sched_init(). But amdgpu seems to "override" the meaning of
this field - in the above oops, for example, a GFX ring caused the crash, and
its sched.ready field had been set to true in the ring init routine,
regardless of the state of the DRM scheduler. Hence, we ended up using
sched.ops, as per Christian's suggestion [0].

[0] 
https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb...@amd.com/

Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)")
Suggested-by: Christian König 
Cc: Guchun Chen 
Cc: Luben Tuikov 
Cc: Mario Limonciello 
Signed-off-by: Guilherme G. Piccoli 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 00444203220d..3b962cb680a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
if (!ring || !ring->fence_drv.initialized)
continue;
 
-   if (!ring->no_scheduler)
+   /*
+* Notice we check for sched.ops since there's some
+* override on the meaning of sched.ready by amdgpu.
+* The natural check would be sched.ready, which is
+* set as drm_sched_init() finishes...
+*/
+   if (!ring->no_scheduler && ring->sched.ops)
drm_sched_fini(&ring->sched);
 
for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
--
2.39.0



Re: [PATCH v2] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

2023-02-01 Thread Luben Tuikov
Reviewed-by: Luben Tuikov 

Regards,
Luben

On 2023-02-01 11:48, Guilherme G. Piccoli wrote:
> Currently amdgpu calls drm_sched_fini() from the fence driver sw fini
> routine - that function is expected to be called only after the
> respective init function - drm_sched_init() - has executed successfully.
>
> It happens that we faced a driver probe failure on the Steam Deck
> recently, and drm_sched_fini() was called even though its counterpart
> had not been called beforehand, causing the following oops:
>
> amdgpu: probe of :04:00.0 failed with error -110
> BUG: kernel NULL pointer dereference, address: 0090
> PGD 0 P4D 0
> Oops: 0002 [#1] PREEMPT SMP NOPTI
> CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338
> Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
> RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
> [...]
> Call Trace:
>  
>  amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
>  amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
>  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
>  devm_drm_dev_init_release+0x49/0x70
>  [...]
>
> To prevent that, check if the drm_sched was properly initialized for a
> given ring before calling its fini counterpart.
>
> Note that ideally we'd use sched.ready for that; this field is set as the
> last step of drm_sched_init(). But amdgpu seems to "override" the meaning of
> this field - in the above oops, for example, a GFX ring caused the crash,
> and its sched.ready field had been set to true in the ring init routine,
> regardless of the state of the DRM scheduler. Hence, we ended up using
> sched.ops, as per Christian's suggestion [0].
>
> [0] 
> https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb...@amd.com/
>
> Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)")
> Suggested-by: Christian König 
> Cc: Guchun Chen 
> Cc: Luben Tuikov 
> Cc: Mario Limonciello 
> Signed-off-by: Guilherme G. Piccoli 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 00444203220d..3b962cb680a6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>   if (!ring || !ring->fence_drv.initialized)
>   continue;
>  
> - if (!ring->no_scheduler)
> + /*
> +  * Notice we check for sched.ops since there's some
> +  * override on the meaning of sched.ready by amdgpu.
> +  * The natural check would be sched.ready, which is
> +  * set as drm_sched_init() finishes...
> +  */
> + if (!ring->no_scheduler && ring->sched.ops)
>   drm_sched_fini(&ring->sched);
>  
>   for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)