Re: Radeon regression in 6.6 kernel

2023-12-12 Thread Alex Deucher
On Mon, Dec 11, 2023 at 7:28 PM Phillip Susi  wrote:
>
> Phillip Susi  writes:
>
> > And it works, but 6.7-rc5 does not, even though it includes that patch.
> > Here's the syslog from the attempt.  I'll start bisecting again.
>
> I checked out the patch that got merged upstream and it also fails.  I
> looked at the two commits, and I see what happened.  Your original patch
> MOVED the call to amdgpu_ttm_set_buffer_funcs_status().  The one that
> got merged into Linus' tree instead DUPLICATES the call.  Oops?
>

Yeah, I messed up the patch somehow when I applied it.  Fix up already
queued up to add the missing chunk.

Alex


Re: Radeon regression in 6.6 kernel

2023-12-11 Thread Phillip Susi
Phillip Susi  writes:

> And it works, but 6.7-rc5 does not, even though it includes that patch.
> Here's the syslog from the attempt.  I'll start bisecting again.

I checked out the patch that got merged upstream and it also fails.  I
looked at the two commits, and I see what happened.  Your original patch
MOVED the call to amdgpu_ttm_set_buffer_funcs_status().  The one that
got merged into Linus' tree instead DUPLICATES the call.  Oops?



Re: Radeon regression in 6.6 kernel

2023-12-04 Thread Alex Deucher
On Sun, Dec 3, 2023 at 3:40 PM Phillip Susi  wrote:
>
> Alex Deucher  writes:
>
> > Phillip,
> >
> > Can you test this patch?  I was not able to repro the issue on the
> > navi2x card I had handy, but I think it should fix it.
>
> I pulled -rc4 and it still had the problem.  I applied this patch, and
> yes, it fixed it!
>

Great!  I'll include it this week.

Alex


Re: Radeon regression in 6.6 kernel

2023-12-03 Thread Phillip Susi
Alex Deucher  writes:

> Phillip,
>
> Can you test this patch?  I was not able to repro the issue on the
> navi2x card I had handy, but I think it should fix it.

I pulled -rc4 and it still had the problem.  I applied this patch, and
yes, it fixed it!



Re: Radeon regression in 6.6 kernel

2023-12-01 Thread Alex Deucher
Phillip,

Can you test this patch?  I was not able to repro the issue on the
navi2x card I had handy, but I think it should fix it.

Thanks,

Alex

On Wed, Nov 29, 2023 at 3:49 PM Alex Deucher  wrote:
>
> On Wed, Nov 29, 2023 at 3:10 PM Alex Deucher  wrote:
> >
> > Actually I think I see the problem.  I'll try and send out a patch
> > later today to test.
>
> Does the attached patch fix it?
>
> Alex
>
> >
> > Alex
> >
> > On Wed, Nov 29, 2023 at 1:52 PM Alex Deucher  wrote:
> > >
> > > On Wed, Nov 29, 2023 at 11:41 AM Luben Tuikov  wrote:
> > > >
> > > > On 2023-11-29 10:22, Alex Deucher wrote:
> > > > > On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher  
> > > > > wrote:
> > > > >>
> > > > >> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov  
> > > > >> wrote:
> > > > >>>
> > > > >>> On 2023-11-28 17:13, Alex Deucher wrote:
> > > >  On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  
> > > >  wrote:
> > > > >
> > > > > Alex Deucher  writes:
> > > > >
> > > > >>> In that case those are the already known problems with the 
> > > > >>> scheduler
> > > > >>> changes, aren't they?
> > > > >>
> > > > >> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe 
> > > > >> I'm
> > > > >> misunderstanding what the original report was actually testing.  
> > > > >> If it
> > > > >> was 6.7, then try reverting:
> > > > >> 56e449603f0ac580700621a356d35d5716a62ce5
> > > > >> b70438004a14f4d0f9890b3297cd66248728546c
> > > > >
> > > > > At some point it was suggested that I file a gitlab issue, but I 
> > > > > took
> > > > > this to mean it was already known and being worked on.  -rc3 came 
> > > > > out
> > > > > today and still has the problem.  Is there a known issue I could 
> > > > > track?
> > > > >
> > > > 
> > > >  At this point, unless there are any objections, I think we should 
> > > >  just
> > > >  revert the two patches
> > > > >>> Uhm, no.
> > > > >>>
> > > > >>> Why "the two" patches?
> > > > >>>
> > > > >>> This email, part of this thread,
> > > > >>>
> > > > >>> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/
> > > > >>>
> > > > >>> clearly states that reverting *only* this commit,
> > > > >>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable 
> > > > >>> number of run-queues
> > > > >>> *does not* mitigate the failed suspend. (Furthermore, this commit 
> > > > >>> doesn't really change
> > > > >>> anything operational, other than using an allocated array, instead 
> > > > >>> of a static one, in DRM,
> > > > >>> while the 2nd patch is solely contained within the amdgpu driver 
> > > > >>> code.)
> > > > >>>
> > > > >>> Leaving us with only this change,
> > > > >>> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level
> > > > >>> to be at fault, as the kernel log attached in the linked email 
> > > > >>> above shows.
> > > > >>>
> > > > >>> The conclusion is that only b70438004a14f4 needs reverting.
> > > > >>
> > > > >> b70438004a14f4 was a fix for 56e449603f0ac5.  Without b70438004a14f4,
> > > > >> 56e449603f0ac5 breaks amdgpu.
> > > > >
> > > > > We can try and re-enable it in the next kernel.  I'm just not sure
> > > > > we'll be able to fix this in time for 6.7 with the holidays and all
> > > > > and I don't want to cause a lot of scheduler churn at the end of the
> > > > > 6.7 cycle if we hold off and try and fix it.  Reverting seems like the
> > > > > best short term solution.
> > > >
> > > > A lot of subsequent code has come in since commit 56e449603f0ac5, as it 
> > > > opened
> > > > the opportunity for a 1-to-1 relationship between an entity and a 
> > > > scheduler.
> > > > (Should've always been the case, from the outset. Not sure why it was 
> > > > coded as
> > > > a fixed-size array.)
> > > >
> > > > Given that commit 56e449603f0ac5 has nothing to do with amdgpu, and the 
> > > > problem
> > > > is wholly contained in amdgpu, and no other driver has this problem, 
> > > > there is
> > > > no reason to have to "churn", i.e. go back and forth in DRM, only to 
> > > > cover up
> > > > an init bug in amdgpu. See the response I just sent in @this thread:
> > > > https://lore.kernel.org/r/05007cb0-871e-4dc7-af58-1351f4ba4...@gmail.com
> > > >
> > > > And it's not like this issue is unknown. I first posted about it on 
> > > > 2023-10-16.
> > > >
> > > > Ideally, amdgpu would just fix their init code.
> > >
> > > You can't make changes to core code that break other drivers.
> > > Arguably 56e449603f0ac5 should not have gone in in the first place if
> > > it broke amdgpu.  b70438004a14f4 was the code to fix amdgpu's init
> > > code, but as a side effect it seems to have broken suspend for some
> > > users.
> > >
> > > Alex


Re: Radeon regression in 6.6 kernel

2023-11-30 Thread Alex Deucher
On Wed, Nov 29, 2023 at 10:47 PM Luben Tuikov  wrote:
>
> On 2023-11-29 22:36, Luben Tuikov wrote:
> > On 2023-11-29 15:49, Alex Deucher wrote:
> >> On Wed, Nov 29, 2023 at 3:10 PM Alex Deucher  wrote:
> >>>
> >>> Actually I think I see the problem.  I'll try and send out a patch
> >>> later today to test.
> >>
> >> Does the attached patch fix it?
> >
> > Thanks for the patch, Alex.
> >
> > Is it possible for AMD to also reproduce this issue and test this patch on 
> > a Navi23 system?
> >
> >> From 96e75b5218f7a124eafa53853681eef8fe567ab8 Mon Sep 17 00:00:00 2001
> >> From: Alex Deucher 
> >> Date: Wed, 29 Nov 2023 15:44:25 -0500
> >> Subject: [PATCH] drm/amdgpu: fix buffer funcs setting order on suspend
> >>
> >> We need to make disable this after the last eviction
> >
> > "make disable" --> "disable"
> >
> >> call, but before we disable the SDMA IP.
> >>
> >> Fixes: b70438004a14 ("drm/amdgpu: move buffer funcs setting up a level")
> >> Link: 
> >> https://lists.freedesktop.org/archives/amd-gfx/2023-November/101197.html
> >
> > Link: https://lore.kernel.org/r/87edgv4x3i@vps.thesusis.net
> >
> > Let's link the start of the thread.
> >
> > Regards,
> > Luben
> >
> >> Signed-off-by: Alex Deucher 
> >> Cc: Phillip Susi 
> >> Cc: Luben Tuikov 
> >> ---
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
> >>  1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> index b5edf40b5d03..78553e027db4 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> @@ -4531,8 +4531,6 @@ int amdgpu_device_suspend(struct drm_device *dev, 
> >> bool fbcon)
> >>
> >>  amdgpu_ras_suspend(adev);
> >>
> >> -amdgpu_ttm_set_buffer_funcs_status(adev, false);
> >> -
> >>  amdgpu_device_ip_suspend_phase1(adev);
> >>
> >>  if (!adev->in_s0ix)
> >> @@ -4542,6 +4540,8 @@ int amdgpu_device_suspend(struct drm_device *dev, 
> >> bool fbcon)
> >>  if (r)
> >>  return r;
> >>
> >> +amdgpu_ttm_set_buffer_funcs_status(adev, false);
> >> +
>
> If you're moving this past phase 1, there's another instance in 
> amdgpu_device_ip_suspend(),
> which may need to be moved down.

I think that one should be ok since we don't do any evictions in
amdgpu_device_ip_suspend().

Alex

>
> Regards,
> Luben
>
> >>  amdgpu_fence_driver_hw_fini(adev);
> >>
> >>  amdgpu_device_ip_suspend_phase2(adev);
> >
> >>
> >> Alex
> >>
> >>>
> >>> Alex
> >>>
> >>> On Wed, Nov 29, 2023 at 1:52 PM Alex Deucher  
> >>> wrote:
> 
>  On Wed, Nov 29, 2023 at 11:41 AM Luben Tuikov  
>  wrote:
> >
> > On 2023-11-29 10:22, Alex Deucher wrote:
> >> On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher  
> >> wrote:
> >>>
> >>> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov  
> >>> wrote:
> 
>  On 2023-11-28 17:13, Alex Deucher wrote:
> > On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  
> > wrote:
> >>
> >> Alex Deucher  writes:
> >>
>  In that case those are the already known problems with the 
>  scheduler
>  changes, aren't they?
> >>>
> >>> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe 
> >>> I'm
> >>> misunderstanding what the original report was actually testing.  
> >>> If it
> >>> was 6.7, then try reverting:
> >>> 56e449603f0ac580700621a356d35d5716a62ce5
> >>> b70438004a14f4d0f9890b3297cd66248728546c
> >>
> >> At some point it was suggested that I file a gitlab issue, but I 
> >> took
> >> this to mean it was already known and being worked on.  -rc3 came 
> >> out
> >> today and still has the problem.  Is there a known issue I could 
> >> track?
> >>
> >
> > At this point, unless there are any objections, I think we should 
> > just
> > revert the two patches
>  Uhm, no.
> 
>  Why "the two" patches?
> 
>  This email, part of this thread,
> 
>  https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/
> 
>  clearly states that reverting *only* this commit,
>  56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable 
>  number of run-queues
>  *does not* mitigate the failed suspend. (Furthermore, this commit 
>  doesn't really change
>  anything operational, other than using an allocated array, instead 
>  of a static one, in DRM,
>  while the 2nd patch is solely contained within the amdgpu driver 
>  code.)
> 
>  Leaving us with only this change,
>  b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level
>  to be at fault, as the kernel log attached in the linked email above 

Re: Radeon regression in 6.6 kernel

2023-11-30 Thread Alex Deucher
On Wed, Nov 29, 2023 at 10:36 PM Luben Tuikov  wrote:
>
> On 2023-11-29 15:49, Alex Deucher wrote:
> > On Wed, Nov 29, 2023 at 3:10 PM Alex Deucher  wrote:
> >>
> >> Actually I think I see the problem.  I'll try and send out a patch
> >> later today to test.
> >
> > Does the attached patch fix it?
>
> Thanks for the patch, Alex.
>
> Is it possible for AMD to also reproduce this issue and test this patch on a 
> Navi23 system?

I haven't had a chance to dig into it much due to LPC and thanksgiving
and other end of year stuff.

>
> > From 96e75b5218f7a124eafa53853681eef8fe567ab8 Mon Sep 17 00:00:00 2001
> > From: Alex Deucher 
> > Date: Wed, 29 Nov 2023 15:44:25 -0500
> > Subject: [PATCH] drm/amdgpu: fix buffer funcs setting order on suspend
> >
> > We need to make disable this after the last eviction
>
> "make disable" --> "disable"
>
> > call, but before we disable the SDMA IP.
> >
> > Fixes: b70438004a14 ("drm/amdgpu: move buffer funcs setting up a level")
> > Link: 
> > https://lists.freedesktop.org/archives/amd-gfx/2023-November/101197.html
>
> Link: https://lore.kernel.org/r/87edgv4x3i@vps.thesusis.net
>
> Let's link the start of the thread.

Thanks, I will update the patch.

Alex

>
> Regards,
> Luben
>
> > Signed-off-by: Alex Deucher 
> > Cc: Phillip Susi 
> > Cc: Luben Tuikov 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index b5edf40b5d03..78553e027db4 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -4531,8 +4531,6 @@ int amdgpu_device_suspend(struct drm_device *dev, 
> > bool fbcon)
> >
> >   amdgpu_ras_suspend(adev);
> >
> > - amdgpu_ttm_set_buffer_funcs_status(adev, false);
> > -
> >   amdgpu_device_ip_suspend_phase1(adev);
> >
> >   if (!adev->in_s0ix)
> > @@ -4542,6 +4540,8 @@ int amdgpu_device_suspend(struct drm_device *dev, 
> > bool fbcon)
> >   if (r)
> >   return r;
> >
> > + amdgpu_ttm_set_buffer_funcs_status(adev, false);
> > +
> >   amdgpu_fence_driver_hw_fini(adev);
> >
> >   amdgpu_device_ip_suspend_phase2(adev);
>
> >
> > Alex
> >
> >>
> >> Alex
> >>
> >> On Wed, Nov 29, 2023 at 1:52 PM Alex Deucher  wrote:
> >>>
> >>> On Wed, Nov 29, 2023 at 11:41 AM Luben Tuikov  wrote:
> 
>  On 2023-11-29 10:22, Alex Deucher wrote:
> > On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher  
> > wrote:
> >>
> >> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov  
> >> wrote:
> >>>
> >>> On 2023-11-28 17:13, Alex Deucher wrote:
>  On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  
>  wrote:
> >
> > Alex Deucher  writes:
> >
> >>> In that case those are the already known problems with the 
> >>> scheduler
> >>> changes, aren't they?
> >>
> >> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
> >> misunderstanding what the original report was actually testing.  
> >> If it
> >> was 6.7, then try reverting:
> >> 56e449603f0ac580700621a356d35d5716a62ce5
> >> b70438004a14f4d0f9890b3297cd66248728546c
> >
> > At some point it was suggested that I file a gitlab issue, but I 
> > took
> > this to mean it was already known and being worked on.  -rc3 came 
> > out
> > today and still has the problem.  Is there a known issue I could 
> > track?
> >
> 
>  At this point, unless there are any objections, I think we should 
>  just
>  revert the two patches
> >>> Uhm, no.
> >>>
> >>> Why "the two" patches?
> >>>
> >>> This email, part of this thread,
> >>>
> >>> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/
> >>>
> >>> clearly states that reverting *only* this commit,
> >>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable 
> >>> number of run-queues
> >>> *does not* mitigate the failed suspend. (Furthermore, this commit 
> >>> doesn't really change
> >>> anything operational, other than using an allocated array, instead of 
> >>> a static one, in DRM,
> >>> while the 2nd patch is solely contained within the amdgpu driver 
> >>> code.)
> >>>
> >>> Leaving us with only this change,
> >>> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level
> >>> to be at fault, as the kernel log attached in the linked email above 
> >>> shows.
> >>>
> >>> The conclusion is that only b70438004a14f4 needs reverting.
> >>
> >> b70438004a14f4 was a fix for 56e449603f0ac5.  Without b70438004a14f4,
> >> 56e449603f0ac5 breaks amdgpu.
> >
> > We can try and re-enable it in the next kernel.  I'm just not sure
> > we'll be able to 

Re: Radeon regression in 6.6 kernel

2023-11-29 Thread Luben Tuikov
On 2023-11-29 22:36, Luben Tuikov wrote:
> On 2023-11-29 15:49, Alex Deucher wrote:
>> On Wed, Nov 29, 2023 at 3:10 PM Alex Deucher  wrote:
>>>
>>> Actually I think I see the problem.  I'll try and send out a patch
>>> later today to test.
>>
>> Does the attached patch fix it?
> 
> Thanks for the patch, Alex.
> 
> Is it possible for AMD to also reproduce this issue and test this patch on a 
> Navi23 system?
> 
>> From 96e75b5218f7a124eafa53853681eef8fe567ab8 Mon Sep 17 00:00:00 2001
>> From: Alex Deucher 
>> Date: Wed, 29 Nov 2023 15:44:25 -0500
>> Subject: [PATCH] drm/amdgpu: fix buffer funcs setting order on suspend
>>
>> We need to make disable this after the last eviction
> 
> "make disable" --> "disable"
> 
>> call, but before we disable the SDMA IP.
>>
>> Fixes: b70438004a14 ("drm/amdgpu: move buffer funcs setting up a level")
>> Link: 
>> https://lists.freedesktop.org/archives/amd-gfx/2023-November/101197.html
> 
> Link: https://lore.kernel.org/r/87edgv4x3i@vps.thesusis.net
> 
> Let's link the start of the thread.
> 
> Regards,
> Luben
> 
>> Signed-off-by: Alex Deucher 
>> Cc: Phillip Susi 
>> Cc: Luben Tuikov 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index b5edf40b5d03..78553e027db4 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4531,8 +4531,6 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
>> fbcon)
>>  
>>  amdgpu_ras_suspend(adev);
>>  
>> -amdgpu_ttm_set_buffer_funcs_status(adev, false);
>> -
>>  amdgpu_device_ip_suspend_phase1(adev);
>>  
>>  if (!adev->in_s0ix)
>> @@ -4542,6 +4540,8 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
>> fbcon)
>>  if (r)
>>  return r;
>>  
>> +amdgpu_ttm_set_buffer_funcs_status(adev, false);
>> +

If you're moving this past phase 1, there's another instance in 
amdgpu_device_ip_suspend(),
which may need to be moved down.

Regards,
Luben

>>  amdgpu_fence_driver_hw_fini(adev);
>>  
>>  amdgpu_device_ip_suspend_phase2(adev);
> 
>>
>> Alex
>>
>>>
>>> Alex
>>>
>>> On Wed, Nov 29, 2023 at 1:52 PM Alex Deucher  wrote:

 On Wed, Nov 29, 2023 at 11:41 AM Luben Tuikov  wrote:
>
> On 2023-11-29 10:22, Alex Deucher wrote:
>> On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher  
>> wrote:
>>>
>>> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov  
>>> wrote:

 On 2023-11-28 17:13, Alex Deucher wrote:
> On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  
> wrote:
>>
>> Alex Deucher  writes:
>>
 In that case those are the already known problems with the 
 scheduler
 changes, aren't they?
>>>
>>> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
>>> misunderstanding what the original report was actually testing.  If 
>>> it
>>> was 6.7, then try reverting:
>>> 56e449603f0ac580700621a356d35d5716a62ce5
>>> b70438004a14f4d0f9890b3297cd66248728546c
>>
>> At some point it was suggested that I file a gitlab issue, but I took
>> this to mean it was already known and being worked on.  -rc3 came out
>> today and still has the problem.  Is there a known issue I could 
>> track?
>>
>
> At this point, unless there are any objections, I think we should just
> revert the two patches
 Uhm, no.

 Why "the two" patches?

 This email, part of this thread,

 https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/

 clearly states that reverting *only* this commit,
 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number 
 of run-queues
 *does not* mitigate the failed suspend. (Furthermore, this commit 
 doesn't really change
 anything operational, other than using an allocated array, instead of 
 a static one, in DRM,
 while the 2nd patch is solely contained within the amdgpu driver code.)

 Leaving us with only this change,
 b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level
 to be at fault, as the kernel log attached in the linked email above 
 shows.

 The conclusion is that only b70438004a14f4 needs reverting.
>>>
>>> b70438004a14f4 was a fix for 56e449603f0ac5.  Without b70438004a14f4,
>>> 56e449603f0ac5 breaks amdgpu.
>>
>> We can try and re-enable it in the next kernel.  I'm just not sure
>> we'll be able to fix this in time for 6.7 with the holidays and all
>> and I don't want to cause a lot of scheduler churn at the end of the
>> 6.7 cycle if we hold off and try and 

Re: Radeon regression in 6.6 kernel

2023-11-29 Thread Luben Tuikov
On 2023-11-29 15:49, Alex Deucher wrote:
> On Wed, Nov 29, 2023 at 3:10 PM Alex Deucher  wrote:
>>
>> Actually I think I see the problem.  I'll try and send out a patch
>> later today to test.
> 
> Does the attached patch fix it?

Thanks for the patch, Alex.

Is it possible for AMD to also reproduce this issue and test this patch on a 
Navi23 system?

> From 96e75b5218f7a124eafa53853681eef8fe567ab8 Mon Sep 17 00:00:00 2001
> From: Alex Deucher 
> Date: Wed, 29 Nov 2023 15:44:25 -0500
> Subject: [PATCH] drm/amdgpu: fix buffer funcs setting order on suspend
> 
> We need to make disable this after the last eviction

"make disable" --> "disable"

> call, but before we disable the SDMA IP.
> 
> Fixes: b70438004a14 ("drm/amdgpu: move buffer funcs setting up a level")
> Link: https://lists.freedesktop.org/archives/amd-gfx/2023-November/101197.html

Link: https://lore.kernel.org/r/87edgv4x3i@vps.thesusis.net

Let's link the start of the thread.

Regards,
Luben

> Signed-off-by: Alex Deucher 
> Cc: Phillip Susi 
> Cc: Luben Tuikov 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index b5edf40b5d03..78553e027db4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4531,8 +4531,6 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
> fbcon)
>  
>   amdgpu_ras_suspend(adev);
>  
> - amdgpu_ttm_set_buffer_funcs_status(adev, false);
> -
>   amdgpu_device_ip_suspend_phase1(adev);
>  
>   if (!adev->in_s0ix)
> @@ -4542,6 +4540,8 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
> fbcon)
>   if (r)
>   return r;
>  
> + amdgpu_ttm_set_buffer_funcs_status(adev, false);
> +
>   amdgpu_fence_driver_hw_fini(adev);
>  
>   amdgpu_device_ip_suspend_phase2(adev);

> 
> Alex
> 
>>
>> Alex
>>
>> On Wed, Nov 29, 2023 at 1:52 PM Alex Deucher  wrote:
>>>
>>> On Wed, Nov 29, 2023 at 11:41 AM Luben Tuikov  wrote:

 On 2023-11-29 10:22, Alex Deucher wrote:
> On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher  
> wrote:
>>
>> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov  
>> wrote:
>>>
>>> On 2023-11-28 17:13, Alex Deucher wrote:
 On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  
 wrote:
>
> Alex Deucher  writes:
>
>>> In that case those are the already known problems with the scheduler
>>> changes, aren't they?
>>
>> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
>> misunderstanding what the original report was actually testing.  If 
>> it
>> was 6.7, then try reverting:
>> 56e449603f0ac580700621a356d35d5716a62ce5
>> b70438004a14f4d0f9890b3297cd66248728546c
>
> At some point it was suggested that I file a gitlab issue, but I took
> this to mean it was already known and being worked on.  -rc3 came out
> today and still has the problem.  Is there a known issue I could 
> track?
>

 At this point, unless there are any objections, I think we should just
 revert the two patches
>>> Uhm, no.
>>>
>>> Why "the two" patches?
>>>
>>> This email, part of this thread,
>>>
>>> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/
>>>
>>> clearly states that reverting *only* this commit,
>>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number 
>>> of run-queues
>>> *does not* mitigate the failed suspend. (Furthermore, this commit 
>>> doesn't really change
>>> anything operational, other than using an allocated array, instead of a 
>>> static one, in DRM,
>>> while the 2nd patch is solely contained within the amdgpu driver code.)
>>>
>>> Leaving us with only this change,
>>> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level
>>> to be at fault, as the kernel log attached in the linked email above 
>>> shows.
>>>
>>> The conclusion is that only b70438004a14f4 needs reverting.
>>
>> b70438004a14f4 was a fix for 56e449603f0ac5.  Without b70438004a14f4,
>> 56e449603f0ac5 breaks amdgpu.
>
> We can try and re-enable it in the next kernel.  I'm just not sure
> we'll be able to fix this in time for 6.7 with the holidays and all
> and I don't want to cause a lot of scheduler churn at the end of the
> 6.7 cycle if we hold off and try and fix it.  Reverting seems like the
> best short term solution.

 A lot of subsequent code has come in since commit 56e449603f0ac5, as it 
 opened
 the opportunity for a 1-to-1 relationship between an entity and a 
 scheduler.
 (Should've always been the case, from the outset. Not sure why it was 
 coded as
 a 

Re: Radeon regression in 6.6 kernel

2023-11-29 Thread Alex Deucher
On Wed, Nov 29, 2023 at 3:10 PM Alex Deucher  wrote:
>
> Actually I think I see the problem.  I'll try and send out a patch
> later today to test.

Does the attached patch fix it?

Alex

>
> Alex
>
> On Wed, Nov 29, 2023 at 1:52 PM Alex Deucher  wrote:
> >
> > On Wed, Nov 29, 2023 at 11:41 AM Luben Tuikov  wrote:
> > >
> > > On 2023-11-29 10:22, Alex Deucher wrote:
> > > > On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher  
> > > > wrote:
> > > >>
> > > >> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov  
> > > >> wrote:
> > > >>>
> > > >>> On 2023-11-28 17:13, Alex Deucher wrote:
> > >  On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  
> > >  wrote:
> > > >
> > > > Alex Deucher  writes:
> > > >
> > > >>> In that case those are the already known problems with the 
> > > >>> scheduler
> > > >>> changes, aren't they?
> > > >>
> > > >> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
> > > >> misunderstanding what the original report was actually testing.  
> > > >> If it
> > > >> was 6.7, then try reverting:
> > > >> 56e449603f0ac580700621a356d35d5716a62ce5
> > > >> b70438004a14f4d0f9890b3297cd66248728546c
> > > >
> > > > At some point it was suggested that I file a gitlab issue, but I 
> > > > took
> > > > this to mean it was already known and being worked on.  -rc3 came 
> > > > out
> > > > today and still has the problem.  Is there a known issue I could 
> > > > track?
> > > >
> > > 
> > >  At this point, unless there are any objections, I think we should 
> > >  just
> > >  revert the two patches
> > > >>> Uhm, no.
> > > >>>
> > > >>> Why "the two" patches?
> > > >>>
> > > >>> This email, part of this thread,
> > > >>>
> > > >>> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/
> > > >>>
> > > >>> clearly states that reverting *only* this commit,
> > > >>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable 
> > > >>> number of run-queues
> > > >>> *does not* mitigate the failed suspend. (Furthermore, this commit 
> > > >>> doesn't really change
> > > >>> anything operational, other than using an allocated array, instead of 
> > > >>> a static one, in DRM,
> > > >>> while the 2nd patch is solely contained within the amdgpu driver 
> > > >>> code.)
> > > >>>
> > > >>> Leaving us with only this change,
> > > >>> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level
> > > >>> to be at fault, as the kernel log attached in the linked email above 
> > > >>> shows.
> > > >>>
> > > >>> The conclusion is that only b70438004a14f4 needs reverting.
> > > >>
> > > >> b70438004a14f4 was a fix for 56e449603f0ac5.  Without b70438004a14f4,
> > > >> 56e449603f0ac5 breaks amdgpu.
> > > >
> > > > We can try and re-enable it in the next kernel.  I'm just not sure
> > > > we'll be able to fix this in time for 6.7 with the holidays and all
> > > > and I don't want to cause a lot of scheduler churn at the end of the
> > > > 6.7 cycle if we hold off and try and fix it.  Reverting seems like the
> > > > best short term solution.
> > >
> > > A lot of subsequent code has come in since commit 56e449603f0ac5, as it 
> > > opened
> > > the opportunity for a 1-to-1 relationship between an entity and a 
> > > scheduler.
> > > (Should've always been the case, from the outset. Not sure why it was 
> > > coded as
> > > a fixed-size array.)
> > >
> > > Given that commit 56e449603f0ac5 has nothing to do with amdgpu, and the 
> > > problem
> > > is wholly contained in amdgpu, and no other driver has this problem, 
> > > there is
> > > no reason to have to "churn", i.e. go back and forth in DRM, only to 
> > > cover up
> > > an init bug in amdgpu. See the response I just sent in @this thread:
> > > https://lore.kernel.org/r/05007cb0-871e-4dc7-af58-1351f4ba4...@gmail.com
> > >
> > > And it's not like this issue is unknown. I first posted about it on 
> > > 2023-10-16.
> > >
> > > Ideally, amdgpu would just fix their init code.
> >
> > You can't make changes to core code that break other drivers.
> > Arguably 56e449603f0ac5 should not have gone in in the first place if
> > it broke amdgpu.  b70438004a14f4 was the code to fix amdgpu's init
> > code, but as a side effect it seems to have broken suspend for some
> > users.
> >
> > Alex
From 96e75b5218f7a124eafa53853681eef8fe567ab8 Mon Sep 17 00:00:00 2001
From: Alex Deucher 
Date: Wed, 29 Nov 2023 15:44:25 -0500
Subject: [PATCH] drm/amdgpu: fix buffer funcs setting order on suspend

We need to make disable this after the last eviction
call, but before we disable the SDMA IP.

Fixes: b70438004a14 ("drm/amdgpu: move buffer funcs setting up a level")
Link: https://lists.freedesktop.org/archives/amd-gfx/2023-November/101197.html
Signed-off-by: Alex Deucher 
Cc: Phillip Susi 
Cc: Luben Tuikov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 

Re: Radeon regression in 6.6 kernel

2023-11-29 Thread Phillip Susi
Luben Tuikov  writes:

> I remember that the problem was really that amdgpu called 
> drm_sched_entity_init(),
> in amdgpu_ttm_set_buffer_funcs_status() without actually having initialized 
> the scheduler
> used therein. For instance, the code before commit b70438004a14f4, looked 
> like this:
>



>   sched = >sched; <-- LT: No 
> one has initialized this scheduler
>   r = drm_sched_entity_init(>mman.entity, <-- Oopses, 
> now that sched->sched_rq is not



> Before commit 56e449603f0ac5, amdgpu was getting away with this, because the 
> sched->sched_rq
> was a static array.
>
> Ideally, amdgpu code would be fixed.

This sounds like an initilization problem that resulted in an OOPS at
boot time, but I don't remember seeing that.  I just get a failure on
system suspend.


Re: Radeon regression in 6.6 kernel

2023-11-29 Thread Alex Deucher
Actually I think I see the problem.  I'll try and send out a patch
later today to test.

Alex

On Wed, Nov 29, 2023 at 1:52 PM Alex Deucher  wrote:
>
> On Wed, Nov 29, 2023 at 11:41 AM Luben Tuikov  wrote:
> >
> > On 2023-11-29 10:22, Alex Deucher wrote:
> > > On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher  
> > > wrote:
> > >>
> > >> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov  
> > >> wrote:
> > >>>
> > >>> On 2023-11-28 17:13, Alex Deucher wrote:
> >  On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  
> >  wrote:
> > >
> > > Alex Deucher  writes:
> > >
> > >>> In that case those are the already known problems with the scheduler
> > >>> changes, aren't they?
> > >>
> > >> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
> > >> misunderstanding what the original report was actually testing.  If 
> > >> it
> > >> was 6.7, then try reverting:
> > >> 56e449603f0ac580700621a356d35d5716a62ce5
> > >> b70438004a14f4d0f9890b3297cd66248728546c
> > >
> > > At some point it was suggested that I file a gitlab issue, but I took
> > > this to mean it was already known and being worked on.  -rc3 came out
> > > today and still has the problem.  Is there a known issue I could 
> > > track?
> > >
> > 
> >  At this point, unless there are any objections, I think we should just
> >  revert the two patches
> > >>> Uhm, no.
> > >>>
> > >>> Why "the two" patches?
> > >>>
> > >>> This email, part of this thread,
> > >>>
> > >>> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/
> > >>>
> > >>> clearly states that reverting *only* this commit,
> > >>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number 
> > >>> of run-queues
> > >>> *does not* mitigate the failed suspend. (Furthermore, this commit 
> > >>> doesn't really change
> > >>> anything operational, other than using an allocated array, instead of a 
> > >>> static one, in DRM,
> > >>> while the 2nd patch is solely contained within the amdgpu driver code.)
> > >>>
> > >>> Leaving us with only this change,
> > >>> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level
> > >>> to be at fault, as the kernel log attached in the linked email above 
> > >>> shows.
> > >>>
> > >>> The conclusion is that only b70438004a14f4 needs reverting.
> > >>
> > >> b70438004a14f4 was a fix for 56e449603f0ac5.  Without b70438004a14f4,
> > >> 56e449603f0ac5 breaks amdgpu.
> > >
> > > We can try and re-enable it in the next kernel.  I'm just not sure
> > > we'll be able to fix this in time for 6.7 with the holidays and all
> > > and I don't want to cause a lot of scheduler churn at the end of the
> > > 6.7 cycle if we hold off and try and fix it.  Reverting seems like the
> > > best short term solution.
> >
> > A lot of subsequent code has come in since commit 56e449603f0ac5, as it 
> > opened
> > the opportunity for a 1-to-1 relationship between an entity and a scheduler.
> > (Should've always been the case, from the outset. Not sure why it was coded 
> > as
> > a fixed-size array.)
> >
> > Given that commit 56e449603f0ac5 has nothing to do with amdgpu, and the 
> > problem
> > is wholly contained in amdgpu, and no other driver has this problem, there 
> > is
> > no reason to have to "churn", i.e. go back and forth in DRM, only to cover 
> > up
> > an init bug in amdgpu. See the response I just sent in @this thread:
> > https://lore.kernel.org/r/05007cb0-871e-4dc7-af58-1351f4ba4...@gmail.com
> >
> > And it's not like this issue is unknown. I first posted about it on 
> > 2023-10-16.
> >
> > Ideally, amdgpu would just fix their init code.
>
> You can't make changes to core code that break other drivers.
> Arguably 56e449603f0ac5 should not have gone in in the first place if
> it broke amdgpu.  b70438004a14f4 was the code to fix amdgpu's init
> code, but as a side effect it seems to have broken suspend for some
> users.
>
> Alex


Re: Radeon regression in 6.6 kernel

2023-11-29 Thread Alex Deucher
On Wed, Nov 29, 2023 at 11:41 AM Luben Tuikov  wrote:
>
> On 2023-11-29 10:22, Alex Deucher wrote:
> > On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher  wrote:
> >>
> >> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov  wrote:
> >>>
> >>> On 2023-11-28 17:13, Alex Deucher wrote:
>  On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  wrote:
> >
> > Alex Deucher  writes:
> >
> >>> In that case those are the already known problems with the scheduler
> >>> changes, aren't they?
> >>
> >> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
> >> misunderstanding what the original report was actually testing.  If it
> >> was 6.7, then try reverting:
> >> 56e449603f0ac580700621a356d35d5716a62ce5
> >> b70438004a14f4d0f9890b3297cd66248728546c
> >
> > At some point it was suggested that I file a gitlab issue, but I took
> > this to mean it was already known and being worked on.  -rc3 came out
> > today and still has the problem.  Is there a known issue I could track?
> >
> 
>  At this point, unless there are any objections, I think we should just
>  revert the two patches
> >>> Uhm, no.
> >>>
> >>> Why "the two" patches?
> >>>
> >>> This email, part of this thread,
> >>>
> >>> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/
> >>>
> >>> clearly states that reverting *only* this commit,
> >>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of 
> >>> run-queues
> >>> *does not* mitigate the failed suspend. (Furthermore, this commit doesn't 
> >>> really change
> >>> anything operational, other than using an allocated array, instead of a 
> >>> static one, in DRM,
> >>> while the 2nd patch is solely contained within the amdgpu driver code.)
> >>>
> >>> Leaving us with only this change,
> >>> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level
> >>> to be at fault, as the kernel log attached in the linked email above 
> >>> shows.
> >>>
> >>> The conclusion is that only b70438004a14f4 needs reverting.
> >>
> >> b70438004a14f4 was a fix for 56e449603f0ac5.  Without b70438004a14f4,
> >> 56e449603f0ac5 breaks amdgpu.
> >
> > We can try and re-enable it in the next kernel.  I'm just not sure
> > we'll be able to fix this in time for 6.7 with the holidays and all
> > and I don't want to cause a lot of scheduler churn at the end of the
> > 6.7 cycle if we hold off and try and fix it.  Reverting seems like the
> > best short term solution.
>
> A lot of subsequent code has come in since commit 56e449603f0ac5, as it opened
> the opportunity for a 1-to-1 relationship between an entity and a scheduler.
> (Should've always been the case, from the outset. Not sure why it was coded as
> a fixed-size array.)
>
> Given that commit 56e449603f0ac5 has nothing to do with amdgpu, and the 
> problem
> is wholly contained in amdgpu, and no other driver has this problem, there is
> no reason to have to "churn", i.e. go back and forth in DRM, only to cover up
> an init bug in amdgpu. See the response I just sent in @this thread:
> https://lore.kernel.org/r/05007cb0-871e-4dc7-af58-1351f4ba4...@gmail.com
>
> And it's not like this issue is unknown. I first posted about it on 
> 2023-10-16.
>
> Ideally, amdgpu would just fix their init code.

You can't make changes to core code that break other drivers.
Arguably 56e449603f0ac5 should not have gone in in the first place if
it broke amdgpu.  b70438004a14f4 was the code to fix amdgpu's init
code, but as a side effect it seems to have broken suspend for some
users.

Alex


Re: Radeon regression in 6.6 kernel

2023-11-29 Thread Alex Deucher
On Wed, Nov 29, 2023 at 11:21 AM Luben Tuikov  wrote:
>
> On 2023-11-29 08:50, Alex Deucher wrote:
> > On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov  wrote:
> >>
> >> On 2023-11-28 17:13, Alex Deucher wrote:
> >>> On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  wrote:
> 
>  Alex Deucher  writes:
> 
> >> In that case those are the already known problems with the scheduler
> >> changes, aren't they?
> >
> > Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
> > misunderstanding what the original report was actually testing.  If it
> > was 6.7, then try reverting:
> > 56e449603f0ac580700621a356d35d5716a62ce5
> > b70438004a14f4d0f9890b3297cd66248728546c
> 
>  At some point it was suggested that I file a gitlab issue, but I took
>  this to mean it was already known and being worked on.  -rc3 came out
>  today and still has the problem.  Is there a known issue I could track?
> 
> >>>
> >>> At this point, unless there are any objections, I think we should just
> >>> revert the two patches
> >> Uhm, no.
> >>
> >> Why "the two" patches?
> >>
> >> This email, part of this thread,
> >>
> >> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/
> >>
> >> clearly states that reverting *only* this commit,
> >> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of 
> >> run-queues
> >> *does not* mitigate the failed suspend. (Furthermore, this commit doesn't 
> >> really change
> >> anything operational, other than using an allocated array, instead of a 
> >> static one, in DRM,
> >> while the 2nd patch is solely contained within the amdgpu driver code.)
> >>
> >> Leaving us with only this change,
> >> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level
> >> to be at fault, as the kernel log attached in the linked email above shows.
> >>
> >> The conclusion is that only b70438004a14f4 needs reverting.
> >
> > b70438004a14f4 was a fix for 56e449603f0ac5.  Without b70438004a14f4,
> > 56e449603f0ac5 breaks amdgpu.
>
> It doesn't "break" it, amdgpu just needs to be fixed.
>
> I know we put in a Fixes tag in
> b70438004a14f4 "drm/amdgpu: move buffer funcs setting up a level"
> pointing to 56e449603f0ac5 "drm/sched: Convert the GPU scheduler to variable 
> number of run-queues",
> but given the testing Phillip has done, the culprit is wholly contained in
> the amdgpu driver code.
>
> No other driver has this problem since commit 56e449603f0ac5.
>
> The Fixes tag in b70438004a14f4 "drm/amdgpu: move buffer funcs setting up a 
> level" should've ideally
> pointed to an amdgpu-driver code commit only (perhaps an old-old commit), and 
> I was a bit uncomfortable
> putting in a Fixes tag which pointed to drm code, but we did it so that the 
> amdgpu commit follows
> the changes in DRM. In retrospect, the Fixes tag should've pointed to and 
> amdgpu-driver commit when
> that the amdgpu code was originally written.
>
> I remember that the problem was really that amdgpu called 
> drm_sched_entity_init(),
> in amdgpu_ttm_set_buffer_funcs_status() without actually having initialized 
> the scheduler
> used therein. For instance, the code before commit b70438004a14f4, looked 
> like this:
>
> void amdgpu_ttm_set_buffer_funcs_status(struct amdgpu_device *adev, bool 
> enable)
> {
> struct ttm_resource_manager *man = ttm_manager_type(>mman.bdev, 
> TTM_PL_VRAM);
> uint64_t size;
> int r;
>
> if (!adev->mman.initialized || amdgpu_in_reset(adev) ||
> adev->mman.buffer_funcs_enabled == enable)
> return;
>
> if (enable) {
> struct amdgpu_ring *ring;
> struct drm_gpu_scheduler *sched;
>
> ring = adev->mman.buffer_funcs_ring;
> sched = >sched; <-- LT: No 
> one has initialized this scheduler
> r = drm_sched_entity_init(>mman.entity, <-- Oopses, 
> now that sched->sched_rq is not a static array
>   DRM_SCHED_PRIORITY_KERNEL, ,
>   1, NULL);
> if (r) {
> DRM_ERROR("Failed setting up TTM BO move entity 
> (%d)\n",
>   r);
> return;
> }
>
>
> Before commit 56e449603f0ac5, amdgpu was getting away with this, because the 
> sched->sched_rq
> was a static array.
>
> Ideally, amdgpu code would be fixed.

b70438004a14f4 was the amdgpu fix for this, but it appears to break
suspend for some users.  I'm not confident we can fix it in time for
6.7 final.

Alex


Re: Radeon regression in 6.6 kernel

2023-11-29 Thread Luben Tuikov
On 2023-11-29 10:22, Alex Deucher wrote:
> On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher  wrote:
>>
>> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov  wrote:
>>>
>>> On 2023-11-28 17:13, Alex Deucher wrote:
 On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  wrote:
>
> Alex Deucher  writes:
>
>>> In that case those are the already known problems with the scheduler
>>> changes, aren't they?
>>
>> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
>> misunderstanding what the original report was actually testing.  If it
>> was 6.7, then try reverting:
>> 56e449603f0ac580700621a356d35d5716a62ce5
>> b70438004a14f4d0f9890b3297cd66248728546c
>
> At some point it was suggested that I file a gitlab issue, but I took
> this to mean it was already known and being worked on.  -rc3 came out
> today and still has the problem.  Is there a known issue I could track?
>

 At this point, unless there are any objections, I think we should just
 revert the two patches
>>> Uhm, no.
>>>
>>> Why "the two" patches?
>>>
>>> This email, part of this thread,
>>>
>>> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/
>>>
>>> clearly states that reverting *only* this commit,
>>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of 
>>> run-queues
>>> *does not* mitigate the failed suspend. (Furthermore, this commit doesn't 
>>> really change
>>> anything operational, other than using an allocated array, instead of a 
>>> static one, in DRM,
>>> while the 2nd patch is solely contained within the amdgpu driver code.)
>>>
>>> Leaving us with only this change,
>>> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level
>>> to be at fault, as the kernel log attached in the linked email above shows.
>>>
>>> The conclusion is that only b70438004a14f4 needs reverting.
>>
>> b70438004a14f4 was a fix for 56e449603f0ac5.  Without b70438004a14f4,
>> 56e449603f0ac5 breaks amdgpu.
> 
> We can try and re-enable it in the next kernel.  I'm just not sure
> we'll be able to fix this in time for 6.7 with the holidays and all
> and I don't want to cause a lot of scheduler churn at the end of the
> 6.7 cycle if we hold off and try and fix it.  Reverting seems like the
> best short term solution.

A lot of subsequent code has come in since commit 56e449603f0ac5, as it opened
the opportunity for a 1-to-1 relationship between an entity and a scheduler.
(Should've always been the case, from the outset. Not sure why it was coded as
a fixed-size array.)

Given that commit 56e449603f0ac5 has nothing to do with amdgpu, and the problem
is wholly contained in amdgpu, and no other driver has this problem, there is
no reason to have to "churn", i.e. go back and forth in DRM, only to cover up
an init bug in amdgpu. See the response I just sent in @this thread:
https://lore.kernel.org/r/05007cb0-871e-4dc7-af58-1351f4ba4...@gmail.com

And it's not like this issue is unknown. I first posted about it on 2023-10-16. 

Ideally, amdgpu would just fix their init code.
-- 
Regards,
Luben


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: Radeon regression in 6.6 kernel

2023-11-29 Thread Luben Tuikov
On 2023-11-29 08:50, Alex Deucher wrote:
> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov  wrote:
>>
>> On 2023-11-28 17:13, Alex Deucher wrote:
>>> On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  wrote:

 Alex Deucher  writes:

>> In that case those are the already known problems with the scheduler
>> changes, aren't they?
>
> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
> misunderstanding what the original report was actually testing.  If it
> was 6.7, then try reverting:
> 56e449603f0ac580700621a356d35d5716a62ce5
> b70438004a14f4d0f9890b3297cd66248728546c

 At some point it was suggested that I file a gitlab issue, but I took
 this to mean it was already known and being worked on.  -rc3 came out
 today and still has the problem.  Is there a known issue I could track?

>>>
>>> At this point, unless there are any objections, I think we should just
>>> revert the two patches
>> Uhm, no.
>>
>> Why "the two" patches?
>>
>> This email, part of this thread,
>>
>> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/
>>
>> clearly states that reverting *only* this commit,
>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of 
>> run-queues
>> *does not* mitigate the failed suspend. (Furthermore, this commit doesn't 
>> really change
>> anything operational, other than using an allocated array, instead of a 
>> static one, in DRM,
>> while the 2nd patch is solely contained within the amdgpu driver code.)
>>
>> Leaving us with only this change,
>> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level
>> to be at fault, as the kernel log attached in the linked email above shows.
>>
>> The conclusion is that only b70438004a14f4 needs reverting.
> 
> b70438004a14f4 was a fix for 56e449603f0ac5.  Without b70438004a14f4,
> 56e449603f0ac5 breaks amdgpu.

It doesn't "break" it, amdgpu just needs to be fixed.

I know we put in a Fixes tag in 
b70438004a14f4 "drm/amdgpu: move buffer funcs setting up a level"
pointing to 56e449603f0ac5 "drm/sched: Convert the GPU scheduler to variable 
number of run-queues",
but given the testing Phillip has done, the culprit is wholly contained in
the amdgpu driver code.

No other driver has this problem since commit 56e449603f0ac5.

The Fixes tag in b70438004a14f4 "drm/amdgpu: move buffer funcs setting up a 
level" should've ideally
pointed to an amdgpu-driver code commit only (perhaps an old-old commit), and I 
was a bit uncomfortable
putting in a Fixes tag which pointed to drm code, but we did it so that the 
amdgpu commit follows
the changes in DRM. In retrospect, the Fixes tag should've pointed to and 
amdgpu-driver commit when
that the amdgpu code was originally written.

I remember that the problem was really that amdgpu called 
drm_sched_entity_init(),
in amdgpu_ttm_set_buffer_funcs_status() without actually having initialized the 
scheduler
used therein. For instance, the code before commit b70438004a14f4, looked like 
this:

void amdgpu_ttm_set_buffer_funcs_status(struct amdgpu_device *adev, bool enable)
{
struct ttm_resource_manager *man = ttm_manager_type(>mman.bdev, 
TTM_PL_VRAM);
uint64_t size;
int r;

if (!adev->mman.initialized || amdgpu_in_reset(adev) ||
adev->mman.buffer_funcs_enabled == enable)
return;

if (enable) {
struct amdgpu_ring *ring;
struct drm_gpu_scheduler *sched;

ring = adev->mman.buffer_funcs_ring;
sched = >sched; <-- LT: No 
one has initialized this scheduler
r = drm_sched_entity_init(>mman.entity, <-- Oopses, 
now that sched->sched_rq is not a static array
  DRM_SCHED_PRIORITY_KERNEL, ,
  1, NULL);
if (r) {
DRM_ERROR("Failed setting up TTM BO move entity (%d)\n",
  r);
return;
}


Before commit 56e449603f0ac5, amdgpu was getting away with this, because the 
sched->sched_rq
was a static array.

Ideally, amdgpu code would be fixed.
-- 
Regards,
Luben


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: Radeon regression in 6.6 kernel

2023-11-29 Thread Alex Deucher
On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher  wrote:
>
> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov  wrote:
> >
> > On 2023-11-28 17:13, Alex Deucher wrote:
> > > On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  wrote:
> > >>
> > >> Alex Deucher  writes:
> > >>
> >  In that case those are the already known problems with the scheduler
> >  changes, aren't they?
> > >>>
> > >>> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
> > >>> misunderstanding what the original report was actually testing.  If it
> > >>> was 6.7, then try reverting:
> > >>> 56e449603f0ac580700621a356d35d5716a62ce5
> > >>> b70438004a14f4d0f9890b3297cd66248728546c
> > >>
> > >> At some point it was suggested that I file a gitlab issue, but I took
> > >> this to mean it was already known and being worked on.  -rc3 came out
> > >> today and still has the problem.  Is there a known issue I could track?
> > >>
> > >
> > > At this point, unless there are any objections, I think we should just
> > > revert the two patches
> > Uhm, no.
> >
> > Why "the two" patches?
> >
> > This email, part of this thread,
> >
> > https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/
> >
> > clearly states that reverting *only* this commit,
> > 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of 
> > run-queues
> > *does not* mitigate the failed suspend. (Furthermore, this commit doesn't 
> > really change
> > anything operational, other than using an allocated array, instead of a 
> > static one, in DRM,
> > while the 2nd patch is solely contained within the amdgpu driver code.)
> >
> > Leaving us with only this change,
> > b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level
> > to be at fault, as the kernel log attached in the linked email above shows.
> >
> > The conclusion is that only b70438004a14f4 needs reverting.
>
> b70438004a14f4 was a fix for 56e449603f0ac5.  Without b70438004a14f4,
> 56e449603f0ac5 breaks amdgpu.

We can try and re-enable it in the next kernel.  I'm just not sure
we'll be able to fix this in time for 6.7 with the holidays and all
and I don't want to cause a lot of scheduler churn at the end of the
6.7 cycle if we hold off and try and fix it.  Reverting seems like the
best short term solution.

Alex

>
> Alex


Re: Radeon regression in 6.6 kernel

2023-11-29 Thread Alex Deucher
On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov  wrote:
>
> On 2023-11-28 17:13, Alex Deucher wrote:
> > On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  wrote:
> >>
> >> Alex Deucher  writes:
> >>
>  In that case those are the already known problems with the scheduler
>  changes, aren't they?
> >>>
> >>> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
> >>> misunderstanding what the original report was actually testing.  If it
> >>> was 6.7, then try reverting:
> >>> 56e449603f0ac580700621a356d35d5716a62ce5
> >>> b70438004a14f4d0f9890b3297cd66248728546c
> >>
> >> At some point it was suggested that I file a gitlab issue, but I took
> >> this to mean it was already known and being worked on.  -rc3 came out
> >> today and still has the problem.  Is there a known issue I could track?
> >>
> >
> > At this point, unless there are any objections, I think we should just
> > revert the two patches
> Uhm, no.
>
> Why "the two" patches?
>
> This email, part of this thread,
>
> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/
>
> clearly states that reverting *only* this commit,
> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of 
> run-queues
> *does not* mitigate the failed suspend. (Furthermore, this commit doesn't 
> really change
> anything operational, other than using an allocated array, instead of a 
> static one, in DRM,
> while the 2nd patch is solely contained within the amdgpu driver code.)
>
> Leaving us with only this change,
> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level
> to be at fault, as the kernel log attached in the linked email above shows.
>
> The conclusion is that only b70438004a14f4 needs reverting.

b70438004a14f4 was a fix for 56e449603f0ac5.  Without b70438004a14f4,
56e449603f0ac5 breaks amdgpu.

Alex


Re: Radeon regression in 6.6 kernel

2023-11-28 Thread Luben Tuikov
On 2023-11-28 17:13, Alex Deucher wrote:
> On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  wrote:
>>
>> Alex Deucher  writes:
>>
 In that case those are the already known problems with the scheduler
 changes, aren't they?
>>>
>>> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
>>> misunderstanding what the original report was actually testing.  If it
>>> was 6.7, then try reverting:
>>> 56e449603f0ac580700621a356d35d5716a62ce5
>>> b70438004a14f4d0f9890b3297cd66248728546c
>>
>> At some point it was suggested that I file a gitlab issue, but I took
>> this to mean it was already known and being worked on.  -rc3 came out
>> today and still has the problem.  Is there a known issue I could track?
>>
> 
> At this point, unless there are any objections, I think we should just
> revert the two patches
Uhm, no.

Why "the two" patches?

This email, part of this thread,

https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/

clearly states that reverting *only* this commit,
56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of 
run-queues
*does not* mitigate the failed suspend. (Furthermore, this commit doesn't 
really change
anything operational, other than using an allocated array, instead of a static 
one, in DRM,
while the 2nd patch is solely contained within the amdgpu driver code.)

Leaving us with only this change,
b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level
to be at fault, as the kernel log attached in the linked email above shows.

The conclusion is that only b70438004a14f4 needs reverting.
-- 
Regards,
Luben


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: Radeon regression in 6.6 kernel

2023-11-28 Thread Alex Deucher
On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  wrote:
>
> Alex Deucher  writes:
>
> >> In that case those are the already known problems with the scheduler
> >> changes, aren't they?
> >
> > Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
> > misunderstanding what the original report was actually testing.  If it
> > was 6.7, then try reverting:
> > 56e449603f0ac580700621a356d35d5716a62ce5
> > b70438004a14f4d0f9890b3297cd66248728546c
>
> At some point it was suggested that I file a gitlab issue, but I took
> this to mean it was already known and being worked on.  -rc3 came out
> today and still has the problem.  Is there a known issue I could track?
>

At this point, unless there are any objections, I think we should just
revert the two patches.

Alex


Re: Radeon regression in 6.6 kernel

2023-11-27 Thread Phillip Susi
Alex Deucher  writes:

>> In that case those are the already known problems with the scheduler
>> changes, aren't they?
>
> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
> misunderstanding what the original report was actually testing.  If it
> was 6.7, then try reverting:
> 56e449603f0ac580700621a356d35d5716a62ce5
> b70438004a14f4d0f9890b3297cd66248728546c

At some point it was suggested that I file a gitlab issue, but I took
this to mean it was already known and being worked on.  -rc3 came out
today and still has the problem.  Is there a known issue I could track?



Re: Radeon regression in 6.6 kernel

2023-11-22 Thread Luben Tuikov
On 2023-11-21 17:05, Phillip Susi wrote:
> Alex Deucher  writes:
> 
>> Does reverting 56e449603f0ac580700621a356d35d5716a62ce5 alone fix it?
>> Can you also attach your full dmesg log for the failed suspend?
> 
> No, it doesn't.  Here is the full syslog from the boot with only that
> revert:
> 

Thank you Phillip for verifying this.

BTW, luben.tui...@amd.com should absolutely bounce for everyone sending emails 
to it. Not sure why it is still active.
My new email is the one this email is coming from.
-- 
Regards,
Luben


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: Radeon regression in 6.6 kernel

2023-11-21 Thread Alex Deucher
On Mon, Nov 20, 2023 at 5:40 PM Phillip Susi  wrote:
>
> Alex Deucher  writes:
>
> > Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
> > misunderstanding what the original report was actually testing.  If it
> > was 6.7, then try reverting:
> > 56e449603f0ac580700621a356d35d5716a62ce5
> > b70438004a14f4d0f9890b3297cd66248728546c
>
> I had been running v6.6-rc5 before pulling.  It looks like that got me
> somewhere between v6.6 and v6.7-rc1.  Reverting those two commits fixes
> it.

Does reverting 56e449603f0ac580700621a356d35d5716a62ce5 alone fix it?
Can you also attach your full dmesg log for the failed suspend?

Alex


Re: Radeon regression in 6.6 kernel

2023-11-20 Thread Phillip Susi
Alex Deucher  writes:

> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
> misunderstanding what the original report was actually testing.  If it
> was 6.7, then try reverting:
> 56e449603f0ac580700621a356d35d5716a62ce5
> b70438004a14f4d0f9890b3297cd66248728546c

I had been running v6.6-rc5 before pulling.  It looks like that got me
somewhere between v6.6 and v6.7-rc1.  Reverting those two commits fixes
it.



Re: Radeon regression in 6.6 kernel

2023-11-20 Thread Phillip Susi
Christian König  writes:

> Well none of the commits mentioned can affect radeon in any way. Radeon 
> simply doesn't use the scheduler.
>
> My suspicion is that the user is actually using amdgpu instead of 
> radeon. The switch potentially occurred accidentally, for example by 
> compiling amdgpu support for SI/CIK.

Indeed, the lspci I originally posted does indicate amdgpu.  What is the
difference and should I switch it?  If so, how?

> Those amdgpu problems for older ASIC have already been worked on and 
> should be fixed by now.

I just pulled v6.7-rc2 and it's still broken.  I'll see if I can revert
those 3 patches.


Re: Radeon regression in 6.6 kernel

2023-11-20 Thread Alex Deucher
On Mon, Nov 20, 2023 at 11:24 AM Christian König
 wrote:
>
> Am 20.11.23 um 17:08 schrieb Alex Deucher:
> > On Mon, Nov 20, 2023 at 10:57 AM Christian König
> >  wrote:
> >> Am 19.11.23 um 07:47 schrieb Dave Airlie:
>  On 12.11.23 01:46, Phillip Susi wrote:
> > I had been testing some things on a post 6.6-rc5 kernel for a week or
> > two and then when I pulled to a post 6.6 release kernel, I found that
> > system suspend was broken.  It seems that the radeon driver failed to
> > suspend, leaving the display dead, the wayland display server hung, and
> > the system still running.  I have been trying to bisect it for the last
> > few days and have only been able to narrow it down to the following 3
> > commits:
> >
> > There are only 'skip'ped commits left to test.
> > The first bad commit could be any of:
> > 56e449603f0ac580700621a356d35d5716a62ce5
> > c07bf1636f0005f9eb7956404490672286ea59d3
> > b70438004a14f4d0f9890b3297cd66248728546c
> > We cannot bisect more!
>  Hmm, not a single reply from the amdgpu folks. Wondering how we can
>  encourage them to look into this.
> 
>  Phillip, reporting issues by mail should still work, but you might have
>  more luck here, as that's where the amdgpu afaics prefer to track bugs:
>  https://gitlab.freedesktop.org/drm/amd/-/issues
> 
>  When you file an issue there, please mention it here.
> 
>  Furthermore it might help if you could verify if 6.7-rc1 (or rc2, which
>  comes out later today) or 6.6.2-rc1 improve things.
> >>> It would also be good to test if reverting any of these is possible or 
> >>> not.
> >> Well none of the commits mentioned can affect radeon in any way. Radeon
> >> simply doesn't use the scheduler.
> >>
> >> My suspicion is that the user is actually using amdgpu instead of
> >> radeon. The switch potentially occurred accidentally, for example by
> >> compiling amdgpu support for SI/CIK.
> >>
> >> Those amdgpu problems for older ASIC have already been worked on and
> >> should be fixed by now.
> > In this case it's a navi23 (so radeon in the marketing sense).
>
> Thanks, couldn't find that in the mail thread.
>
> In that case those are the already known problems with the scheduler
> changes, aren't they?

Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
misunderstanding what the original report was actually testing.  If it
was 6.7, then try reverting:
56e449603f0ac580700621a356d35d5716a62ce5
b70438004a14f4d0f9890b3297cd66248728546c

Alex

>
> Christian.
>
> >
> > Alex
> >
> >> Regards,
> >> Christian.
> >>
> >>> File the gitlab issue and we should poke amd a but more to take a look.
> >>>
> >>> Dave.
>


Re: Radeon regression in 6.6 kernel

2023-11-20 Thread Christian König

Am 20.11.23 um 17:08 schrieb Alex Deucher:

On Mon, Nov 20, 2023 at 10:57 AM Christian König
 wrote:

Am 19.11.23 um 07:47 schrieb Dave Airlie:

On 12.11.23 01:46, Phillip Susi wrote:

I had been testing some things on a post 6.6-rc5 kernel for a week or
two and then when I pulled to a post 6.6 release kernel, I found that
system suspend was broken.  It seems that the radeon driver failed to
suspend, leaving the display dead, the wayland display server hung, and
the system still running.  I have been trying to bisect it for the last
few days and have only been able to narrow it down to the following 3
commits:

There are only 'skip'ped commits left to test.
The first bad commit could be any of:
56e449603f0ac580700621a356d35d5716a62ce5
c07bf1636f0005f9eb7956404490672286ea59d3
b70438004a14f4d0f9890b3297cd66248728546c
We cannot bisect more!

Hmm, not a single reply from the amdgpu folks. Wondering how we can
encourage them to look into this.

Phillip, reporting issues by mail should still work, but you might have
more luck here, as that's where the amdgpu afaics prefer to track bugs:
https://gitlab.freedesktop.org/drm/amd/-/issues

When you file an issue there, please mention it here.

Furthermore it might help if you could verify if 6.7-rc1 (or rc2, which
comes out later today) or 6.6.2-rc1 improve things.

It would also be good to test if reverting any of these is possible or not.

Well none of the commits mentioned can affect radeon in any way. Radeon
simply doesn't use the scheduler.

My suspicion is that the user is actually using amdgpu instead of
radeon. The switch potentially occurred accidentally, for example by
compiling amdgpu support for SI/CIK.

Those amdgpu problems for older ASIC have already been worked on and
should be fixed by now.

In this case it's a navi23 (so radeon in the marketing sense).


Thanks, couldn't find that in the mail thread.

In that case those are the already known problems with the scheduler 
changes, aren't they?


Christian.



Alex


Regards,
Christian.


File the gitlab issue and we should poke amd a but more to take a look.

Dave.




Re: Radeon regression in 6.6 kernel

2023-11-20 Thread Alex Deucher
On Mon, Nov 20, 2023 at 10:57 AM Christian König
 wrote:
>
> Am 19.11.23 um 07:47 schrieb Dave Airlie:
> >> On 12.11.23 01:46, Phillip Susi wrote:
> >>> I had been testing some things on a post 6.6-rc5 kernel for a week or
> >>> two and then when I pulled to a post 6.6 release kernel, I found that
> >>> system suspend was broken.  It seems that the radeon driver failed to
> >>> suspend, leaving the display dead, the wayland display server hung, and
> >>> the system still running.  I have been trying to bisect it for the last
> >>> few days and have only been able to narrow it down to the following 3
> >>> commits:
> >>>
> >>> There are only 'skip'ped commits left to test.
> >>> The first bad commit could be any of:
> >>> 56e449603f0ac580700621a356d35d5716a62ce5
> >>> c07bf1636f0005f9eb7956404490672286ea59d3
> >>> b70438004a14f4d0f9890b3297cd66248728546c
> >>> We cannot bisect more!
> >> Hmm, not a single reply from the amdgpu folks. Wondering how we can
> >> encourage them to look into this.
> >>
> >> Phillip, reporting issues by mail should still work, but you might have
> >> more luck here, as that's where the amdgpu afaics prefer to track bugs:
> >> https://gitlab.freedesktop.org/drm/amd/-/issues
> >>
> >> When you file an issue there, please mention it here.
> >>
> >> Furthermore it might help if you could verify if 6.7-rc1 (or rc2, which
> >> comes out later today) or 6.6.2-rc1 improve things.
> > It would also be good to test if reverting any of these is possible or not.
>
> Well none of the commits mentioned can affect radeon in any way. Radeon
> simply doesn't use the scheduler.
>
> My suspicion is that the user is actually using amdgpu instead of
> radeon. The switch potentially occurred accidentally, for example by
> compiling amdgpu support for SI/CIK.
>
> Those amdgpu problems for older ASIC have already been worked on and
> should be fixed by now.

In this case it's a navi23 (so radeon in the marketing sense).

Alex

>
> Regards,
> Christian.
>
> >
> > File the gitlab issue and we should poke amd a but more to take a look.
> >
> > Dave.
>


Re: Radeon regression in 6.6 kernel

2023-11-20 Thread Christian König

Am 19.11.23 um 07:47 schrieb Dave Airlie:

On 12.11.23 01:46, Phillip Susi wrote:

I had been testing some things on a post 6.6-rc5 kernel for a week or
two and then when I pulled to a post 6.6 release kernel, I found that
system suspend was broken.  It seems that the radeon driver failed to
suspend, leaving the display dead, the wayland display server hung, and
the system still running.  I have been trying to bisect it for the last
few days and have only been able to narrow it down to the following 3
commits:

There are only 'skip'ped commits left to test.
The first bad commit could be any of:
56e449603f0ac580700621a356d35d5716a62ce5
c07bf1636f0005f9eb7956404490672286ea59d3
b70438004a14f4d0f9890b3297cd66248728546c
We cannot bisect more!

Hmm, not a single reply from the amdgpu folks. Wondering how we can
encourage them to look into this.

Phillip, reporting issues by mail should still work, but you might have
more luck here, as that's where the amdgpu afaics prefer to track bugs:
https://gitlab.freedesktop.org/drm/amd/-/issues

When you file an issue there, please mention it here.

Furthermore it might help if you could verify if 6.7-rc1 (or rc2, which
comes out later today) or 6.6.2-rc1 improve things.

It would also be good to test if reverting any of these is possible or not.


Well none of the commits mentioned can affect radeon in any way. Radeon 
simply doesn't use the scheduler.


My suspicion is that the user is actually using amdgpu instead of 
radeon. The switch potentially occurred accidentally, for example by 
compiling amdgpu support for SI/CIK.


Those amdgpu problems for older ASIC have already been worked on and 
should be fixed by now.


Regards,
Christian.



File the gitlab issue and we should poke amd a but more to take a look.

Dave.




Re: Radeon regression in 6.6 kernel

2023-11-19 Thread Bagas Sanjaya
On 11/19/23 20:48, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 19.11.23 14:24, Bagas Sanjaya wrote:
>> Hi Dave,
>>
>> AFAIK commit c07bf1636f0005 ("MAINTAINERS: Update the GPU Scheduler email")
>> doesn't seem to do with this regression as it doesn't change any amdgpu code
>> that may introduce the regression.
> 
> Bagas, sorry for being blunt here, I know you mean well. But I feel the
> need to say the following in the open, as this otherwise falls back on
> me and regression tracking.
> 
> Stating the above is not very helpful, as Dave for sure will know.
> Telling Phillip that he likely can skip that commit might have been
> something different. But I guess even for most users that are able to do
> a bisection it's obvious and maybe not worth pointing out.
> 

I was scratching my itch then when replying. Thanks anyway.

-- 
An old man doll... just what I always wanted! - Clara



Re: Radeon regression in 6.6 kernel

2023-11-19 Thread Linux regression tracking (Thorsten Leemhuis)
On 19.11.23 14:24, Bagas Sanjaya wrote:
> On Sun, Nov 19, 2023 at 04:47:01PM +1000, Dave Airlie wrote:
>>> On 12.11.23 01:46, Phillip Susi wrote:
 I had been testing some things on a post 6.6-rc5 kernel for a week or
 two and then when I pulled to a post 6.6 release kernel, I found that
 system suspend was broken.  It seems that the radeon driver failed to
 suspend, leaving the display dead, the wayland display server hung, and
 the system still running.  I have been trying to bisect it for the last
 few days and have only been able to narrow it down to the following 3
 commits:

 There are only 'skip'ped commits left to test.
 The first bad commit could be any of:
 56e449603f0ac580700621a356d35d5716a62ce5
 c07bf1636f0005f9eb7956404490672286ea59d3
 b70438004a14f4d0f9890b3297cd66248728546c
 We cannot bisect more!
>>>
>>> Hmm, not a single reply from the amdgpu folks. Wondering how we can
>>> encourage them to look into this.
>>>
>>> Phillip, reporting issues by mail should still work, but you might have
>>> more luck here, as that's where the amdgpu afaics prefer to track bugs:
>>> https://gitlab.freedesktop.org/drm/amd/-/issues
>>>
>>> When you file an issue there, please mention it here.
>>>
>>> Furthermore it might help if you could verify if 6.7-rc1 (or rc2, which
>>> comes out later today) or 6.6.2-rc1 improve things.

BTW, ignore the "6.6.2-rc1" here, I misunderstood one detail earlier. Sorry.

>> It would also be good to test if reverting any of these is possible or not.

Good point, sorry, forgot to mention that.

> Hi Dave,
> 
> AFAIK commit c07bf1636f0005 ("MAINTAINERS: Update the GPU Scheduler email")
> doesn't seem to do with this regression as it doesn't change any amdgpu code
> that may introduce the regression.

Bagas, sorry for being blunt here, I know you mean well. But I feel the
need to say the following in the open, as this otherwise falls back on
me and regression tracking.

Stating the above is not very helpful, as Dave for sure will know.
Telling Phillip that he likely can skip that commit might have been
something different. But I guess even for most users that are able to do
a bisection it's obvious and maybe not worth pointing out.

Ciao, Thorsten


Re: Radeon regression in 6.6 kernel

2023-11-19 Thread Bagas Sanjaya
On Sun, Nov 19, 2023 at 04:47:01PM +1000, Dave Airlie wrote:
> >
> > On 12.11.23 01:46, Phillip Susi wrote:
> > > I had been testing some things on a post 6.6-rc5 kernel for a week or
> > > two and then when I pulled to a post 6.6 release kernel, I found that
> > > system suspend was broken.  It seems that the radeon driver failed to
> > > suspend, leaving the display dead, the wayland display server hung, and
> > > the system still running.  I have been trying to bisect it for the last
> > > few days and have only been able to narrow it down to the following 3
> > > commits:
> > >
> > > There are only 'skip'ped commits left to test.
> > > The first bad commit could be any of:
> > > 56e449603f0ac580700621a356d35d5716a62ce5
> > > c07bf1636f0005f9eb7956404490672286ea59d3
> > > b70438004a14f4d0f9890b3297cd66248728546c
> > > We cannot bisect more!
> >
> > Hmm, not a single reply from the amdgpu folks. Wondering how we can
> > encourage them to look into this.
> >
> > Phillip, reporting issues by mail should still work, but you might have
> > more luck here, as that's where the amdgpu afaics prefer to track bugs:
> > https://gitlab.freedesktop.org/drm/amd/-/issues
> >
> > When you file an issue there, please mention it here.
> >
> > Furthermore it might help if you could verify if 6.7-rc1 (or rc2, which
> > comes out later today) or 6.6.2-rc1 improve things.
> 
> It would also be good to test if reverting any of these is possible or not.
> 

Hi Dave,

AFAIK commit c07bf1636f0005 ("MAINTAINERS: Update the GPU Scheduler email")
doesn't seem to do with this regression as it doesn't change any amdgpu code
that may introduce the regression.

Thanks.

-- 
An old man doll... just what I always wanted! - Clara


signature.asc
Description: PGP signature


Re: Radeon regression in 6.6 kernel

2023-11-18 Thread Dave Airlie
>
> On 12.11.23 01:46, Phillip Susi wrote:
> > I had been testing some things on a post 6.6-rc5 kernel for a week or
> > two and then when I pulled to a post 6.6 release kernel, I found that
> > system suspend was broken.  It seems that the radeon driver failed to
> > suspend, leaving the display dead, the wayland display server hung, and
> > the system still running.  I have been trying to bisect it for the last
> > few days and have only been able to narrow it down to the following 3
> > commits:
> >
> > There are only 'skip'ped commits left to test.
> > The first bad commit could be any of:
> > 56e449603f0ac580700621a356d35d5716a62ce5
> > c07bf1636f0005f9eb7956404490672286ea59d3
> > b70438004a14f4d0f9890b3297cd66248728546c
> > We cannot bisect more!
>
> Hmm, not a single reply from the amdgpu folks. Wondering how we can
> encourage them to look into this.
>
> Phillip, reporting issues by mail should still work, but you might have
> more luck here, as that's where the amdgpu afaics prefer to track bugs:
> https://gitlab.freedesktop.org/drm/amd/-/issues
>
> When you file an issue there, please mention it here.
>
> Furthermore it might help if you could verify if 6.7-rc1 (or rc2, which
> comes out later today) or 6.6.2-rc1 improve things.

It would also be good to test if reverting any of these is possible or not.

File the gitlab issue and we should poke amd a but more to take a look.

Dave.


Re: Radeon regression in 6.6 kernel

2023-11-18 Thread Linux regression tracking (Thorsten Leemhuis)
Lo!

On 12.11.23 01:46, Phillip Susi wrote:
> I had been testing some things on a post 6.6-rc5 kernel for a week or
> two and then when I pulled to a post 6.6 release kernel, I found that
> system suspend was broken.  It seems that the radeon driver failed to
> suspend, leaving the display dead, the wayland display server hung, and
> the system still running.  I have been trying to bisect it for the last
> few days and have only been able to narrow it down to the following 3
> commits:
> 
> There are only 'skip'ped commits left to test.
> The first bad commit could be any of:
> 56e449603f0ac580700621a356d35d5716a62ce5
> c07bf1636f0005f9eb7956404490672286ea59d3
> b70438004a14f4d0f9890b3297cd66248728546c
> We cannot bisect more!

Hmm, not a single reply from the amdgpu folks. Wondering how we can
encourage them to look into this.

Phillip, reporting issues by mail should still work, but you might have
more luck here, as that's where the amdgpu afaics prefer to track bugs:
https://gitlab.freedesktop.org/drm/amd/-/issues

When you file an issue there, please mention it here.

Furthermore it might help if you could verify if 6.7-rc1 (or rc2, which
comes out later today) or 6.6.2-rc1 improve things.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke


> It appears that there was a late merge in the 6.6 window that originally
> forked from the -rc2, as many of the later commits that I bisected had
> that version number.
> 
> I couldn't get it more narrowed down because I had to skip the
> surrounding commits because they wouldn't even boot up to a gui desktop,
> let alone try to suspend.
> 
> When system suspend fails, I find the following in my syslog after I
> have to magic-sysrq reboot because the the display is dead:
> 
> Nov 11 18:44:39 faldara kernel: PM: suspend entry (deep)
> Nov 11 18:44:39 faldara kernel: Filesystems sync: 0.035 seconds
> Nov 11 18:44:40 faldara kernel: Freezing user space processes
> Nov 11 18:44:40 faldara kernel: Freezing user space processes completed 
> (elapsed 0.001 seconds)
> Nov 11 18:44:40 faldara kernel: OOM killer disabled.
> Nov 11 18:44:40 faldara kernel: Freezing remaining freezable tasks
> Nov 11 18:44:40 faldara kernel: Freezing remaining freezable tasks completed 
> (elapsed 0.001 seconds)
> Nov 11 18:44:40 faldara kernel: printk: Suspending console(s) (use 
> no_console_suspend to debug)
> Nov 11 18:44:40 faldara kernel: serial 00:01: disabled
> Nov 11 18:44:40 faldara kernel: e1000e: EEE TX LPI TIMER: 0011
> Nov 11 18:44:40 faldara kernel: sd 4:0:0:0: [sdb] Synchronizing SCSI cache
> Nov 11 18:44:40 faldara kernel: sd 1:0:0:0: [sda] Synchronizing SCSI cache
> Nov 11 18:44:40 faldara kernel: sd 5:0:0:0: [sdc] Synchronizing SCSI cache
> Nov 11 18:44:40 faldara kernel: sd 4:0:0:0: [sdb] Stopping disk
> Nov 11 18:44:40 faldara kernel: sd 1:0:0:0: [sda] Stopping disk
> Nov 11 18:44:40 faldara kernel: sd 5:0:0:0: [sdc] Stopping disk
> Nov 11 18:44:40 faldara kernel: amdgpu: Move buffer fallback to memcpy 
> unavailable
> Nov 11 18:44:40 faldara kernel: [TTM] Buffer eviction failed
> Nov 11 18:44:40 faldara kernel: [drm] evicting device resources failed
> Nov 11 18:44:40 faldara kernel: amdgpu :03:00.0: PM: pci_pm_suspend(): 
> amdgpu_pmops_suspend+0x0/0x80 [amdgpu] returns -19
> Nov 11 18:44:40 faldara kernel: amdgpu :03:00.0: PM: dpm_run_callback(): 
> pci_pm_suspend+0x0/0x170 returns -19
> Nov 11 18:44:40 faldara kernel: amdgpu :03:00.0: PM: failed to suspend 
> async: error -19
> Nov 11 18:44:40 faldara kernel: PM: Some devices failed to suspend, or early 
> wake event detected
> Nov 11 18:44:40 faldara kernel: xhci_hcd :06:00.0: xHC error in resume, 
> USBSTS 0x401, Reinit
> Nov 11 18:44:40 faldara kernel: usb usb3: root hub lost power or was reset
> Nov 11 18:44:40 faldara kernel: usb usb4: root hub lost power or was reset
> Nov 11 18:44:40 faldara kernel: serial 00:01: activated
> Nov 11 18:44:40 faldara kernel: nvme nvme0: 4/0/0 default/read/poll queues
> Nov 11 18:44:40 faldara kernel: ata8: SATA link down (SStatus 0 SControl 300)
> Nov 11 18:44:40 faldara kernel: ata7: SATA link down (SStatus 0 SControl 300)
> Nov 11 18:44:40 faldara kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 
> SControl 300)
> Nov 11 18:44:40 faldara kernel: ata1: SATA link down (SStatus 4 SControl 300)
> Nov 11 18:44:40 faldara kernel: ata3: SATA link down (SStatus 4 SControl 300)
> Nov 11 18:44:40 faldara kernel: ata4.00: configured for UDMA/133
> Nov 11 18:44:40 faldara kernel: OOM killer enabled.
> Nov 11 18:44:40 faldara kernel: Restarting tasks ... done.
> Nov 11 18:44:40 faldara kernel: random: crng reseeded on system resumption
> Nov 11 18:44:40 faldara kernel: PM: suspend exit
> Nov 11 18:44:40 faldara kernel: PM: suspend entry (s2idle)

Re: Radeon regression in 6.6 kernel

2023-11-12 Thread Phillip Susi
Bagas Sanjaya  writes:

> Please show the full bisect log, and also tell why these commits are
> skipped.

Two of them would not compile and one would not boot.

Here's the log.

# bad: [4bbdb725a36b0d235f3b832bd0c1e885f0442d9f] Merge tag 
'iommu-updates-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
# good: [94f6f0550c625fab1f373bb86a6669b45e9748b3] Linux 6.6-rc5
git bisect start 'HEAD' 'v6.6-rc5'
# good: [8bc9e6515183935fa0cccaf67455c439afe4982b] Merge tag 
'devicetree-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
git bisect good 8bc9e6515183935fa0cccaf67455c439afe4982b
# bad: [431f1051884e38d2a5751e4731d69b2ff289ee56] Merge tag 'leds-next-6.7' of 
git://git.kernel.org/pub/scm/linux/kernel/git/lee/leds
git bisect bad 431f1051884e38d2a5751e4731d69b2ff289ee56
# bad: [0364249d2073c32c5214f02866999ce940bc35a2] Merge tag 
'for-6.7/dm-changes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
git bisect bad 0364249d2073c32c5214f02866999ce940bc35a2
# good: [27442758e9b4e083bef3f164a1739475c01f3202] Merge tag 
'amd-drm-next-6.7-2023-10-13' of https://gitlab.freedesktop.org/agd5f/linux 
into drm-next
git bisect good 27442758e9b4e083bef3f164a1739475c01f3202
# bad: [631808095a82e6b6f8410a95f8b12b8d0d38b161] Merge tag 
'amd-drm-next-6.7-2023-10-27' of https://gitlab.freedesktop.org/agd5f/linux 
into drm-next
git bisect bad 631808095a82e6b6f8410a95f8b12b8d0d38b161
# skip: [0ecf4aa32b7896b9160688bdbd20153dc06a50fb] Merge tag 
'amd-drm-next-6.7-2023-10-20' of https://gitlab.freedesktop.org/agd5f/linux 
into drm-next
git bisect skip 0ecf4aa32b7896b9160688bdbd20153dc06a50fb
# good: [74ce0f3873821f12391bcf5469d81583d34f4c6c] accel/ivpu: Fix verbose 
version of REG_POLL macros
git bisect good 74ce0f3873821f12391bcf5469d81583d34f4c6c
# skip: [3f5ba636d6987ddffeaa056dea1c524da63912f3] Merge tag 
'drm-msm-next-2023-10-17' of https://gitlab.freedesktop.org/drm/msm into 
drm-next
git bisect skip 3f5ba636d6987ddffeaa056dea1c524da63912f3
# good: [a3cd664e7f971b0f33acb3ba790c142669cd34e5] accel/ivpu: Print IPC type 
string instead of number
git bisect good a3cd664e7f971b0f33acb3ba790c142669cd34e5
# good: [b0b0d811eac6b4c52cb9ad632fa6384cf48869e7] drm/mediatek: Fix coverity 
issue with unintentional integer overflow
git bisect good b0b0d811eac6b4c52cb9ad632fa6384cf48869e7
# good: [808b43fa7e56e94563b86af2703ba88ee156e3c2] drm/i915/dp_mst: Set 
connector DSC capabilities and decompression AUX
git bisect good 808b43fa7e56e94563b86af2703ba88ee156e3c2
# skip: [11ae5eb516b656e8a0e4efbea90ea24c152a346d] Merge tag 
'topic/vmemdup-user-array-2023-10-24-1' of 
git://anongit.freedesktop.org/drm/drm into drm-next
git bisect skip 11ae5eb516b656e8a0e4efbea90ea24c152a346d
# good: [a67f7a0b18c09d5b62eafb6d5c2f54e6f6ea6cf1] drm/amd/display: Update SDP 
VSC colorimetry from DP test automation request
git bisect good a67f7a0b18c09d5b62eafb6d5c2f54e6f6ea6cf1
# good: [dd3dd9829bf9a4ecd55482050745efdd9f7f97fc] drm/amdgpu: Remove unused 
variables from amdgpu_show_fdinfo
git bisect good dd3dd9829bf9a4ecd55482050745efdd9f7f97fc
# good: [b1abb484417ec8edd68df0c9bf8cb1c1fc035fd2] drm/ci: force-enable 
CONFIG_MSM_MMCC_8996 as built-in
git bisect good b1abb484417ec8edd68df0c9bf8cb1c1fc035fd2
# good: [5fa8f128462c5b3b20576b12286dca7fe95b3af1] drm/ci: increase i915 job 
timeout to 1h30m
git bisect good 5fa8f128462c5b3b20576b12286dca7fe95b3af1
# good: [3ddba96b0d7e714dee4db5aed4f7d413be43b4ba] MAINTAINERS: drm/ci: add 
entries for xfail files
git bisect good 3ddba96b0d7e714dee4db5aed4f7d413be43b4ba
# bad: [b70438004a14f4d0f9890b3297cd66248728546c] drm/amdgpu: move buffer funcs 
setting up a level
git bisect bad b70438004a14f4d0f9890b3297cd66248728546c
# skip: [56e449603f0ac580700621a356d35d5716a62ce5] drm/sched: Convert the GPU 
scheduler to variable number of run-queues
git bisect skip 56e449603f0ac580700621a356d35d5716a62ce5
# skip: [c07bf1636f0005f9eb7956404490672286ea59d3] MAINTAINERS: Update the GPU 
Scheduler email
git bisect skip c07bf1636f0005f9eb7956404490672286ea59d3
# only skipped commits left to test
# possible first bad commit: [b70438004a14f4d0f9890b3297cd66248728546c] 
drm/amdgpu: move buffer funcs setting up a level
# possible first bad commit: [c07bf1636f0005f9eb7956404490672286ea59d3] 
MAINTAINERS: Update the GPU Scheduler email
# possible first bad commit: [56e449603f0ac580700621a356d35d5716a62ce5] 
drm/sched: Convert the GPU scheduler to variable number of run-queues


Re: Radeon regression in 6.6 kernel

2023-11-12 Thread Bagas Sanjaya
On Sat, Nov 11, 2023 at 07:46:41PM -0500, Phillip Susi wrote:
> I had been testing some things on a post 6.6-rc5 kernel for a week or
> two and then when I pulled to a post 6.6 release kernel, I found that
> system suspend was broken.  It seems that the radeon driver failed to
> suspend, leaving the display dead, the wayland display server hung, and
> the system still running.  I have been trying to bisect it for the last
> few days and have only been able to narrow it down to the following 3
> commits:
> 
> There are only 'skip'ped commits left to test.
> The first bad commit could be any of:
> 56e449603f0ac580700621a356d35d5716a62ce5
> c07bf1636f0005f9eb7956404490672286ea59d3
> b70438004a14f4d0f9890b3297cd66248728546c
> We cannot bisect more!

Please show the full bisect log, and also tell why these commits are
skipped.

> 
> It appears that there was a late merge in the 6.6 window that originally
> forked from the -rc2, as many of the later commits that I bisected had
> that version number.
> 
> I couldn't get it more narrowed down because I had to skip the
> surrounding commits because they wouldn't even boot up to a gui desktop,
> let alone try to suspend.
> 
> When system suspend fails, I find the following in my syslog after I
> have to magic-sysrq reboot because the the display is dead:
> 
> Nov 11 18:44:39 faldara kernel: PM: suspend entry (deep)
> Nov 11 18:44:39 faldara kernel: Filesystems sync: 0.035 seconds
> Nov 11 18:44:40 faldara kernel: Freezing user space processes
> Nov 11 18:44:40 faldara kernel: Freezing user space processes completed 
> (elapsed 0.001 seconds)
> Nov 11 18:44:40 faldara kernel: OOM killer disabled.
> Nov 11 18:44:40 faldara kernel: Freezing remaining freezable tasks
> Nov 11 18:44:40 faldara kernel: Freezing remaining freezable tasks completed 
> (elapsed 0.001 seconds)
> Nov 11 18:44:40 faldara kernel: printk: Suspending console(s) (use 
> no_console_suspend to debug)
> Nov 11 18:44:40 faldara kernel: serial 00:01: disabled
> Nov 11 18:44:40 faldara kernel: e1000e: EEE TX LPI TIMER: 0011
> Nov 11 18:44:40 faldara kernel: sd 4:0:0:0: [sdb] Synchronizing SCSI cache
> Nov 11 18:44:40 faldara kernel: sd 1:0:0:0: [sda] Synchronizing SCSI cache
> Nov 11 18:44:40 faldara kernel: sd 5:0:0:0: [sdc] Synchronizing SCSI cache
> Nov 11 18:44:40 faldara kernel: sd 4:0:0:0: [sdb] Stopping disk
> Nov 11 18:44:40 faldara kernel: sd 1:0:0:0: [sda] Stopping disk
> Nov 11 18:44:40 faldara kernel: sd 5:0:0:0: [sdc] Stopping disk
> Nov 11 18:44:40 faldara kernel: amdgpu: Move buffer fallback to memcpy 
> unavailable
> Nov 11 18:44:40 faldara kernel: [TTM] Buffer eviction failed
> Nov 11 18:44:40 faldara kernel: [drm] evicting device resources failed
> Nov 11 18:44:40 faldara kernel: amdgpu :03:00.0: PM: pci_pm_suspend(): 
> amdgpu_pmops_suspend+0x0/0x80 [amdgpu] returns -19
> Nov 11 18:44:40 faldara kernel: amdgpu :03:00.0: PM: dpm_run_callback(): 
> pci_pm_suspend+0x0/0x170 returns -19
> Nov 11 18:44:40 faldara kernel: amdgpu :03:00.0: PM: failed to suspend 
> async: error -19
> Nov 11 18:44:40 faldara kernel: PM: Some devices failed to suspend, or early 
> wake event detected
> Nov 11 18:44:40 faldara kernel: xhci_hcd :06:00.0: xHC error in resume, 
> USBSTS 0x401, Reinit
> Nov 11 18:44:40 faldara kernel: usb usb3: root hub lost power or was reset
> Nov 11 18:44:40 faldara kernel: usb usb4: root hub lost power or was reset
> Nov 11 18:44:40 faldara kernel: serial 00:01: activated
> Nov 11 18:44:40 faldara kernel: nvme nvme0: 4/0/0 default/read/poll queues
> Nov 11 18:44:40 faldara kernel: ata8: SATA link down (SStatus 0 SControl 300)
> Nov 11 18:44:40 faldara kernel: ata7: SATA link down (SStatus 0 SControl 300)
> Nov 11 18:44:40 faldara kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 
> SControl 300)
> Nov 11 18:44:40 faldara kernel: ata1: SATA link down (SStatus 4 SControl 300)
> Nov 11 18:44:40 faldara kernel: ata3: SATA link down (SStatus 4 SControl 300)
> Nov 11 18:44:40 faldara kernel: ata4.00: configured for UDMA/133
> Nov 11 18:44:40 faldara kernel: OOM killer enabled.
> Nov 11 18:44:40 faldara kernel: Restarting tasks ... done.
> Nov 11 18:44:40 faldara kernel: random: crng reseeded on system resumption
> Nov 11 18:44:40 faldara kernel: PM: suspend exit
> Nov 11 18:44:40 faldara kernel: PM: suspend entry (s2idle)
> Nov 11 18:44:40 faldara systemd-networkd[384]: enp0s31f6: Gained IPv6LL
> Nov 11 18:44:40 faldara avahi-daemon[668]: Joining mDNS multicast group on 
> interface enp0s31f6.IPv6 with address fe80::3ad5:47ff:fe0f:488a.
> 
> My video card is this:
> 
> 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] 
> Navi 23 (rev c7) (prog-if 00 [VGA controller])
> Subsystem: Gigabyte Technology Co., Ltd Navi 23
> Flags: bus master, fast devsel, latency 0, IRQ 139
> Memory at e000 (64-bit, prefetchable) [size=256M]
> Memory at f000 (64-bit, prefetchable) [size=2M]
> I/O ports at e000 

Radeon regression in 6.6 kernel

2023-11-12 Thread Phillip Susi
I had been testing some things on a post 6.6-rc5 kernel for a week or
two and then when I pulled to a post 6.6 release kernel, I found that
system suspend was broken.  It seems that the radeon driver failed to
suspend, leaving the display dead, the wayland display server hung, and
the system still running.  I have been trying to bisect it for the last
few days and have only been able to narrow it down to the following 3
commits:

There are only 'skip'ped commits left to test.
The first bad commit could be any of:
56e449603f0ac580700621a356d35d5716a62ce5
c07bf1636f0005f9eb7956404490672286ea59d3
b70438004a14f4d0f9890b3297cd66248728546c
We cannot bisect more!

It appears that there was a late merge in the 6.6 window that originally
forked from the -rc2, as many of the later commits that I bisected had
that version number.

I couldn't get it more narrowed down because I had to skip the
surrounding commits because they wouldn't even boot up to a gui desktop,
let alone try to suspend.

When system suspend fails, I find the following in my syslog after I
have to magic-sysrq reboot because the the display is dead:

Nov 11 18:44:39 faldara kernel: PM: suspend entry (deep)
Nov 11 18:44:39 faldara kernel: Filesystems sync: 0.035 seconds
Nov 11 18:44:40 faldara kernel: Freezing user space processes
Nov 11 18:44:40 faldara kernel: Freezing user space processes completed 
(elapsed 0.001 seconds)
Nov 11 18:44:40 faldara kernel: OOM killer disabled.
Nov 11 18:44:40 faldara kernel: Freezing remaining freezable tasks
Nov 11 18:44:40 faldara kernel: Freezing remaining freezable tasks completed 
(elapsed 0.001 seconds)
Nov 11 18:44:40 faldara kernel: printk: Suspending console(s) (use 
no_console_suspend to debug)
Nov 11 18:44:40 faldara kernel: serial 00:01: disabled
Nov 11 18:44:40 faldara kernel: e1000e: EEE TX LPI TIMER: 0011
Nov 11 18:44:40 faldara kernel: sd 4:0:0:0: [sdb] Synchronizing SCSI cache
Nov 11 18:44:40 faldara kernel: sd 1:0:0:0: [sda] Synchronizing SCSI cache
Nov 11 18:44:40 faldara kernel: sd 5:0:0:0: [sdc] Synchronizing SCSI cache
Nov 11 18:44:40 faldara kernel: sd 4:0:0:0: [sdb] Stopping disk
Nov 11 18:44:40 faldara kernel: sd 1:0:0:0: [sda] Stopping disk
Nov 11 18:44:40 faldara kernel: sd 5:0:0:0: [sdc] Stopping disk
Nov 11 18:44:40 faldara kernel: amdgpu: Move buffer fallback to memcpy 
unavailable
Nov 11 18:44:40 faldara kernel: [TTM] Buffer eviction failed
Nov 11 18:44:40 faldara kernel: [drm] evicting device resources failed
Nov 11 18:44:40 faldara kernel: amdgpu :03:00.0: PM: pci_pm_suspend(): 
amdgpu_pmops_suspend+0x0/0x80 [amdgpu] returns -19
Nov 11 18:44:40 faldara kernel: amdgpu :03:00.0: PM: dpm_run_callback(): 
pci_pm_suspend+0x0/0x170 returns -19
Nov 11 18:44:40 faldara kernel: amdgpu :03:00.0: PM: failed to suspend 
async: error -19
Nov 11 18:44:40 faldara kernel: PM: Some devices failed to suspend, or early 
wake event detected
Nov 11 18:44:40 faldara kernel: xhci_hcd :06:00.0: xHC error in resume, 
USBSTS 0x401, Reinit
Nov 11 18:44:40 faldara kernel: usb usb3: root hub lost power or was reset
Nov 11 18:44:40 faldara kernel: usb usb4: root hub lost power or was reset
Nov 11 18:44:40 faldara kernel: serial 00:01: activated
Nov 11 18:44:40 faldara kernel: nvme nvme0: 4/0/0 default/read/poll queues
Nov 11 18:44:40 faldara kernel: ata8: SATA link down (SStatus 0 SControl 300)
Nov 11 18:44:40 faldara kernel: ata7: SATA link down (SStatus 0 SControl 300)
Nov 11 18:44:40 faldara kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 
SControl 300)
Nov 11 18:44:40 faldara kernel: ata1: SATA link down (SStatus 4 SControl 300)
Nov 11 18:44:40 faldara kernel: ata3: SATA link down (SStatus 4 SControl 300)
Nov 11 18:44:40 faldara kernel: ata4.00: configured for UDMA/133
Nov 11 18:44:40 faldara kernel: OOM killer enabled.
Nov 11 18:44:40 faldara kernel: Restarting tasks ... done.
Nov 11 18:44:40 faldara kernel: random: crng reseeded on system resumption
Nov 11 18:44:40 faldara kernel: PM: suspend exit
Nov 11 18:44:40 faldara kernel: PM: suspend entry (s2idle)
Nov 11 18:44:40 faldara systemd-networkd[384]: enp0s31f6: Gained IPv6LL
Nov 11 18:44:40 faldara avahi-daemon[668]: Joining mDNS multicast group on 
interface enp0s31f6.IPv6 with address fe80::3ad5:47ff:fe0f:488a.

My video card is this:

03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 
23 (rev c7) (prog-if 00 [VGA controller])
Subsystem: Gigabyte Technology Co., Ltd Navi 23
Flags: bus master, fast devsel, latency 0, IRQ 139
Memory at e000 (64-bit, prefetchable) [size=256M]
Memory at f000 (64-bit, prefetchable) [size=2M]
I/O ports at e000 [size=256]
Memory at f790 (32-bit, non-prefetchable) [size=1M]
Expansion ROM at 000c [disabled] [size=128K]
Capabilities: [48] Vendor Specific Information: Len=08 
Capabilities: [50] Power Management version 3
Capabilities: [64] Express Legacy Endpoint, MSI 00