Re: [git pull] drm for 6.10-rc1

2024-05-16 Thread Sean Christopherson
On Thu, May 16, 2024, Dave Airlie wrote:
> On Thu, 16 May 2024 at 08:56, Linus Torvalds  
> wrote:
> > If the *main* CONFIG_WERROR is on, then it does NOT MATTER if somebody
> > sets CONFIG_DRM_WERROR or not. It's a no-op. It's pointless.

+1

> It's also possible it's just that hey there's a few others in the tree
> 
> KVM_WERROR not tied to it
> PPC_WERROR (why does CXL use this?)
> AMDGPU, I915 and XE all have !COMPILE_TEST on their variants
> 
> We should probably add !WERROR to all of these at this point.

That creates its own weirdness though, e.g. I guarantee I'll forget about the
global WERROR at some point and wonder why I'm seeing -Werror despite having
KVM_WERROR=n in my .config.  I would rather force KVM_WERROR if WERROR=y, so 
this?

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 2a7f69abcac3..75082c4a9ac4 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -44,6 +44,7 @@ config KVM
 	select KVM_VFIO
 	select HAVE_KVM_PM_NOTIFIER if PM
 	select KVM_GENERIC_HARDWARE_ENABLING
+	select KVM_WERROR if WERROR
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
@@ -66,7 +67,7 @@ config KVM_WERROR
 	# FRAME_WARN, i.e. KVM_WERROR=y with KASAN=y requires special tuning.
 	# Building KVM with -Werror and KASAN is still doable via enabling
 	# the kernel-wide WERROR=y.
-	depends on KVM && EXPERT && !KASAN
+	depends on KVM && ((EXPERT && !KASAN) || WERROR)
 	help
 	  Add -Werror to the build flags for KVM.
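
A minimal standalone sketch of the pattern in this diff, using a hypothetical
FOO subsystem (FOO and FOO_WERROR are illustrative names, not real kernel
symbols; WERROR, EXPERT and KASAN stand in for the real ones). Because
"select" forces a symbol on without checking its dependencies, the "|| WERROR"
term keeps the forced selection from tripping an unmet-dependency warning when
EXPERT=n or KASAN=y:

```kconfig
# Hypothetical subsystem, mirroring the KVM change above.
config FOO
	tristate "Hypothetical FOO subsystem"
	# Global WERROR=y forces the per-subsystem option on, so a stale
	# FOO_WERROR=n in .config can no longer disagree with the build flags.
	select FOO_WERROR if WERROR

config FOO_WERROR
	bool "Compile FOO with -Werror"
	# Offered to EXPERT users without KASAN, or forced on via WERROR.
	depends on FOO && ((EXPERT && !KASAN) || WERROR)
	help
	  Add -Werror to the build flags for FOO.
```

With WERROR=y and FOO enabled, this should resolve to FOO_WERROR=y with
nothing left for the user to answer; with WERROR=n the option behaves as it
did before.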


Re: [git pull] drm for 6.10-rc1

2024-05-16 Thread Alex Deucher
On Thu, May 16, 2024 at 4:42 AM Jani Nikula  wrote:
>
> On Wed, 15 May 2024, Linus Torvalds  wrote:
> > On Wed, 15 May 2024 at 16:17, Dave Airlie  wrote:
> >> AMDGPU, I915 and XE all have !COMPILE_TEST on their variants
> >
> > Hmm.  It turns out that I didn't notice the AMDGPU one because my
> > Threadripper - that has AMDGPU enabled - I have actually turned off
> > EXPERT on, so it's hidden by that for me.
> >
> > But yes, both of those should be "depends on !WERROR" too.
>
> Fair enough. Honestly it just didn't occur to me.
>
> The main goal here was to ensure the drm subsystem does not have any
> build warnings, but without halting CI on any non-drm warnings that
> might occasionally creep in and that we can't fix as quickly.
>
> If there was a way to somehow limit WERROR by subdirectories, without
> config options, I'd love to ditch the config.

Right.  Same thing for amdgpu.  Our CI was often breaking due to
-WERROR in other subsystems or with compiler updates.  Maybe it's
better now.

Alex


>
> > Or maybe they should just go away entirely, and be subsumed by the
> > DRM_WERROR thing.
>
> For i915, this was the idea anyway, we just haven't gotten around to it
> yet.
>
>
> BR,
> Jani.
>
>
> --
> Jani Nikula, Intel


Re: [git pull] drm for 6.10-rc1

2024-05-16 Thread Jani Nikula
On Wed, 15 May 2024, Linus Torvalds  wrote:
> On Wed, 15 May 2024 at 16:17, Dave Airlie  wrote:
>> AMDGPU, I915 and XE all have !COMPILE_TEST on their variants
>
> Hmm.  It turns out that I didn't notice the AMDGPU one because my
> Threadripper - that has AMDGPU enabled - I have actually turned off
> EXPERT on, so it's hidden by that for me.
>
> But yes, both of those should be "depends on !WERROR" too.

Fair enough. Honestly it just didn't occur to me.

The main goal here was to ensure the drm subsystem does not have any
build warnings, but without halting CI on any non-drm warnings that
might occasionally creep in and that we can't fix as quickly.

If there was a way to somehow limit WERROR by subdirectories, without
config options, I'd love to ditch the config.

> Or maybe they should just go away entirely, and be subsumed by the
> DRM_WERROR thing.

For i915, this was the idea anyway, we just haven't gotten around to it
yet.


BR,
Jani.


-- 
Jani Nikula, Intel


Re: [git pull] drm for 6.10-rc1

2024-05-15 Thread Paneer Selvam, Arunpravin




On 5/16/2024 8:12 AM, Dave Airlie wrote:
> On Thu, 16 May 2024 at 10:06, Dave Airlie  wrote:
> >
> > On Thu, 16 May 2024 at 09:50, Dave Airlie  wrote:
> > >
> > > On Thu, 16 May 2024 at 06:29, Linus Torvalds
> > >  wrote:
> > > >
> > > > On Wed, 15 May 2024 at 13:24, Linus Torvalds
> > > >  wrote:
> > > > >
> > > > > I have to revert both
> > > > >
> > > > >   a68c7eaa7a8f ("drm/amdgpu: Enable clear page functionality")
> > > > >   e362b7c8f8c7 ("drm/amdgpu: Modify the contiguous flags behaviour")
> > > > >
> > > > > to make things build cleanly. Next step: see if it boots and fixes the
> > > > > problem for me.
> > > >
> > > > Well, perhaps not surprisingly, the WARN_ON() no longer triggers with
> > > > this, and everything looks fine.
> > > >
> > > > Let's see if the machine ends up being stable now. It took several
> > > > hours for the "scary messages" state to turn into the "hung machine"
> > > > state, so they *could* have been independent issues, but it seems a
> > > > bit unlikely.
> > >
> > > I think that should be fine to do for now.
> > >
> > > I think it is also fine to do like I've attached, but I'm not sure if
> > > I'd take that chance.
> >
> > Scrap that idea, doesn't die, but it makes my system unhappy, like
> > fbdev missing,
> >
> > so for quickest path forward, just make the two reverts seems best.
> >
> > I've reproduced it here, so I'll track it down,
>
> https://lore.kernel.org/amd-gfx/20240514145636.16253-1-arunpravin.paneersel...@amd.com/T/#t
>
> This patch seems to fix it for me, I might just pull it into my tree
> and send it to you.

Sorry for the noise, Dave's link is the right fix for this issue. Have
you already picked it up or should I push it to drm-misc-next-fixes?

Thanks,
Arun.

>
> Dave.


Re: [git pull] drm for 6.10-rc1

2024-05-15 Thread Dave Airlie
On Thu, 16 May 2024 at 10:06, Dave Airlie  wrote:
>
> On Thu, 16 May 2024 at 09:50, Dave Airlie  wrote:
> >
> > On Thu, 16 May 2024 at 06:29, Linus Torvalds
> >  wrote:
> > >
> > > On Wed, 15 May 2024 at 13:24, Linus Torvalds
> > >  wrote:
> > > >
> > > > I have to revert both
> > > >
> > > >   a68c7eaa7a8f ("drm/amdgpu: Enable clear page functionality")
> > > >   e362b7c8f8c7 ("drm/amdgpu: Modify the contiguous flags behaviour")
> > > >
> > > > to make things build cleanly. Next step: see if it boots and fixes the
> > > > problem for me.
> > >
> > > Well, perhaps not surprisingly, the WARN_ON() no longer triggers with
> > > this, and everything looks fine.
> > >
> > > Let's see if the machine ends up being stable now. It took several
> > > hours for the "scary messages" state to turn into the "hung machine"
> > > state, so they *could* have been independent issues, but it seems a
> > > bit unlikely.
> >
> > I think that should be fine to do for now.
> >
> > I think it is also fine to do like I've attached, but I'm not sure if
> > I'd take that chance.
>
> Scrap that idea, doesn't die, but it makes my system unhappy, like
> fbdev missing,
>
> so for quickest path forward, just make the two reverts seems best.
>
> I've reproduced it here, so I'll track it down,

https://lore.kernel.org/amd-gfx/20240514145636.16253-1-arunpravin.paneersel...@amd.com/T/#t

This patch seems to fix it for me, I might just pull it into my tree
and send it to you.

Dave.


Re: [git pull] drm for 6.10-rc1

2024-05-15 Thread Dave Airlie
On Thu, 16 May 2024 at 09:50, Dave Airlie  wrote:
>
> On Thu, 16 May 2024 at 06:29, Linus Torvalds
>  wrote:
> >
> > On Wed, 15 May 2024 at 13:24, Linus Torvalds
> >  wrote:
> > >
> > > I have to revert both
> > >
> > >   a68c7eaa7a8f ("drm/amdgpu: Enable clear page functionality")
> > >   e362b7c8f8c7 ("drm/amdgpu: Modify the contiguous flags behaviour")
> > >
> > > to make things build cleanly. Next step: see if it boots and fixes the
> > > problem for me.
> >
> > Well, perhaps not surprisingly, the WARN_ON() no longer triggers with
> > this, and everything looks fine.
> >
> > Let's see if the machine ends up being stable now. It took several
> > hours for the "scary messages" state to turn into the "hung machine"
> > state, so they *could* have been independent issues, but it seems a
> > bit unlikely.
>
> I think that should be fine to do for now.
>
> I think it is also fine to do like I've attached, but I'm not sure if
> I'd take that chance.

Scrap that idea, doesn't die, but it makes my system unhappy, like
fbdev missing,

so for quickest path forward, just make the two reverts seems best.

I've reproduced it here, so I'll track it down,

Dave.


Re: [git pull] drm for 6.10-rc1

2024-05-15 Thread Linus Torvalds
On Wed, 15 May 2024 at 16:51, Dave Airlie  wrote:
>
> > Let's see if the machine ends up being stable now. It took several
> > hours for the "scary messages" state to turn into the "hung machine"
> > state, so they *could* have been independent issues, but it seems a
> > bit unlikely.
>
> This worries me actually, it's possible this warn could cause a
> problem, but I'm not convinced it should have machine ending
> properties without some sort of different error at the end, so I'd
> keep an eye open here.

Well, since I'm a big believer in dogfooding, I always run my own
kernel even during the merge window. I don't reboot between each pull,
but I try to basically reboot daily.

And it's entirely possible that the eventual "bad page flags" error -
which is what I think triggered the eventual hang - is something else
that came in during this merge window.

I haven't actually gotten the -mm changes from Andrew yet, but it did
happen in the btrfs kworker, and I have merged the btrfs changes for
6.10.  So maybe they are the cause.

I was blaming the DRM case mainly because it clearly *was* about some
kind of allocation management, and I got a *lot* of those warnings:

$ journalctl -b -1 | grep 'WARNING: CPU' | wc -l
  16015

but let's see if it happens with my amdgpu reverts in place, and no
drm warnings.

It most definitely wouldn't be the first time we had multiple
independent bugs during the merge window ;/

  Linus


Re: [git pull] drm for 6.10-rc1

2024-05-15 Thread Linus Torvalds
On Wed, 15 May 2024 at 16:17, Dave Airlie  wrote:
>
> It's also possible it's just that hey there's a few others in the tree
>
> KVM_WERROR not tied to it
> PPC_WERROR (why does CXL use this?)

Yeah, that should be fixed too, but at least KVM_WERROR predates the
whole-kernel WERROR.

And PPC_WERROR predates it by over a decade.

But yes, good catch - both of those should be silenced if we already
have the global WERROR enabled.

I mainly notice new questions (because I use "make oldconfig"), so old
pre-existing illogical ones don't trigger my "why are they asking?"
reaction.

> AMDGPU, I915 and XE all have !COMPILE_TEST on their variants

Hmm.  It turns out that I didn't notice the AMDGPU one because my
Threadripper - that has AMDGPU enabled - I have actually turned off
EXPERT on, so it's hidden by that for me.

But yes, both of those should be "depends on !WERROR" too.

Or maybe they should just go away entirely, and be subsumed by the
DRM_WERROR thing.

   Linus


Re: [git pull] drm for 6.10-rc1

2024-05-15 Thread Dave Airlie
On Thu, 16 May 2024 at 06:29, Linus Torvalds
 wrote:
>
> On Wed, 15 May 2024 at 13:24, Linus Torvalds
>  wrote:
> >
> > I have to revert both
> >
> >   a68c7eaa7a8f ("drm/amdgpu: Enable clear page functionality")
> >   e362b7c8f8c7 ("drm/amdgpu: Modify the contiguous flags behaviour")
> >
> > to make things build cleanly. Next step: see if it boots and fixes the
> > problem for me.
>
> Well, perhaps not surprisingly, the WARN_ON() no longer triggers with
> this, and everything looks fine.
>
> Let's see if the machine ends up being stable now. It took several
> hours for the "scary messages" state to turn into the "hung machine"
> state, so they *could* have been independent issues, but it seems a
> bit unlikely.

This worries me actually, it's possible this warn could cause a
problem, but I'm not convinced it should have machine ending
properties without some sort of different error at the end, so I'd
keep an eye open here.

Dave.


Re: [git pull] drm for 6.10-rc1

2024-05-15 Thread Dave Airlie
On Thu, 16 May 2024 at 06:29, Linus Torvalds
 wrote:
>
> On Wed, 15 May 2024 at 13:24, Linus Torvalds
>  wrote:
> >
> > I have to revert both
> >
> >   a68c7eaa7a8f ("drm/amdgpu: Enable clear page functionality")
> >   e362b7c8f8c7 ("drm/amdgpu: Modify the contiguous flags behaviour")
> >
> > to make things build cleanly. Next step: see if it boots and fixes the
> > problem for me.
>
> Well, perhaps not surprisingly, the WARN_ON() no longer triggers with
> this, and everything looks fine.
>
> Let's see if the machine ends up being stable now. It took several
> hours for the "scary messages" state to turn into the "hung machine"
> state, so they *could* have been independent issues, but it seems a
> bit unlikely.

I think that should be fine to do for now.

I think it is also fine to do like I've attached, but I'm not sure if
I'd take that chance.

Two questions for Arunpravin (and Alex):

Is this fix correct, and can we get a good explanation of it?

Where did this error sneak in? Is the problem in the amdgpu tree, or
was it a drm-next only problem? If so perhaps we need to discuss
moving amdgpu more into drm-tip to catch this sort of problem.

Dave.
From 085b89278f296c40e86f5d1e1bcc1017c39f4002 Mon Sep 17 00:00:00 2001
From: Dave Airlie 
Date: Thu, 16 May 2024 09:46:37 +1000
Subject: [PATCH] drm/buddy: convert WARN_ON to an if + continue

This WARN_ON triggers a lot, but I don't think the __force_merge
path always has to succeed, so just return a failure here instead
of warn on to let other paths handle the allocation.

(Not 100% sure on this patch - airlied).
---
 drivers/gpu/drm/drm_buddy.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
index 284ebae71cc4..6b90ec6eefa8 100644
--- a/drivers/gpu/drm/drm_buddy.c
+++ b/drivers/gpu/drm/drm_buddy.c
@@ -195,8 +195,9 @@ static int __force_merge(struct drm_buddy *mm,
 			if (!drm_buddy_block_is_free(buddy))
 				continue;
 
-			WARN_ON(drm_buddy_block_is_clear(block) ==
-				drm_buddy_block_is_clear(buddy));
+			if (drm_buddy_block_is_clear(block) !=
+			    drm_buddy_block_is_clear(buddy))
+				continue;
 
 			/*
 			 * If the prev block is same as buddy, don't access the
-- 
2.44.0



Re: [git pull] drm for 6.10-rc1

2024-05-15 Thread Dave Airlie
On Thu, 16 May 2024 at 08:56, Linus Torvalds
 wrote:
>
> On Wed, 15 May 2024 at 15:45, Dave Airlie  wrote:
> >
> >   The drm subsystem enables more warnings than the kernel default, so
> >   this config option is disabled by default.
>
> Irrelevant.
>
> If the *main* CONFIG_WERROR is on, then it does NOT MATTER if somebody
> sets CONFIG_DRM_WERROR or not. It's a no-op. It's pointless.
>
> And that means that it's also entirely pointless to ask. It's only annoying.
>
> > depends on DRM && EXPERT
> >
> > so we aren't throwing it at random users.
>
> Yes you are.
>
> Because - rightly or wrongly - distros enable EXPERT by default. At
> least Fedora does. So any user that starts from a distro config will
> have EXPERT enabled.
>
> > should we rename it CONFIG_DRM_WERROR_MORE or something?
>
> Renaming does nothing. If it's pointless, it's pointless even if it's renamed.
>
> It needs to have a
>
>depends on !WERROR
>
> because if WERROR is already true, then it's stupid and wrong to ask AGAIN.
>
> To summarize: if the main WERROR is enabled, then the DRM tree is
> *ALREADY* built with WERROR. Asking for DRM_WERROR is wrong.
>
> I keep harping on bad config variables because our kernel config thing
> is already much too messy and is by far the most difficult part of
> building your own kernel.
>
> Everything else is literally just "make" followed by "make
> modules_install" and "make install". Very straightforward.
>
> But doing a kernel config? Nasty. And made nastier by bad and
> nonsensical questions.

It's also possible it's just that hey there's a few others in the tree

KVM_WERROR not tied to it
PPC_WERROR (why does CXL use this?)
AMDGPU, I915 and XE all have !COMPILE_TEST on their variants

We should probably add !WERROR to all of these at this point.

Adding Jani who was the initial author of

commit f89632a9e5fa6c4787c14458cd42a9ef42025434
Author: Jani Nikula 
Date:   Tue Mar 5 11:07:36 2024 +0200

drm: Add CONFIG_DRM_WERROR

where I see we actually removed the !COMPILE_TEST check in v2.

Dave.


Re: [git pull] drm for 6.10-rc1

2024-05-15 Thread Linus Torvalds
On Wed, 15 May 2024 at 15:45, Dave Airlie  wrote:
>
>   The drm subsystem enables more warnings than the kernel default, so
>   this config option is disabled by default.

Irrelevant.

If the *main* CONFIG_WERROR is on, then it does NOT MATTER if somebody
sets CONFIG_DRM_WERROR or not. It's a no-op. It's pointless.

And that means that it's also entirely pointless to ask. It's only annoying.

> depends on DRM && EXPERT
>
> so we aren't throwing it at random users.

Yes you are.

Because - rightly or wrongly - distros enable EXPERT by default. At
least Fedora does. So any user that starts from a distro config will
have EXPERT enabled.

> should we rename it CONFIG_DRM_WERROR_MORE or something?

Renaming does nothing. If it's pointless, it's pointless even if it's renamed.

It needs to have a

   depends on !WERROR

because if WERROR is already true, then it's stupid and wrong to ask AGAIN.

To summarize: if the main WERROR is enabled, then the DRM tree is
*ALREADY* built with WERROR. Asking for DRM_WERROR is wrong.

I keep harping on bad config variables because our kernel config thing
is already much too messy and is by far the most difficult part of
building your own kernel.

Everything else is literally just "make" followed by "make
modules_install" and "make install". Very straightforward.

But doing a kernel config? Nasty. And made nastier by bad and
nonsensical questions.

Linus
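
A sketch of what that dependency could look like for the DRM option. The
prompt string and the abbreviated help text here are assumptions for
illustration; only the symbol names and the two "depends on" conditions come
from the discussion above:

```kconfig
config DRM_WERROR
	bool "Compile the drm subsystem with warnings as errors"
	depends on DRM && EXPERT
	# Hide the question entirely when the kernel-wide option is already set;
	# the DRM tree is then built with -Werror regardless of this symbol.
	depends on !WERROR
	default n
	help
	  The drm subsystem enables more warnings than the kernel default, so
	  this config option is disabled by default.
```

With WERROR=y the symbol is invisible and stays n, so "make oldconfig" should
never ask about it.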


Re: [git pull] drm for 6.10-rc1

2024-05-15 Thread Dave Airlie
On Thu, 16 May 2024 at 06:43, Linus Torvalds
 wrote:
>
> On Tue, 14 May 2024 at 23:21, Dave Airlie  wrote:
> >
> > This is the main pull request for the drm subsystems for 6.10.
>
> .. and now that I look more at this pull request, I find other things wrong.
>
> Why is the DRM code asking if I want to enable -Werror? I have Werror
> enabled *already*.
>
> I hate stupid config questions. They only confuse users.
>
> If the global WERROR config is enabled, then the DRM config certainly
> shouldn't ask whether you want even more -Werror. It does nothing but
> annoy people.
>
> And no, we are not going to have subsystems that can *weaken* the
> existing CONFIG_WERROR. Happily, that doesn't seem to be what the DRM
> code wants to do, it just wants to add -Werror, but as mentioned, it's
> crazy to do that when we already have it globally enabled.
>
> Now, it might make more sense to ask if you want -Wextra. A lot of
> those warnings are bogus.

The help says:

  The drm subsystem enables more warnings than the kernel default, so
  this config option is disabled by default.

It's also

depends on DRM && EXPERT

so we aren't throwing it at random users.

should we rename it CONFIG_DRM_WERROR_MORE or something?

Dave.


Re: [git pull] drm for 6.10-rc1

2024-05-15 Thread Linus Torvalds
On Tue, 14 May 2024 at 23:21, Dave Airlie  wrote:
>
> This is the main pull request for the drm subsystems for 6.10.

.. and now that I look more at this pull request, I find other things wrong.

Why is the DRM code asking if I want to enable -Werror? I have Werror
enabled *already*.

I hate stupid config questions. They only confuse users.

If the global WERROR config is enabled, then the DRM config certainly
shouldn't ask whether you want even more -Werror. It does nothing but
annoy people.

And no, we are not going to have subsystems that can *weaken* the
existing CONFIG_WERROR. Happily, that doesn't seem to be what the DRM
code wants to do, it just wants to add -Werror, but as mentioned, it's
crazy to do that when we already have it globally enabled.

Now, it might make more sense to ask if you want -Wextra. A lot of
those warnings are bogus.

   Linus


Re: [git pull] drm for 6.10-rc1

2024-05-15 Thread Linus Torvalds
On Wed, 15 May 2024 at 13:24, Linus Torvalds
 wrote:
>
> I have to revert both
>
>   a68c7eaa7a8f ("drm/amdgpu: Enable clear page functionality")
>   e362b7c8f8c7 ("drm/amdgpu: Modify the contiguous flags behaviour")
>
> to make things build cleanly. Next step: see if it boots and fixes the
> problem for me.

Well, perhaps not surprisingly, the WARN_ON() no longer triggers with
this, and everything looks fine.

Let's see if the machine ends up being stable now. It took several
hours for the "scary messages" state to turn into the "hung machine"
state, so they *could* have been independent issues, but it seems a
bit unlikely.

   Linus


Re: [git pull] drm for 6.10-rc1

2024-05-15 Thread Linus Torvalds
On Wed, 15 May 2024 at 13:21, Linus Torvalds
 wrote:
>
> I guess I'll try to revert the later commit that enables it for amdgpu
> (commit a68c7eaa7a8f) and see if it at least makes the horrendous
> messages go away.

I have to revert both

  a68c7eaa7a8f ("drm/amdgpu: Enable clear page functionality")
  e362b7c8f8c7 ("drm/amdgpu: Modify the contiguous flags behaviour")

to make things build cleanly. Next step: see if it boots and fixes the
problem for me.

  Linus


Re: [git pull] drm for 6.10-rc1

2024-05-15 Thread Linus Torvalds
On Wed, 15 May 2024 at 13:06, Linus Torvalds
 wrote:
>
> Hmm. There's something seriously wrong with amdgpu.
>
> I'm getting a ton of __force_merge warnings:
>
>   WARNING: CPU: 0 PID: 1069 at drivers/gpu/drm/drm_buddy.c:199
> __force_merge+0x14f/0x180 [drm_buddy]

Adding likely culprits to the participants, since it looks like this
is all new with commit 96950929eb23 ("drm/buddy: Implement tracking
clear page feature").

Sadly I can't just revert that commit to check, because there are
many subsequent commits that then depend on it.

I guess I'll try to revert the later commit that enables it for amdgpu
(commit a68c7eaa7a8f) and see if it at least makes the horrendous
messages go away.

Anyway, this is some old Radeon graphics card in my Threadripper:

49:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
[AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev e7)
(prog-if 00 [VGA controller])
Subsystem: Sapphire Technology Limited Radeon RX 570 Pulse 4GB
Flags: bus master, fast devsel, latency 0, IRQ 130, IOMMU group 32
Memory at c000 (64-bit, prefetchable) [size=256M]
Memory at d000 (64-bit, prefetchable) [size=2M]
I/O ports at 8000 [size=256]
Memory at d1c0 (32-bit, non-prefetchable) [size=256K]
Expansion ROM at 000c [disabled] [size=128K]
Capabilities: 
Kernel driver in use: amdgpu
Kernel modules: amdgpu

I think it's a "Sapphire Radeon Pulse RX 580" or something like that.

Linus


Re: [git pull] drm for 6.10-rc1

2024-05-15 Thread Linus Torvalds
On Tue, 14 May 2024 at 23:21, Dave Airlie  wrote:
>
> In drivers the main thing is a new driver for ARM Mali firmware based
> GPUs, otherwise there are a lot of changes to amdgpu/xe/i915/msm and
> scattered changes to everything else.

Hmm. There's something seriously wrong with amdgpu.

I'm getting a ton of __force_merge warnings:

  WARNING: CPU: 0 PID: 1069 at drivers/gpu/drm/drm_buddy.c:199
__force_merge+0x14f/0x180 [drm_buddy]
  Modules linked in: hid_logitech_hidpp hid_logitech_dj uas
usb_storage amdgpu drm_ttm_helper ttm video drm_exec
drm_suballoc_helper amdxcp drm_buddy gpu_sched drm_display_helper
drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel drm
ghash_clmulni_intel igb atlantic nvme dca macsec ccp i2c_algo_bit
nvme_core sp5100_tco wmi ip6_tables ip_tables fuse
  CPU: 0 PID: 1069 Comm: plymouthd Not tainted 6.9.0-07381-g3860ca371740 #60
  Hardware name: Gigabyte Technology Co., Ltd. TRX40 AORUS
MASTER/TRX40 AORUS MASTER, BIOS F7 09/07/2022
  RIP: 0010:__force_merge+0x14f/0x180 [drm_buddy]
  Code: 74 0d 49 8b 44 24 18 48 d3 e0 49 29 44 24 30 4c 89 e7 ba 01 00
00 00 e8 9f 00 00 00 44 39 e8 73 1f 49 8b 04 24 e9 25 ff ff ff <0f> 0b
4c 39 c3 75 a3 eb 99 b8 f4 ff ff ff c3 b8 f4 ff ff ff eb 02
  RSP: 0018:b87a81cb7908 EFLAGS: 00010246
  RAX: 9b1915de8000 RBX: 9b1919478288 RCX: 0800
  RDX: 9b19194782f8 RSI: 9b19194782d0 RDI: 9b19194782b0
  RBP:  R08: 9b1919478288 R09: 1000
  R10: 0800 R11:  R12: 9b192590fa18
  R13: 000d R14: 1000 R15: 
  FS:  7fa06bfa9740() GS:9b281e00() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 555adb857000 CR3: 00011b516000 CR4: 00350ef0
  Call Trace:
   ? __force_merge+0x14f/0x180 [drm_buddy]
   drm_buddy_alloc_blocks+0x249/0x400 [drm_buddy]
   ? __cond_resched+0x16/0x40
   amdgpu_vram_mgr_new+0x204/0x3f0 [amdgpu]
   ttm_resource_alloc+0x31/0x120 [ttm]
   ttm_bo_alloc_resource+0xbc/0x260 [ttm]
   ttm_bo_validate+0x9f/0x210 [ttm]
   ttm_bo_init_reserved+0x103/0x130 [ttm]
   amdgpu_bo_create+0x246/0x400 [amdgpu]
   ? amdgpu_bo_destroy+0x70/0x70 [amdgpu]
   amdgpu_bo_create_user+0x29/0x40 [amdgpu]
   amdgpu_mode_dumb_create+0x108/0x190 [amdgpu]
   ? amdgpu_bo_destroy+0x70/0x70 [amdgpu]
   ? drm_mode_create_dumb+0xa0/0xa0 [drm]
   drm_ioctl_kernel+0xad/0xd0 [drm]
   drm_ioctl+0x330/0x4b0 [drm]
   ? drm_mode_create_dumb+0xa0/0xa0 [drm]
   amdgpu_drm_ioctl+0x41/0x80 [amdgpu]
   __x64_sys_ioctl+0xd2a/0xe00
   ? update_process_times+0x89/0xa0
   ? tick_nohz_handler+0xe2/0x120
   ? timerqueue_add+0x94/0xa0
   ? __hrtimer_run_queues+0x12b/0x250
   ? ktime_get+0x34/0xb0
   ? lapic_next_event+0x12/0x20
   ? clockevents_program_event+0x78/0xd0
   ? hrtimer_interrupt+0x118/0x390
   ? sched_clock+0x5/0x10
   do_syscall_64+0x68/0x130
   ? __irq_exit_rcu+0x53/0xb0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53

and eventually the whole thing just crashes entirely, with a bad page
state in the VM:

  BUG: Bad page state in process kworker/u261:13  pfn:31fb9a
  page: refcount:0 mapcount:0 mapping:ff0b239e index:0x37ce8
pfn:0x31fb9a
  aops:btree_aops ino:1
  flags: 
0x2fffc60020c(referenced|uptodate|workingset|node=0|zone=2|lastcpupid=0x3fff)
  page_type: 0x()

which comes from a btrfs worker (btrfs-delayed-meta
btrfs_work_helper), but I would not be surprised if that was caused by
whatever odd thing is going on with the DRM code. IOW, it *looks* like
this code ends up just corrupting memory in horrible ways.

Linus


Re: [git pull] drm for 6.10-rc1

2024-05-15 Thread pr-tracker-bot
The pull request you sent on Wed, 15 May 2024 16:20:56 +1000:

> https://gitlab.freedesktop.org/drm/kernel.git tags/drm-next-2024-05-15

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/db5d28c0bfe566908719bec8e25443aabecbb802

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html