On Monday, 02.02.2026 at 10:11 -0600, Mario Limonciello wrote:
> On 2/2/26 8:35 AM, Christian König wrote:
> > On 2/2/26 15:25, Mario Limonciello wrote:
> > > On 1/31/26 6:24 PM, Bert Karwatzki wrote:
> > > > This reverts commit 7294863a6f01248d72b61d38478978d638641bee.
> > > >
> > > > This commit was erroneously applied again after commit 0ab5d711ec74
> > > > ("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device")
> > > > removed it, leading to very hard-to-debug crashes on systems with
> > > > two AMD GPUs of which only one supports ASPM.
> > > >
> > > > Link: https://lore.kernel.org/linux-acpi/[email protected]/
> > > > Link: https://github.com/acpica/acpica/issues/1060
> > > > Fixes: 0ab5d711ec74 ("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device")
> > > >
> > > > Signed-off-by: Bert Karwatzki <[email protected]>
> > > > ---
> > >
> > > Amazing detective work, thanks so much.
> > >
> > > This added the code initially:
> > > cba07cce39ace drm/amd: Check if ASPM is enabled from PCIe subsystem
> > >
> > > This effectively removed it:
> > > 0ab5d711ec74d drm/amd: Refactor `amdgpu_aspm` to be evaluated per device
> > >
> > > This was the accidental re-apply:
> > > 7294863a6f012 drm/amd: Check if ASPM is enabled from PCIe subsystem
> > >
> > > It looks like this was right on the edge of 5.17-rc6 and 5.18-rc1.
> > > I think drm-fixes-2022-02-25 and amd-drm-next-5.18-2022-02-25 ended up
> > > with different content.
> > >
> > > Nonetheless this is the correct change and I've applied it to
> > > amd-staging-drm-next.
> > >
> > > Reviewed-by: Mario Limonciello (AMD) <[email protected]>
> >
> > Reviewed-by: Christian König <[email protected]>
> >
> > There is just one major question left: Why is disabling ASPM causing
> > problems?
> >
>
> My theory is that it's a mismatch between the PCIe core and amdgpu, i.e.
> if the PCIe core thinks ASPM is enabled but amdgpu thinks it is disabled,
> we can hit some corner cases.
That's also my theory. In my case the discrete GPU is probed first:
[ 1.652505] [ T194] amdgpu 0000:03:00.0: enabling device (0000 -> 0002)
[ 1.658662] [ T194] amdgpu 0000:03:00.0: amdgpu: initializing kernel modesetting (DIMGREY_CAVEFISH 0x1002:0x73FF 0x1462:0x1313 0xC3).
[ 1.665045] [ T194] amdgpu 0000:03:00.0: amdgpu: register mmio base: 0xFCA00000
[ 1.671399] [ T194] amdgpu 0000:03:00.0: amdgpu: register mmio size: 1048576
[ 1.681596] [ T194] amdgpu 0000:03:00.0: amdgpu: detected ip block number 0 <common_v1_0_0> (nv_common)
Then the built-in GPU is probed and sets amdgpu_aspm = 0:
[ 4.883191] [ T194] amdgpu 0000:08:00.0: enabling device (0006 -> 0007)
[ 4.890078] [ T194] amdgpu 0000:08:00.0: amdgpu: initializing kernel modesetting (RENOIR 0x1002:0x1638 0x1462:0x1313 0xC5).
[ 4.895907] [ T194] amdgpu 0000:08:00.0: amdgpu: register mmio base: 0xFC900000
[ 4.901640] [ T194] amdgpu 0000:08:00.0: amdgpu: register mmio size: 524288
[ 4.909833] [ T194] amdgpu 0000:08:00.0: amdgpu: detected ip block number 0 <common_v2_0_0> (soc15_common)
I'm going to monitor calls to amdgpu_device_should_use_aspm() to check
whether it is called during the suspend/resume cycle and gives the wrong
answer (i.e. false when ASPM is actually enabled).
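
The suspected mismatch can be illustrated with a small user-space model.
This is only a sketch, not kernel code: model_aspm_param, struct model_dev,
model_probe() and model_should_use_aspm() are hypothetical names, and the
logic is a simplified assumption about how a driver-wide flag set at probe
time could override a per-device answer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a driver-wide parameter:
 * -1 = auto, 0 = force off, 1 = force on. */
int model_aspm_param = -1;

/* Hypothetical per-device state; NOT the kernel's struct amdgpu_device. */
struct model_dev {
	const char *bdf;         /* PCI address, for readability only */
	bool pcie_aspm_enabled;  /* what the PCIe core reports for this device */
};

/* Models a probe-time check that writes the global flag: a single device
 * without ASPM switches the flag off for every device of the driver. */
void model_probe(const struct model_dev *dev)
{
	assert(dev != NULL);
	if (model_aspm_param == -1 && !dev->pcie_aspm_enabled)
		model_aspm_param = 0;
}

/* Models a per-device query that consults the global flag first and only
 * falls back to the per-device PCIe state in "auto" mode. */
bool model_should_use_aspm(const struct model_dev *dev)
{
	assert(dev != NULL);
	if (model_aspm_param == 0)
		return false;
	if (model_aspm_param == 1)
		return true;
	return dev->pcie_aspm_enabled;
}
```

In this model, probing a device with ASPM enabled (like 0000:03:00.0) and
then one without (like 0000:08:00.0) leaves model_aspm_param at 0, so
model_should_use_aspm() returns false for the first device even though its
pcie_aspm_enabled is still true.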
Bert Karwatzki