On Monday, 02.02.2026 at 10:11 -0600, Mario Limonciello wrote:
> On 2/2/26 8:35 AM, Christian König wrote:
> > On 2/2/26 15:25, Mario Limonciello wrote:
> > > On 1/31/26 6:24 PM, Bert Karwatzki wrote:
> > > > This reverts commit 7294863a6f01248d72b61d38478978d638641bee.
> > > > 
> > > > This commit was erroneously applied again after commit 0ab5d711ec74
> > > > ("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device")
> > > > removed it, leading to very hard-to-debug crashes on systems with
> > > > two AMD GPUs of which only one supports ASPM.
> > > > 
> > > > Link: https://lore.kernel.org/linux-acpi/[email protected]/
> > > > Link: https://github.com/acpica/acpica/issues/1060
> > > > Fixes: 0ab5d711ec74 ("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device")
> > > > 
> > > > Signed-off-by: Bert Karwatzki <[email protected]>
> > > > ---
> > > 
> > > Amazing detective work, thanks so much.
> > > 
> > > This added the code initially:
> > > cba07cce39ace drm/amd: Check if ASPM is enabled from PCIe subsystem
> > > 
> > > This effectively removed it:
> > > 0ab5d711ec74d drm/amd: Refactor `amdgpu_aspm` to be evaluated per device
> > > 
> > > This was the accidental re-apply:
> > > 7294863a6f012 drm/amd: Check if ASPM is enabled from PCIe subsystem
> > > 
> > > It looks like this was right on the boundary between 5.17-rc6 and
> > > 5.18-rc1. I think drm-fixes-2022-02-25 and
> > > amd-drm-next-5.18-2022-02-25 ended up with different content.
> > > 
> > > Nonetheless, this is the correct change and I've applied it to
> > > amd-staging-drm-next.
> > > 
> > > Reviewed-by: Mario Limonciello (AMD) <[email protected]>
> > 
> > Reviewed-by: Christian König <[email protected]>
> > 
> > There is just one major question left: Why is disabling ASPM causing 
> > problems?
> > 
> 
> My theory is that it's a mismatch between the PCIe core and amdgpu, i.e.
> if the PCIe core thinks ASPM is enabled but amdgpu thinks it is disabled,
> we can hit some corner cases.

That's also my theory. In my case the discrete GPU is probed first:

[    1.652505] [    T194] amdgpu 0000:03:00.0: enabling device (0000 -> 0002)
[    1.658662] [    T194] amdgpu 0000:03:00.0: amdgpu: initializing kernel modesetting (DIMGREY_CAVEFISH 0x1002:0x73FF 0x1462:0x1313 0xC3).
[    1.665045] [    T194] amdgpu 0000:03:00.0: amdgpu: register mmio base: 0xFCA00000
[    1.671399] [    T194] amdgpu 0000:03:00.0: amdgpu: register mmio size: 1048576
[    1.681596] [    T194] amdgpu 0000:03:00.0: amdgpu: detected ip block number 0 <common_v1_0_0> (nv_common)

Then the built-in GPU is probed and sets amdgpu_aspm = 0:

[    4.883191] [    T194] amdgpu 0000:08:00.0: enabling device (0006 -> 0007)
[    4.890078] [    T194] amdgpu 0000:08:00.0: amdgpu: initializing kernel modesetting (RENOIR 0x1002:0x1638 0x1462:0x1313 0xC5).
[    4.895907] [    T194] amdgpu 0000:08:00.0: amdgpu: register mmio base: 0xFC900000
[    4.901640] [    T194] amdgpu 0000:08:00.0: amdgpu: register mmio size: 524288
[    4.909833] [    T194] amdgpu 0000:08:00.0: amdgpu: detected ip block number 0 <common_v2_0_0> (soc15_common)

I'm going to monitor calls to amdgpu_device_should_use_aspm() to check
whether it is called during the suspend/resume cycle and gives the wrong
answer (i.e. false when ASPM is actually enabled).
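
The suspected sequence can be sketched as a small userspace model. This is
only an illustration of the theory, not the driver code: struct model_gpu,
model_aspm, model_probe(), model_should_use_aspm() and
dgpu_answer_is_wrong() are hypothetical names, and the model assumes that
the discrete GPU has ASPM enabled by the PCIe core while the built-in GPU
does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one GPU; not the real struct amdgpu_device. */
struct model_gpu {
    const char *bdf;          /* PCI address, for readability only */
    bool pcie_aspm_enabled;   /* what the PCIe core set up for this link */
};

/* Module-wide flag modelled on amdgpu_aspm: -1 = auto, 0 = off, 1 = on. */
static int model_aspm = -1;

/*
 * Probe one device. With global_check set, a device without ASPM clears
 * the module-wide flag, which is the behaviour suspected of the
 * re-applied commit.
 */
static void model_probe(const struct model_gpu *gpu, bool global_check)
{
    assert(gpu != NULL);
    if (global_check && model_aspm == -1 && !gpu->pcie_aspm_enabled)
        model_aspm = 0;
}

/* Per-device decision, loosely modelled on amdgpu_device_should_use_aspm(). */
static bool model_should_use_aspm(const struct model_gpu *gpu)
{
    assert(gpu != NULL);
    if (model_aspm == 0)
        return false;
    if (model_aspm == 1)
        return true;
    return gpu->pcie_aspm_enabled;   /* auto: follow the PCIe core */
}

/*
 * Probe the discrete GPU first, then the built-in GPU, and report whether
 * the driver-side answer for the discrete GPU disagrees with the PCIe core.
 */
static bool dgpu_answer_is_wrong(bool global_check)
{
    const struct model_gpu dgpu = { "0000:03:00.0", true };   /* assumed */
    const struct model_gpu igpu = { "0000:08:00.0", false };  /* assumed */

    model_aspm = -1;
    model_probe(&dgpu, global_check);
    model_probe(&igpu, global_check);
    return model_should_use_aspm(&dgpu) != dgpu.pcie_aspm_enabled;
}
```

In this model dgpu_answer_is_wrong(true) returns true, because probing the
built-in GPU flips the shared flag after the discrete GPU has already been
probed, while dgpu_answer_is_wrong(false) returns false. Whether the real
driver hits this path during suspend/resume is what the monitoring is meant
to confirm.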

Bert Karwatzki
