Re: A hotplug bug in AMDGPU
On Mon, 3 May 2021, Alex Deucher wrote:

> On Mon, May 3, 2021 at 11:40 AM Mikulas Patocka wrote:
> >
> > Hi
> >
> > There's a bug with monitor hotplug starting with the kernel 5.7.
> >
> > I have Radeon RX 570. If I boot the system with the monitor unplugged and
> > then plug the monitor via DVI, the kernel 5.6 and below will properly
> > initialize graphics; the kernels 5.7+ will not initialize it - and the
> > monitor reports no signal.
> >
> > I bisected the issue and it is caused by the patch
> > 4fdda2e66de0b7d37aa27af3c1bbe25ecb2d5408 ("drm/amdgpu/runpm: enable runpm
> > on baco capable VI+ asics")
> >
> > When I remove the code that sets adev->runpm on the kernel 5.12, monitor
> > hotplug works correctly.
>
> This isn't really a hotplug bug per se. That patch enabled runtime
> power management which powered down the GPU completely to save power.
> Unfortunately when it's powered down, hotplug interrupts won't work
> because the entire GPU is powered off. Disabling runtime pm will
> allow hotplug interrupts to work, but will cause the GPU to burn a lot
> more power.

I measured it and it saves 15W. Hard to say whether it's worth paying this
for the hotplug capability or not.

I can re-activate the card by logging in and typing "rmmod amdgpu;modprobe
amdgpu". But what should less technically savvy users do?

> I'm not sure what the best solution is. You can manually
> wake the card via sysfs (either via the runtime pm controls in
> /sys/class/drm/card0/device/power or by reading a sensor on the board
> like temperature) then hotplug the monitor or via a direct request to
> probe the displays via the display server.
>
> Alex

Mikulas

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
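The sysfs workaround mentioned in the reply above can be sketched in C. This is a minimal illustration, not the driver's own mechanism: it assumes the runtime PM policy file is the standard `power/control` attribute under the directory Alex names (for card0 that would be /sys/class/drm/card0/device/power/control), and the helper name `set_runtime_pm` is made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write a runtime-PM policy ("on" or "auto") to a sysfs power/control file.
 * Returns 0 on success, -1 on an invalid mode or an I/O error.
 *
 * Assumption: for card0 the file is
 * /sys/class/drm/card0/device/power/control. The path is a parameter
 * because the card index differs between systems.
 */
int set_runtime_pm(const char *control_path, const char *mode)
{
	FILE *f;
	int ok;

	assert(control_path != NULL && mode != NULL);

	/* power/control accepts only these two policies. */
	if (strcmp(mode, "on") != 0 && strcmp(mode, "auto") != 0)
		return -1;

	f = fopen(control_path, "w");
	if (!f)
		return -1;

	ok = fputs(mode, f) >= 0;
	/* fclose() flushes; a failed flush means the write did not land. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

Called as root with the mode "on", this should keep the GPU powered so that hotplug interrupts can arrive, at the cost of the extra power draw measured above; "auto" hands control back to runtime power management.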
A hotplug bug in AMDGPU
Hi

There's a bug with monitor hotplug starting with the kernel 5.7.

I have Radeon RX 570. If I boot the system with the monitor unplugged and
then plug the monitor via DVI, the kernel 5.6 and below will properly
initialize graphics; the kernels 5.7+ will not initialize it - and the
monitor reports no signal.

I bisected the issue and it is caused by the patch
4fdda2e66de0b7d37aa27af3c1bbe25ecb2d5408 ("drm/amdgpu/runpm: enable runpm
on baco capable VI+ asics")

When I remove the code that sets adev->runpm on the kernel 5.12, monitor
hotplug works correctly.

Mikulas

Signed-off-by: Mikulas Patocka

---
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c |    2 --
 1 file changed, 2 deletions(-)

Index: linux-5.12/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
===
--- linux-5.12.orig/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c	2021-04-26 14:50:53.0 +0200
+++ linux-5.12/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c	2021-05-03 16:19:54.0 +0200
@@ -183,8 +183,6 @@ int amdgpu_driver_load_kms(struct amdgpu
 				adev->runpm = true;
 			break;
 		default:
-			/* enable runpm on CI+ */
-			adev->runpm = true;
 			break;
 		}
 		if (adev->runpm)
Re: [PATCH] pci: fix incorrect value returned from pcie_get_speed_cap
On Mon, 26 Nov 2018, Bjorn Helgaas wrote:

> On Mon, Nov 19, 2018 at 06:47:04PM -0600, Bjorn Helgaas wrote:
> > On Tue, Oct 30, 2018 at 12:36:08PM -0400, Mikulas Patocka wrote:
> > > The macros PCI_EXP_LNKCAP_SLS_*GB are values, not bit masks. We must mask
> > > the register and compare it against them.
> > >
> > > This patch fixes "amdgpu: [powerplay] failed to send message 261
> > > ret is 0" errors when a PCIe-v3 card is plugged into a PCIe-v1 slot,
> > > because the slot is being incorrectly reported as PCIe-v3 capable.
> > >
> > > Signed-off-by: Mikulas Patocka
> > > Fixes: 6cf57be0f78e ("PCI: Add pcie_get_speed_cap() to find max supported link speed")
> > > Cc: sta...@vger.kernel.org # v4.17+
> > >
> > > ---
> > >  drivers/pci/pci.c |    8 ++++----
> > >  1 file changed, 4 insertions(+), 4 deletions(-)
> > >
> > > Index: linux-4.19/drivers/pci/pci.c
> > > ===
> > > --- linux-4.19.orig/drivers/pci/pci.c	2018-10-30 16:58:58.0 +0100
> > > +++ linux-4.19/drivers/pci/pci.c	2018-10-30 16:58:58.0 +0100
> > > @@ -5492,13 +5492,13 @@ enum pci_bus_speed pcie_get_speed_cap(st
> > >
> > >  	pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
> > >  	if (lnkcap) {
> > > -		if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
> > > +		if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_16_0GB)
> > >  			return PCIE_SPEED_16_0GT;
> > > -		else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
> > > +		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_8_0GB)
> > >  			return PCIE_SPEED_8_0GT;
> > > -		else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
> > > +		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) ==PCI_EXP_LNKCAP_SLS_5_0GB)
> > >  			return PCIE_SPEED_5_0GT;
> > > -		else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
> > > +		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB)
> > >  			return PCIE_SPEED_2_5GT;
> > >  	}
> >
> > We also need similar fixes in pci_set_bus_speed(), pcie_speeds()
> > (hfi1), cobalt_pcie_status_show(), hba_ioctl_callback(),
> > qla24xx_pci_info_str(), and maybe a couple other places.
>
> Does anybody want to volunteer to fix the places above as well? I
> found them by grepping for PCI_EXP_LNKCAP, and they're all broken in
> ways similar to pcie_get_speed_cap(). Possibly some of these places
> could use pcie_get_speed_cap() directly.
>
> Bjorn
>

They are not broken, they are masking the value with PCI_EXP_LNKCAP_SLS -
that is correct.

pci_set_bus_speed:
	pcie_capability_read_dword(bridge, PCI_EXP_LNKCAP, &linkcap);
	bus->max_bus_speed = pcie_link_speed[linkcap & PCI_EXP_LNKCAP_SLS];

pcie_speeds:
	if ((linkcap & PCI_EXP_LNKCAP_SLS) != PCI_EXP_LNKCAP_SLS_8_0GB)

cobalt_pcie_status_show:
	just prints the values without doing anything with them

hba_ioctl_callback:
	gai->pci.link_speed_max = (u8)(caps & PCI_EXP_LNKCAP_SLS);
	gai->pci.link_width_max = (u8)((caps & PCI_EXP_LNKCAP_MLW) >> 4);

qla24xx_pci_info_str:
	lspeed = lstat & PCI_EXP_LNKCAP_SLS;
	lwidth = (lstat & PCI_EXP_LNKCAP_MLW) >> 4;

Mikulas
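The mask-then-use pattern that the callers listed above follow can be shown in isolation. The following is a userspace sketch, not kernel code: the two mask values are copied from include/uapi/linux/pci_regs.h so that it builds standalone, and the helper names `lnkcap_speed_code` and `lnkcap_width` are invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Field masks of the PCIe Link Capabilities register, as in pci_regs.h. */
#define PCI_EXP_LNKCAP_SLS	0x0000000f	/* Supported Link Speeds */
#define PCI_EXP_LNKCAP_MLW	0x000003f0	/* Maximum Link Width */

/* The width field starts at bit 4 and is 6 bits wide, hence the ">> 4". */
static_assert((PCI_EXP_LNKCAP_MLW >> 4) == 0x3f, "MLW is a 6-bit field");

/* Hypothetical helper: isolate the speed code (1 = 2.5GT/s, 2 = 5GT/s, ...). */
unsigned int lnkcap_speed_code(uint32_t lnkcap)
{
	return lnkcap & PCI_EXP_LNKCAP_SLS;
}

/* Hypothetical helper: isolate the lane count by masking, then shifting. */
unsigned int lnkcap_width(uint32_t lnkcap)
{
	return (lnkcap & PCI_EXP_LNKCAP_MLW) >> 4;
}
```

For an illustrative register value of 0x81, `lnkcap_speed_code()` yields 1 and `lnkcap_width()` yields 8, which would describe a 2.5GT/s, 8-lane port. Because the speed field is masked before use, these callers compare or index with the whole code rather than testing individual bits of it.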
[PATCH] pci: fix incorrect value returned from pcie_get_speed_cap
The macros PCI_EXP_LNKCAP_SLS_*GB are values, not bit masks. We must mask
the register and compare it against them.

This patch fixes "amdgpu: [powerplay] failed to send message 261 ret is 0"
errors when a PCIe-v3 card is plugged into a PCIe-v1 slot, because the slot
is being incorrectly reported as PCIe-v3 capable.

Signed-off-by: Mikulas Patocka
Fixes: 6cf57be0f78e ("PCI: Add pcie_get_speed_cap() to find max supported link speed")
Cc: sta...@vger.kernel.org # v4.17+

---
 drivers/pci/pci.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-4.19/drivers/pci/pci.c
===
--- linux-4.19.orig/drivers/pci/pci.c	2018-10-30 16:58:58.0 +0100
+++ linux-4.19/drivers/pci/pci.c	2018-10-30 16:58:58.0 +0100
@@ -5492,13 +5492,13 @@ enum pci_bus_speed pcie_get_speed_cap(st

 	pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
 	if (lnkcap) {
-		if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
+		if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_16_0GB)
 			return PCIE_SPEED_16_0GT;
-		else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
+		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_8_0GB)
 			return PCIE_SPEED_8_0GT;
-		else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
+		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) ==PCI_EXP_LNKCAP_SLS_5_0GB)
 			return PCIE_SPEED_5_0GT;
-		else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
+		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB)
 			return PCIE_SPEED_2_5GT;
 	}
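To see why testing the speed codes as bit masks misreports a PCIe-v1 slot, the two variants of the check can be compared in a standalone sketch. The constants are copied from include/uapi/linux/pci_regs.h; the function names and the convention of returning the speed in tenths of GT/s are assumptions made for this example, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Supported Link Speeds field and its codes, as in pci_regs.h. */
#define PCI_EXP_LNKCAP_SLS		0x0000000f
#define PCI_EXP_LNKCAP_SLS_2_5GB	0x00000001
#define PCI_EXP_LNKCAP_SLS_5_0GB	0x00000002
#define PCI_EXP_LNKCAP_SLS_8_0GB	0x00000003
#define PCI_EXP_LNKCAP_SLS_16_0GB	0x00000004

/* The codes are consecutive numbers, so 8.0GT shares bits with both lower codes. */
static_assert(PCI_EXP_LNKCAP_SLS_8_0GB ==
	      (PCI_EXP_LNKCAP_SLS_2_5GB | PCI_EXP_LNKCAP_SLS_5_0GB),
	      "8.0GT code overlaps the 2.5GT and 5.0GT codes");

/* Pre-patch logic: treats each code as a bit mask. Returns tenths of GT/s. */
unsigned int speed_buggy(uint32_t lnkcap)
{
	if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
		return 160;
	else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
		return 80;
	else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
		return 50;
	else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
		return 25;
	return 0;
}

/* Patched logic: mask the field once, then compare against each code. */
unsigned int speed_fixed(uint32_t lnkcap)
{
	uint32_t sls = lnkcap & PCI_EXP_LNKCAP_SLS;

	if (sls == PCI_EXP_LNKCAP_SLS_16_0GB)
		return 160;
	else if (sls == PCI_EXP_LNKCAP_SLS_8_0GB)
		return 80;
	else if (sls == PCI_EXP_LNKCAP_SLS_5_0GB)
		return 50;
	else if (sls == PCI_EXP_LNKCAP_SLS_2_5GB)
		return 25;
	return 0;
}
```

With a speed code of 1 (a 2.5GT/s port), `1 & 3` is non-zero, so `speed_buggy()` reports 8GT/s while `speed_fixed()` reports 2.5GT/s. A code of 2 is misreported the same way, since `2 & 3` is also non-zero.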
Re: amdgpu: [powerplay] failed to send message 148 ret is 0
On Mon, 29 Oct 2018, Alex Deucher wrote:

> On Thu, Oct 25, 2018 at 4:46 PM Mikulas Patocka wrote:
> >
> >
> >
> > On Wed, 24 Oct 2018, Mikulas Patocka wrote:
> >
> > > Hi
> > >
> > > I have a Sapphire Pulse RX 570 ITX graphics card.
> > >
> > > On Linux, I get errors "amdgpu: [powerplay] failed to send message 148 ret
> > > is 0" and the system is stuck for several seconds when they happen. The
> > > card works, except for these errors and occasional delays.
> >
> > I've found that PP_PCIE_DPM_MASK causes these errors. If I turn this bit
> > off in amdgpu.ppfeaturemask, there are no more errors. (Turning it
> > off also fixes hibernation problems.)
> >
> > Should it be turned off automatically in response to these errors?
>
> What platform are you running on? Are you running in a VM? The
> driver accesses pci config space on the bridge to determine the pcie
> gen and lane caps of the platform to determine what clocks and lanes
> are valid. See amdgpu_device_get_pcie_info(). It would be good to
> figure out why this is not working on your platform.
>
> Alex

It's not a VM. It's an old motherboard with dual socket F. It has HT2000
north bridge and HT1000 south bridge. It has two PCIe-v1 8-lane slots.

I've found the bug - pcie_get_speed_cap incorrectly tests the lnkcap
variable against values that are not bit-masks, so that the PCIe port is
incorrectly reported as 8GT/s capable. When I fix these tests, the errors
are gone.

Mikulas
Re: amdgpu: [powerplay] failed to send message 148 ret is 0
On Wed, 24 Oct 2018, Mikulas Patocka wrote:

> Hi
>
> I have a Sapphire Pulse RX 570 ITX graphics card.
>
> On Linux, I get errors "amdgpu: [powerplay] failed to send message 148 ret
> is 0" and the system is stuck for several seconds when they happen. The
> card works, except for these errors and occasional delays.

I've found that PP_PCIE_DPM_MASK causes these errors. If I turn this bit
off in amdgpu.ppfeaturemask, there are no more errors. (Turning it off
also fixes hibernation problems.)

Should it be turned off automatically in response to these errors?

Mikulas
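Turning one bit off in amdgpu.ppfeaturemask amounts to clearing it in the current mask and passing the result on the kernel command line. The sketch below makes two assumptions that should be checked against the kernel tree in use: that PP_PCIE_DPM_MASK is 0x4, as in enum PP_FEATURE_MASK in amd_shared.h, and that the caller already knows the current mask (it should be readable from /sys/module/amdgpu/parameters/ppfeaturemask). Both helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: value taken from enum PP_FEATURE_MASK; verify in your tree. */
#define PP_PCIE_DPM_MASK	0x4u

/* Hypothetical helper: return the mask with the PCIe DPM feature bit cleared. */
uint32_t ppfeaturemask_without_pcie_dpm(uint32_t current_mask)
{
	return current_mask & ~PP_PCIE_DPM_MASK;
}

/*
 * Hypothetical helper: format the mask as a kernel command-line option.
 * Returns what snprintf() returns: the length the full string needs.
 */
int format_ppfeaturemask_option(char *buf, size_t len, uint32_t mask)
{
	assert(buf != NULL);
	return snprintf(buf, len, "amdgpu.ppfeaturemask=0x%08x",
			(unsigned int)mask);
}
```

Starting from a mask of 0xffffffff, for example, the result is 0xfffffffb, which formats as amdgpu.ppfeaturemask=0xfffffffb. The default mask differs between kernel versions, so the starting value should come from the running system rather than from this example.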
amdgpu: [powerplay] failed to send message 148 ret is 0
Hi

I have a Sapphire Pulse RX 570 ITX graphics card.

On Linux, I get errors "amdgpu: [powerplay] failed to send message 148 ret
is 0" and the system is stuck for several seconds when they happen. The
card works, except for these errors and occasional delays.

Do you have an idea what could cause these errors or how to debug them?

There's nothing to bisect because all the kernels that I tried (back to
4.9) show these errors. I've also tried a kernel from branch
"origin/amd-staging-drm-next" from amdgpu git, but it has even more of
these errors than 4.18.16.

I tried newer firmware from
git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git,
but it didn't help.

Some users suggest that BIOS upgrade may help with this, but there's no
BIOS for this card on the Sapphire website.

Mikulas

[9.371716] [drm] amdgpu kernel modesetting enabled.
[9.372068] [drm] initializing kernel modesetting (POLARIS10 0x1002:0x67DF 0x1DA2:0xE343 0xEF).
[9.372126] [drm] register mmio base: 0xFF5C
[9.372158] [drm] register mmio size: 262144
[9.372194] [drm] probing mlw for device 1166:132 = 3026c81
[9.372228] [drm] add ip block number 0
[9.372260] [drm] add ip block number 1
[9.372292] [drm] add ip block number 2
[9.372324] [drm] add ip block number 3
[9.372356] [drm] add ip block number 4
[9.372387] [drm] add ip block number 5
[9.372419] [drm] add ip block number 6
[9.372452] [drm] add ip block number 7
[9.372483] [drm] add ip block number 8
[9.372530] [drm] UVD is enabled in VM mode
[9.372561] [drm] UVD ENC is enabled in VM mode
[9.372594] [drm] VCE enabled in VM mode
[9.372807] amdgpu :07:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0x
[9.373681] ATOM BIOS: 113-D00034-L01
[9.373751] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[9.373848] amdgpu :07:00.0: VRAM: 4096M 0x00F4 - 0x00F4 (4096M used)
[9.373894] amdgpu :07:00.0: GTT: 256M 0x - 0x0FFF
[9.373941] [drm] Detected VRAM RAM=4096M, BAR=256M
[9.373974] [drm] RAM width 256bits GDDR5
[9.374090] [TTM] Zone kernel: Available graphics memory: 66051588 kiB
[9.374124] [TTM] Zone dma32: Available graphics memory: 2097152 kiB
[9.374158] [TTM] Initializing pool allocator
[9.374193] [TTM] Initializing DMA pool allocator
[9.374258] [drm] amdgpu: 4096M of VRAM memory ready
[9.374291] [drm] amdgpu: 4096M of GTT memory ready.
[9.374331] [drm] GART: num cpu pages 65536, num gpu pages 65536
[9.374419] [drm] PCIE GART of 256M enabled (table at 0x00F40090).
[9.374616] [drm] Chained IB support enabled!
[9.376667] [drm] Found UVD firmware Version: 1.130 Family ID: 16
[9.379218] [drm] Found VCE firmware Version: 53.26 Binary ID: 3
[9.433581] [drm] DM_PPLIB: values for Engine clock
[9.433618] [drm] DM_PPLIB: 3
[9.433649] [drm] DM_PPLIB: 58800
[9.433679] [drm] DM_PPLIB: 95200
[9.433710] [drm] DM_PPLIB: 104100
[9.433740] [drm] DM_PPLIB: 110600
[9.433771] [drm] DM_PPLIB: 116800
[9.433801] [drm] DM_PPLIB: 120900
[9.433831] [drm] DM_PPLIB: 124400
[9.433862] [drm] DM_PPLIB: Validation clocks:
[9.433894] [drm] DM_PPLIB:engine_max_clock: 124400
[9.433926] [drm] DM_PPLIB:memory_max_clock: 15
[9.433958] [drm] DM_PPLIB:level : 8
[9.433990] [drm] DM_PPLIB: values for Memory clock
[9.434026] [drm] DM_PPLIB: 3
[9.434056] [drm] DM_PPLIB: 10
[9.434087] [drm] DM_PPLIB: 15
[9.434117] [drm] DM_PPLIB: Validation clocks:
[9.434148] [drm] DM_PPLIB:engine_max_clock: 124400
[9.434180] [drm] DM_PPLIB:memory_max_clock: 15
[9.434212] [drm] DM_PPLIB:level : 8
[9.434662] [drm] Display Core initialized with v3.1.44!
[9.447631] [drm] SADs count is: -2, don't need to read it
[9.447676] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[9.447710] [drm] Driver supports precise vblank timestamp query.
[9.471140] random: crng init done
[9.471202] random: 7 urandom warning(s) missed due to ratelimiting
[9.496908] [drm] UVD and UVD ENC initialized successfully.
[9.607867] [drm] VCE initialized successfully.
[9.609791] [drm] fb mappable at 0xC0E28000
[9.609825] [drm] vram apper at 0xC000
[9.609856] [drm] size 8294400
[9.609887] [drm] fb depth is 24
[9.609917] [drm]pitch is 7680
[9.610027] fbcon: amdgpudrmfb (fb0) is primary device
[9.650493] Console: switching to colour frame buffer device 240x67
[9.667224] amdgpu :07:00.0: fb0: amdgpudrmfb frame buffer device
[ 10.083684] amdgpu: [powerplay] failed to send message 148 ret is 0
[ 10.904841] amdgpu: [powerplay] last message was failed ret is 0
[ 11.315428] amdgpu: [powerplay]