Re: A hotplug bug in AMDGPU

2021-05-05 Thread Mikulas Patocka



On Mon, 3 May 2021, Alex Deucher wrote:

> On Mon, May 3, 2021 at 11:40 AM Mikulas Patocka  wrote:
> >
> > Hi
> >
> > There's a bug with monitor hotplug starting with the kernel 5.7.
> >
> > I have a Radeon RX 570. If I boot the system with the monitor unplugged
> > and then plug the monitor in via DVI, kernels 5.6 and below properly
> > initialize graphics; kernels 5.7+ do not initialize it - and the
> > monitor reports no signal.
> >
> > I bisected the issue and it is caused by the patch
> > 4fdda2e66de0b7d37aa27af3c1bbe25ecb2d5408 ("drm/amdgpu/runpm: enable runpm
> > on baco capable VI+ asics")
> >
> > When I remove the code that sets adev->runpm on the kernel 5.12, monitor
> > hotplug works correctly.
> 
> This isn't really a hotplug bug per se.  That patch enabled runtime
> power management which powered down the GPU completely to save power.
> Unfortunately when it's powered down, hotplug interrupts won't work
> because the entire GPU is powered off.  Disabling runtime pm will
> allow hotplug interrupts to work, but will cause the GPU to burn a lot
> more power.

I measured it and it saves 15W. It's hard to say whether the hotplug
capability is worth paying that much power for or not.

I can re-activate the card by logging in and typing "rmmod amdgpu;modprobe 
amdgpu". But what should less technically savvy users do?

> I'm not sure what the best solution is.  You can manually
> wake the card via sysfs (either via the runtime pm controls in
> /sys/class/drm/card0/device/power or by reading a sensor on the board
> like temperature) and then hotplug the monitor, or wake it via a direct
> request to probe the displays via the display server.
> 
> Alex
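
For reference, the sysfs route Alex describes could be scripted roughly as
below. This is only a sketch under assumptions: the helper name
write_power_control() is made up for illustration, and the file name
"control" under the power directory (accepting "on" and "auto") is the
generic runtime PM attribute from the kernel's sysfs ABI rather than
something stated in this thread. The card index (card0) may differ on
other systems.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write a runtime PM policy ("on" or "auto") to a
 * sysfs power/control file. "on" keeps the device powered; "auto" lets
 * it runtime-suspend again.
 * Returns 0 on success, -1 on a bad mode or an I/O error.
 */
static int write_power_control(const char *path, const char *mode)
{
	FILE *f;
	int ok;

	if (strcmp(mode, "on") != 0 && strcmp(mode, "auto") != 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fprintf(f, "%s\n", mode) > 0;
	/* The buffered write is only flushed to the file on fclose(). */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

Called as root with "/sys/class/drm/card0/device/power/control" and "on",
this should keep the GPU awake so that hotplug interrupts are delivered;
writing "auto" afterwards restores the power saving.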

Mikulas

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


A hotplug bug in AMDGPU

2021-05-03 Thread Mikulas Patocka
Hi

There's a bug with monitor hotplug starting with the kernel 5.7.

I have a Radeon RX 570. If I boot the system with the monitor unplugged
and then plug the monitor in via DVI, kernels 5.6 and below properly
initialize graphics; kernels 5.7+ do not initialize it - and the
monitor reports no signal.

I bisected the issue and it is caused by the patch 
4fdda2e66de0b7d37aa27af3c1bbe25ecb2d5408 ("drm/amdgpu/runpm: enable runpm 
on baco capable VI+ asics")

When I remove the code that sets adev->runpm on the kernel 5.12, monitor 
hotplug works correctly.

Mikulas


Signed-off-by: Mikulas Patocka 

---
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c |2 --
 1 file changed, 2 deletions(-)

Index: linux-5.12/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
===
--- linux-5.12.orig/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 2021-04-26 14:50:53.0 +0200
+++ linux-5.12/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c  2021-05-03 16:19:54.0 +0200
@@ -183,8 +183,6 @@ int amdgpu_driver_load_kms(struct amdgpu
 				adev->runpm = true;
 			break;
 		default:
-			/* enable runpm on CI+ */
-			adev->runpm = true;
 			break;
 		}
 		if (adev->runpm)



Re: [PATCH] pci: fix incorrect value returned from pcie_get_speed_cap

2018-11-27 Thread Mikulas Patocka


On Mon, 26 Nov 2018, Bjorn Helgaas wrote:

> On Mon, Nov 19, 2018 at 06:47:04PM -0600, Bjorn Helgaas wrote:
> > On Tue, Oct 30, 2018 at 12:36:08PM -0400, Mikulas Patocka wrote:
> > > The macros PCI_EXP_LNKCAP_SLS_*GB are values, not bit masks. We must mask
> > > the register and compare it against them.
> > > 
> > > > This patch fixes "amdgpu: [powerplay] failed to send message 261
> > > > ret is 0" errors when a PCIe-v3 card is plugged into a PCIe-v1 slot,
> > > > because the slot is being incorrectly reported as PCIe-v3 capable.
> > > > 
> > > > Signed-off-by: Mikulas Patocka 
> > > > Fixes: 6cf57be0f78e ("PCI: Add pcie_get_speed_cap() to find max supported link speed")
> > > Cc: sta...@vger.kernel.org# v4.17+
> > > 
> > > ---
> > > >  drivers/pci/pci.c |    8 ++++----
> > >  1 file changed, 4 insertions(+), 4 deletions(-)
> > > 
> > > Index: linux-4.19/drivers/pci/pci.c
> > > ===
> > > > --- linux-4.19.orig/drivers/pci/pci.c 2018-10-30 16:58:58.0 +0100
> > > > +++ linux-4.19/drivers/pci/pci.c  2018-10-30 16:58:58.0 +0100
> > > @@ -5492,13 +5492,13 @@ enum pci_bus_speed pcie_get_speed_cap(st
> > >  
> > > >   pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
> > > >   if (lnkcap) {
> > > > - if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
> > > > + if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_16_0GB)
> > > >   return PCIE_SPEED_16_0GT;
> > > > - else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
> > > > + else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_8_0GB)
> > > >   return PCIE_SPEED_8_0GT;
> > > > - else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
> > > > + else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_5_0GB)
> > > >   return PCIE_SPEED_5_0GT;
> > > > - else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
> > > > + else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB)
> > > >   return PCIE_SPEED_2_5GT;
> > >   }
> 
> > We also need similar fixes in pci_set_bus_speed(), pcie_speeds()
> > (hfi1), cobalt_pcie_status_show(), hba_ioctl_callback(),
> > qla24xx_pci_info_str(), and maybe a couple other places.
> 
> Does anybody want to volunteer to fix the places above as well?  I
> found them by grepping for PCI_EXP_LNKCAP, and they're all broken in
> ways similar to pcie_get_speed_cap().  Possibly some of these places
> could use pcie_get_speed_cap() directly.
> 
> Bjorn
> 

They are not broken; they mask the value with PCI_EXP_LNKCAP_SLS, which
is correct.

pci_set_bus_speed:
pcie_capability_read_dword(bridge, PCI_EXP_LNKCAP, &linkcap);
bus->max_bus_speed = pcie_link_speed[linkcap & PCI_EXP_LNKCAP_SLS];

pcie_speeds:
if ((linkcap & PCI_EXP_LNKCAP_SLS) != PCI_EXP_LNKCAP_SLS_8_0GB)

cobalt_pcie_status_show:
just prints the values without doing anything with them

hba_ioctl_callback:
gai->pci.link_speed_max = (u8)(caps & PCI_EXP_LNKCAP_SLS);

gai->pci.link_width_max = (u8)((caps & PCI_EXP_LNKCAP_MLW) >> 4);

qla24xx_pci_info_str:
lspeed = lstat & PCI_EXP_LNKCAP_SLS;
lwidth = (lstat & PCI_EXP_LNKCAP_MLW) >> 4;
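
The pattern shared by all of these call sites (mask the field, and shift
it down if it does not start at bit 0) can be tried in user space. The
sketch below is illustrative only: the two mask values are copied from
include/uapi/linux/pci_regs.h, and the helper names lnkcap_speed_code()
and lnkcap_width() are made up for this example.

```c
#include <stdint.h>

/* Field masks of the PCIe Link Capabilities register (pci_regs.h). */
#define PCI_EXP_LNKCAP_SLS 0x0000000f /* Supported Link Speeds (encoded) */
#define PCI_EXP_LNKCAP_MLW 0x000003f0 /* Maximum Link Width */

/* Encoded speed: 1 = 2.5 GT/s, 2 = 5 GT/s, 3 = 8 GT/s, 4 = 16 GT/s. */
static unsigned int lnkcap_speed_code(uint32_t lnkcap)
{
	return lnkcap & PCI_EXP_LNKCAP_SLS;
}

/* Lane count: mask the width field, then shift it down to bit 0. */
static unsigned int lnkcap_width(uint32_t lnkcap)
{
	return (lnkcap & PCI_EXP_LNKCAP_MLW) >> 4;
}
```

Assuming the value 0x3026c81 printed by the "probing mlw" line in the
dmesg log of the original report is a raw Link Capabilities dword, these
helpers give speed code 1 (2.5 GT/s) and a width of 8 lanes.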

Mikulas


[PATCH] pci: fix incorrect value returned from pcie_get_speed_cap

2018-10-30 Thread Mikulas Patocka
The macros PCI_EXP_LNKCAP_SLS_*GB are values, not bit masks. We must mask
the register and compare it against them.

This patch fixes "amdgpu: [powerplay] failed to send message 261
ret is 0" errors when a PCIe-v3 card is plugged into a PCIe-v1 slot,
because the slot is being incorrectly reported as PCIe-v3 capable.

Signed-off-by: Mikulas Patocka 
Fixes: 6cf57be0f78e ("PCI: Add pcie_get_speed_cap() to find max supported link speed")
Cc: sta...@vger.kernel.org  # v4.17+

---
 drivers/pci/pci.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-4.19/drivers/pci/pci.c
===
--- linux-4.19.orig/drivers/pci/pci.c   2018-10-30 16:58:58.0 +0100
+++ linux-4.19/drivers/pci/pci.c        2018-10-30 16:58:58.0 +0100
@@ -5492,13 +5492,13 @@ enum pci_bus_speed pcie_get_speed_cap(st
 
 	pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
 	if (lnkcap) {
-		if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
+		if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_16_0GB)
 			return PCIE_SPEED_16_0GT;
-		else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
+		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_8_0GB)
 			return PCIE_SPEED_8_0GT;
-		else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
+		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_5_0GB)
 			return PCIE_SPEED_5_0GT;
-		else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
+		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB)
 			return PCIE_SPEED_2_5GT;
 	}
 


Re: amdgpu: [powerplay] failed to send message 148 ret is 0

2018-10-30 Thread Mikulas Patocka


On Mon, 29 Oct 2018, Alex Deucher wrote:

> On Thu, Oct 25, 2018 at 4:46 PM Mikulas Patocka  wrote:
> >
> >
> >
> > On Wed, 24 Oct 2018, Mikulas Patocka wrote:
> >
> > > Hi
> > >
> > > I have a Sapphire Pulse RX 570 ITX graphics card.
> > >
> > > On Linux, I get errors "amdgpu: [powerplay] failed to send message 148 ret
> > > is 0" and the system is stuck for several seconds when they happen. The
> > > card works, except for these errors and occasional delays.
> >
> > I've found that PP_PCIE_DPM_MASK causes these errors. If I turn this bit
> > off in amdgpu.ppfeaturemask, there are no more errors. (and turning it
> > off also fixes hibernation problems)
> >
> > Should it be turned off automatically in response to these errors?
> 
> What platform are you running on?  Are you running in a VM?  The
> driver accesses pci config space on the bridge to determine the pcie
> gen and lane caps of the platform to determine what clocks and lanes
> are valid.  See amdgpu_device_get_pcie_info().  It would be good to
> figure out why this is not working on your platform.
> 
> Alex

It's not a VM. It's an old motherboard with dual socket F. It has HT2000 
north bridge and HT1000 south bridge. It has two PCIe-v1 8-lane slots.

I've found the bug - pcie_get_speed_cap incorrectly tests the lnkcap
variable against values that are not bit-masks, so that the PCIe port is
incorrectly reported as 8 GT/s capable. When I fix these tests, the errors
are gone.
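
To illustrate why the old tests misreport the port, here is a small
user-space sketch that is not kernel code. The macro values are copied
from include/uapi/linux/pci_regs.h; the function names speed_buggy() and
speed_fixed(), and the convention of returning the speed in tenths of
GT/s, are made up for this example.

```c
#include <stdint.h>

#define PCI_EXP_LNKCAP_SLS        0x0000000f /* field mask */
#define PCI_EXP_LNKCAP_SLS_2_5GB  0x00000001 /* encoded values, not bits */
#define PCI_EXP_LNKCAP_SLS_5_0GB  0x00000002
#define PCI_EXP_LNKCAP_SLS_8_0GB  0x00000003
#define PCI_EXP_LNKCAP_SLS_16_0GB 0x00000004

/* Old logic: treats each encoded value as if it were a bit mask. */
static int speed_buggy(uint32_t lnkcap)
{
	if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
		return 160;
	else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
		return 80;
	else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
		return 50;
	else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
		return 25;
	return 0;
}

/* Fixed logic: isolate the field, then compare for equality. */
static int speed_fixed(uint32_t lnkcap)
{
	uint32_t sls = lnkcap & PCI_EXP_LNKCAP_SLS;

	if (sls == PCI_EXP_LNKCAP_SLS_16_0GB)
		return 160;
	else if (sls == PCI_EXP_LNKCAP_SLS_8_0GB)
		return 80;
	else if (sls == PCI_EXP_LNKCAP_SLS_5_0GB)
		return 50;
	else if (sls == PCI_EXP_LNKCAP_SLS_2_5GB)
		return 25;
	return 0;
}
```

A PCIe-v1 port has the speed field set to 1. Because 1 & 3 is non-zero,
speed_buggy() returns 80 (8 GT/s) for it, while speed_fixed() returns 25
(2.5 GT/s). The same holds for 0x3026c81, the value from the "probing mlw"
line in the dmesg log of the original report, assuming that value is a raw
Link Capabilities dword.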

Mikulas


Re: amdgpu: [powerplay] failed to send message 148 ret is 0

2018-10-25 Thread Mikulas Patocka


On Wed, 24 Oct 2018, Mikulas Patocka wrote:

> Hi
> 
> I have a Sapphire Pulse RX 570 ITX graphics card.
> 
> On Linux, I get errors "amdgpu: [powerplay] failed to send message 148 ret 
> is 0" and the system is stuck for several seconds when they happen. The 
> card works, except for these errors and occasional delays.

I've found that PP_PCIE_DPM_MASK causes these errors. If I turn this bit
off in amdgpu.ppfeaturemask, there are no more errors. (and turning it
off also fixes hibernation problems)
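
For anyone wanting to try the same thing, the new mask can be computed as
in the sketch below. The value 0x4 for PP_PCIE_DPM_MASK is what enum
PP_FEATURE_MASK in amd_shared.h defines to the best of my knowledge, so
please check it against your own tree. The helper name
clear_pcie_dpm() is made up for this example, and the starting mask is
whatever default your kernel uses.

```c
#include <stdint.h>

/* Assumed to match PP_PCIE_DPM_MASK in amd_shared.h; verify in your tree. */
#define PP_PCIE_DPM_MASK 0x4u

/* Return the feature mask with only the PCIe DPM bit cleared. */
static uint32_t clear_pcie_dpm(uint32_t ppfeaturemask)
{
	return ppfeaturemask & ~PP_PCIE_DPM_MASK;
}
```

With a starting mask of 0xffffffff (an assumption, not necessarily the
driver default) this gives 0xfffffffb, which would then be passed on the
kernel command line as amdgpu.ppfeaturemask=0xfffffffb.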

Should it be turned off automatically in response to these errors?

Mikulas


amdgpu: [powerplay] failed to send message 148 ret is 0

2018-10-24 Thread Mikulas Patocka
Hi

I have a Sapphire Pulse RX 570 ITX graphics card.

On Linux, I get errors "amdgpu: [powerplay] failed to send message 148 ret 
is 0" and the system is stuck for several seconds when they happen. The 
card works, except for these errors and occasional delays.

Do you have an idea what could cause these errors or how to debug them?

There's nothing to bisect because all the kernels that I tried (back to 
4.9) show these errors. I've also tried a kernel from branch 
"origin/amd-staging-drm-next" from amdgpu git, but it has even more of 
these errors than 4.18.16.

I tried newer firmware from 
git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git, 
but it didn't help.

Some users suggest that BIOS upgrade may help with this, but there's no 
BIOS for this card on the Sapphire website.

Mikulas


[9.371716] [drm] amdgpu kernel modesetting enabled.
[9.372068] [drm] initializing kernel modesetting (POLARIS10 0x1002:0x67DF 0x1DA2:0xE343 0xEF).
[9.372126] [drm] register mmio base: 0xFF5C
[9.372158] [drm] register mmio size: 262144
[9.372194] [drm] probing mlw for device 1166:132 = 3026c81
[9.372228] [drm] add ip block number 0 
[9.372260] [drm] add ip block number 1 
[9.372292] [drm] add ip block number 2 
[9.372324] [drm] add ip block number 3 
[9.372356] [drm] add ip block number 4 
[9.372387] [drm] add ip block number 5 
[9.372419] [drm] add ip block number 6 
[9.372452] [drm] add ip block number 7 
[9.372483] [drm] add ip block number 8 
[9.372530] [drm] UVD is enabled in VM mode
[9.372561] [drm] UVD ENC is enabled in VM mode
[9.372594] [drm] VCE enabled in VM mode
[9.372807] amdgpu :07:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0x
[9.373681] ATOM BIOS: 113-D00034-L01
[9.373751] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[9.373848] amdgpu :07:00.0: VRAM: 4096M 0x00F4 - 0x00F4 (4096M used)
[9.373894] amdgpu :07:00.0: GTT: 256M 0x - 0x0FFF
[9.373941] [drm] Detected VRAM RAM=4096M, BAR=256M
[9.373974] [drm] RAM width 256bits GDDR5
[9.374090] [TTM] Zone  kernel: Available graphics memory: 66051588 kiB
[9.374124] [TTM] Zone   dma32: Available graphics memory: 2097152 kiB
[9.374158] [TTM] Initializing pool allocator
[9.374193] [TTM] Initializing DMA pool allocator
[9.374258] [drm] amdgpu: 4096M of VRAM memory ready
[9.374291] [drm] amdgpu: 4096M of GTT memory ready.
[9.374331] [drm] GART: num cpu pages 65536, num gpu pages 65536
[9.374419] [drm] PCIE GART of 256M enabled (table at 0x00F40090).
[9.374616] [drm] Chained IB support enabled!
[9.376667] [drm] Found UVD firmware Version: 1.130 Family ID: 16
[9.379218] [drm] Found VCE firmware Version: 53.26 Binary ID: 3
[9.433581] [drm] DM_PPLIB: values for Engine clock
[9.433618] [drm] DM_PPLIB:   3
[9.433649] [drm] DM_PPLIB:   58800
[9.433679] [drm] DM_PPLIB:   95200
[9.433710] [drm] DM_PPLIB:   104100
[9.433740] [drm] DM_PPLIB:   110600
[9.433771] [drm] DM_PPLIB:   116800
[9.433801] [drm] DM_PPLIB:   120900
[9.433831] [drm] DM_PPLIB:   124400
[9.433862] [drm] DM_PPLIB: Validation clocks:
[9.433894] [drm] DM_PPLIB:engine_max_clock: 124400
[9.433926] [drm] DM_PPLIB:memory_max_clock: 15
[9.433958] [drm] DM_PPLIB:level   : 8
[9.433990] [drm] DM_PPLIB: values for Memory clock
[9.434026] [drm] DM_PPLIB:   3
[9.434056] [drm] DM_PPLIB:   10
[9.434087] [drm] DM_PPLIB:   15
[9.434117] [drm] DM_PPLIB: Validation clocks:
[9.434148] [drm] DM_PPLIB:engine_max_clock: 124400
[9.434180] [drm] DM_PPLIB:memory_max_clock: 15
[9.434212] [drm] DM_PPLIB:level   : 8
[9.434662] [drm] Display Core initialized with v3.1.44!
[9.447631] [drm] SADs count is: -2, don't need to read it
[9.447676] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[9.447710] [drm] Driver supports precise vblank timestamp query.
[9.471140] random: crng init done
[9.471202] random: 7 urandom warning(s) missed due to ratelimiting
[9.496908] [drm] UVD and UVD ENC initialized successfully.
[9.607867] [drm] VCE initialized successfully.
[9.609791] [drm] fb mappable at 0xC0E28000
[9.609825] [drm] vram apper at 0xC000
[9.609856] [drm] size 8294400
[9.609887] [drm] fb depth is 24
[9.609917] [drm]pitch is 7680
[9.610027] fbcon: amdgpudrmfb (fb0) is primary device
[9.650493] Console: switching to colour frame buffer device 240x67
[9.667224] amdgpu :07:00.0: fb0: amdgpudrmfb frame buffer device
[   10.083684] amdgpu: [powerplay]
failed to send message 148 ret is 0
[   10.904841] amdgpu: [powerplay]
last message was failed ret is 0
[   11.315428] amdgpu: [powerplay]