Re: amdgpu: [powerplay] failed to send message 148 ret is 0

2018-10-30 Thread Alex Deucher
Nice work.  Thanks for tracking this down!

Alex
On Tue, Oct 30, 2018 at 12:32 PM Mikulas Patocka  wrote:
>
>
>
> On Mon, 29 Oct 2018, Alex Deucher wrote:
>
> > On Thu, Oct 25, 2018 at 4:46 PM Mikulas Patocka  wrote:
> > >
> > >
> > >
> > > On Wed, 24 Oct 2018, Mikulas Patocka wrote:
> > >
> > > > Hi
> > > >
> > > > I have a Sapphire Pulse RX 570 ITX graphics card.
> > > >
> > > > On Linux, I get errors "amdgpu: [powerplay] failed to send message 148 ret
> > > > is 0" and the system is stuck for several seconds when they happen. The
> > > > card works, except for these errors and occasional delays.
> > >
> > > I've found that PP_PCIE_DPM_MASK causes these errors. If I turn this bit
> > > off in amdgpu.ppfeaturemask, the errors no longer occur (and turning it
> > > off also fixes hibernation problems).
> > >
> > > Should it be turned off automatically in response to these errors?
> >
> > What platform are you running on?  Are you running in a VM?  The
> > driver accesses pci config space on the bridge to determine the pcie
> > gen and lane caps of the platform to determine what clocks and lanes
> > are valid.  See amdgpu_device_get_pcie_info().  It would be good to
> > figure out why this is not working on your platform.
> >
> > Alex
>
> It's not a VM. It's an old dual Socket F motherboard. It has an HT2000
> north bridge and an HT1000 south bridge. It has two PCIe-v1 8-lane slots.
>
> I've found the bug - pcie_get_speed_cap() incorrectly tests the lnkcap
> variable against values that are not bit-masks, so that the PCIe port is
> incorrectly reported as 8 GT/s capable. When I fix these tests, the errors
> are gone.
>
> Mikulas
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: amdgpu: [powerplay] failed to send message 148 ret is 0

2018-10-30 Thread Mikulas Patocka


On Mon, 29 Oct 2018, Alex Deucher wrote:

> On Thu, Oct 25, 2018 at 4:46 PM Mikulas Patocka  wrote:
> >
> >
> >
> > On Wed, 24 Oct 2018, Mikulas Patocka wrote:
> >
> > > Hi
> > >
> > > I have a Sapphire Pulse RX 570 ITX graphics card.
> > >
> > > On Linux, I get errors "amdgpu: [powerplay] failed to send message 148 ret
> > > is 0" and the system is stuck for several seconds when they happen. The
> > > card works, except for these errors and occasional delays.
> >
> > I've found that PP_PCIE_DPM_MASK causes these errors. If I turn this bit
> > off in amdgpu.ppfeaturemask, the errors no longer occur (and turning it
> > off also fixes hibernation problems).
> >
> > Should it be turned off automatically in response to these errors?
> 
> What platform are you running on?  Are you running in a VM?  The
> driver accesses pci config space on the bridge to determine the pcie
> gen and lane caps of the platform to determine what clocks and lanes
> are valid.  See amdgpu_device_get_pcie_info().  It would be good to
> figure out why this is not working on your platform.
> 
> Alex

It's not a VM. It's an old dual Socket F motherboard. It has an HT2000 
north bridge and an HT1000 south bridge. It has two PCIe-v1 8-lane slots.

I've found the bug - pcie_get_speed_cap() incorrectly tests the lnkcap 
variable against values that are not bit-masks, so that the PCIe port is 
incorrectly reported as 8 GT/s capable. When I fix these tests, the errors 
are gone.
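
To illustrate the class of bug described above, here is a minimal userspace
sketch. It is not the kernel's code and not the actual patch. The constant
values mirror the Link Capabilities definitions in
include/uapi/linux/pci_regs.h; the two helper names are hypothetical. The
point is that the Supported Link Speeds field is a small enumerated value
(1 = 2.5 GT/s, 2 = 5 GT/s, 3 = 8 GT/s), so a bitwise AND against the
"8 GT/s" constant also matches a port that only reports 2.5 GT/s.

```c
#include <stdint.h>

/* Values as defined in include/uapi/linux/pci_regs.h, copied here so the
 * sketch builds outside the kernel. */
#define PCI_EXP_LNKCAP_SLS        0x0000000fu /* Supported Link Speeds field */
#define PCI_EXP_LNKCAP_SLS_2_5GB  0x00000001u /* field value for 2.5 GT/s */
#define PCI_EXP_LNKCAP_SLS_5_0GB  0x00000002u /* field value for 5 GT/s */
#define PCI_EXP_LNKCAP_SLS_8_0GB  0x00000003u /* field value for 8 GT/s */

/* Hypothetical helper showing the faulty pattern: treats the enumerated
 * value 3 as if it were a bit-mask, so any lnkcap with bit 0 or bit 1 set
 * (including the 2.5 GT/s encoding, 1) is reported as 8 GT/s capable. */
static int is_8gt_bitwise(uint32_t lnkcap)
{
    return (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB) != 0;
}

/* Hypothetical helper showing a field comparison: isolate the speed field
 * first, then compare it for equality with the 8 GT/s encoding. */
static int is_8gt_field(uint32_t lnkcap)
{
    return (lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_8_0GB;
}
```

With a speed field of 1 (a 2.5 GT/s port), is_8gt_bitwise() returns true
while is_8gt_field() returns false.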

Mikulas


Re: amdgpu: [powerplay] failed to send message 148 ret is 0

2018-10-29 Thread Alex Deucher
On Thu, Oct 25, 2018 at 4:46 PM Mikulas Patocka  wrote:
>
>
>
> On Wed, 24 Oct 2018, Mikulas Patocka wrote:
>
> > Hi
> >
> > I have a Sapphire Pulse RX 570 ITX graphics card.
> >
> > On Linux, I get errors "amdgpu: [powerplay] failed to send message 148 ret
> > is 0" and the system is stuck for several seconds when they happen. The
> > card works, except for these errors and occasional delays.
>
> I've found that PP_PCIE_DPM_MASK causes these errors. If I turn this bit
> off in amdgpu.ppfeaturemask, the errors no longer occur (and turning it
> off also fixes hibernation problems).
>
> Should it be turned off automatically in response to these errors?

What platform are you running on?  Are you running in a VM?  The
driver accesses pci config space on the bridge to determine the pcie
gen and lane caps of the platform to determine what clocks and lanes
are valid.  See amdgpu_device_get_pcie_info().  It would be good to
figure out why this is not working on your platform.
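
As a rough illustration of what "gen and lane caps" means at the register
level, the sketch below splits a PCIe Link Capabilities value into its
speed and width fields. The two masks mirror include/uapi/linux/pci_regs.h;
the struct and function names are hypothetical, and this is not the code in
amdgpu_device_get_pcie_info(). The sample value 0x3026c81 is taken from the
"probing mlw for device 1166:132 = 3026c81" line in the log of the original
report, on the assumption that this line prints the bridge's raw Link
Capabilities register.

```c
#include <stdint.h>

/* Field masks as defined in include/uapi/linux/pci_regs.h. */
#define PCI_EXP_LNKCAP_SLS 0x0000000fu /* Supported Link Speeds, bits 3:0 */
#define PCI_EXP_LNKCAP_MLW 0x000003f0u /* Maximum Link Width, bits 9:4 */

/* Hypothetical container for the two decoded fields. */
struct link_caps {
    unsigned speed_code; /* 1 = 2.5 GT/s, 2 = 5 GT/s, 3 = 8 GT/s */
    unsigned width;      /* number of lanes */
};

/* Hypothetical helper: extract the speed and width fields from lnkcap. */
static struct link_caps decode_lnkcap(uint32_t lnkcap)
{
    struct link_caps caps;

    caps.speed_code = lnkcap & PCI_EXP_LNKCAP_SLS;
    caps.width = (lnkcap & PCI_EXP_LNKCAP_MLW) >> 4;
    return caps;
}
```

Under that assumption, decode_lnkcap(0x3026c81) yields speed code 1
(2.5 GT/s) and a width of 8 lanes.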

Alex


Re: amdgpu: [powerplay] failed to send message 148 ret is 0

2018-10-25 Thread Mikulas Patocka


On Wed, 24 Oct 2018, Mikulas Patocka wrote:

> Hi
> 
> I have a Sapphire Pulse RX 570 ITX graphics card.
> 
> On Linux, I get errors "amdgpu: [powerplay] failed to send message 148 ret 
> is 0" and the system is stuck for several seconds when they happen. The 
> card works, except for these errors and occasional delays.

I've found that PP_PCIE_DPM_MASK causes these errors. If I turn this bit 
off in amdgpu.ppfeaturemask, the errors no longer occur (and turning it 
off also fixes hibernation problems).
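
For readers who want to try the same workaround, a minimal sketch of the
bit arithmetic follows. PP_PCIE_DPM_MASK is 0x4 in the kernel's
enum PP_FEATURE_MASK (drivers/gpu/drm/amd/include/amd_shared.h); the helper
name is hypothetical, and the starting mask is whatever the running kernel
uses, since the default differs between kernel versions.

```c
#include <stdint.h>

/* Bit value of PP_PCIE_DPM_MASK from the kernel's enum PP_FEATURE_MASK,
 * copied here so the sketch builds in userspace. */
#define PP_PCIE_DPM_MASK 0x4u

/* Hypothetical helper: return the given powerplay feature mask with the
 * PCIe DPM bit cleared and every other bit left unchanged. */
static uint32_t ppfeaturemask_without_pcie_dpm(uint32_t current_mask)
{
    return current_mask & ~PP_PCIE_DPM_MASK;
}
```

The current mask can typically be read from
/sys/module/amdgpu/parameters/ppfeaturemask, and the result passed back as
amdgpu.ppfeaturemask=<value> on the kernel command line. For example, a
mask of 0xffffffff becomes 0xfffffffb.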

Should it be turned off automatically in response to these errors?

Mikulas


amdgpu: [powerplay] failed to send message 148 ret is 0

2018-10-24 Thread Mikulas Patocka
Hi

I have a Sapphire Pulse RX 570 ITX graphics card.

On Linux, I get errors "amdgpu: [powerplay] failed to send message 148 ret 
is 0" and the system is stuck for several seconds when they happen. The 
card works, except for these errors and occasional delays.

Do you have an idea what could cause these errors or how to debug them?

There's nothing to bisect because all the kernels that I tried (back to 
4.9) show these errors. I've also tried a kernel from branch 
"origin/amd-staging-drm-next" from amdgpu git, but it has even more of 
these errors than 4.18.16.

I tried newer firmware from 
git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git, 
but it didn't help.

Some users suggest that a BIOS upgrade may help with this, but there's no 
BIOS update for this card on the Sapphire website.

Mikulas


[9.371716] [drm] amdgpu kernel modesetting enabled.
[9.372068] [drm] initializing kernel modesetting (POLARIS10 0x1002:0x67DF 0x1DA2:0xE343 0xEF).
[9.372126] [drm] register mmio base: 0xFF5C
[9.372158] [drm] register mmio size: 262144
[9.372194] [drm] probing mlw for device 1166:132 = 3026c81
[9.372228] [drm] add ip block number 0 
[9.372260] [drm] add ip block number 1 
[9.372292] [drm] add ip block number 2 
[9.372324] [drm] add ip block number 3 
[9.372356] [drm] add ip block number 4 
[9.372387] [drm] add ip block number 5 
[9.372419] [drm] add ip block number 6 
[9.372452] [drm] add ip block number 7 
[9.372483] [drm] add ip block number 8 
[9.372530] [drm] UVD is enabled in VM mode
[9.372561] [drm] UVD ENC is enabled in VM mode
[9.372594] [drm] VCE enabled in VM mode
[9.372807] amdgpu :07:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0x
[9.373681] ATOM BIOS: 113-D00034-L01
[9.373751] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[9.373848] amdgpu :07:00.0: VRAM: 4096M 0x00F4 - 0x00F4 (4096M used)
[9.373894] amdgpu :07:00.0: GTT: 256M 0x - 0x0FFF
[9.373941] [drm] Detected VRAM RAM=4096M, BAR=256M
[9.373974] [drm] RAM width 256bits GDDR5
[9.374090] [TTM] Zone  kernel: Available graphics memory: 66051588 kiB
[9.374124] [TTM] Zone   dma32: Available graphics memory: 2097152 kiB
[9.374158] [TTM] Initializing pool allocator
[9.374193] [TTM] Initializing DMA pool allocator
[9.374258] [drm] amdgpu: 4096M of VRAM memory ready
[9.374291] [drm] amdgpu: 4096M of GTT memory ready.
[9.374331] [drm] GART: num cpu pages 65536, num gpu pages 65536
[9.374419] [drm] PCIE GART of 256M enabled (table at 0x00F40090).
[9.374616] [drm] Chained IB support enabled!
[9.376667] [drm] Found UVD firmware Version: 1.130 Family ID: 16
[9.379218] [drm] Found VCE firmware Version: 53.26 Binary ID: 3
[9.433581] [drm] DM_PPLIB: values for Engine clock
[9.433618] [drm] DM_PPLIB:   3
[9.433649] [drm] DM_PPLIB:   58800
[9.433679] [drm] DM_PPLIB:   95200
[9.433710] [drm] DM_PPLIB:   104100
[9.433740] [drm] DM_PPLIB:   110600
[9.433771] [drm] DM_PPLIB:   116800
[9.433801] [drm] DM_PPLIB:   120900
[9.433831] [drm] DM_PPLIB:   124400
[9.433862] [drm] DM_PPLIB: Validation clocks:
[9.433894] [drm] DM_PPLIB:engine_max_clock: 124400
[9.433926] [drm] DM_PPLIB:memory_max_clock: 15
[9.433958] [drm] DM_PPLIB:level   : 8
[9.433990] [drm] DM_PPLIB: values for Memory clock
[9.434026] [drm] DM_PPLIB:   3
[9.434056] [drm] DM_PPLIB:   10
[9.434087] [drm] DM_PPLIB:   15
[9.434117] [drm] DM_PPLIB: Validation clocks:
[9.434148] [drm] DM_PPLIB:engine_max_clock: 124400
[9.434180] [drm] DM_PPLIB:memory_max_clock: 15
[9.434212] [drm] DM_PPLIB:level   : 8
[9.434662] [drm] Display Core initialized with v3.1.44!
[9.447631] [drm] SADs count is: -2, don't need to read it
[9.447676] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[9.447710] [drm] Driver supports precise vblank timestamp query.
[9.471140] random: crng init done
[9.471202] random: 7 urandom warning(s) missed due to ratelimiting
[9.496908] [drm] UVD and UVD ENC initialized successfully.
[9.607867] [drm] VCE initialized successfully.
[9.609791] [drm] fb mappable at 0xC0E28000
[9.609825] [drm] vram apper at 0xC000
[9.609856] [drm] size 8294400
[9.609887] [drm] fb depth is 24
[9.609917] [drm]pitch is 7680
[9.610027] fbcon: amdgpudrmfb (fb0) is primary device
[9.650493] Console: switching to colour frame buffer device 240x67
[9.667224] amdgpu :07:00.0: fb0: amdgpudrmfb frame buffer device
[   10.083684] amdgpu: [powerplay]
    failed to send message 148 ret is 0
[   10.904841] amdgpu: [powerplay]
last message was failed ret is 0
[   11.315428