Re: amdgpu: [powerplay] failed to send message 148 ret is 0
Nice work. Thanks for tracking this down!

Alex

On Tue, Oct 30, 2018 at 12:32 PM Mikulas Patocka wrote:
>
> On Mon, 29 Oct 2018, Alex Deucher wrote:
>
> > On Thu, Oct 25, 2018 at 4:46 PM Mikulas Patocka wrote:
> > >
> > > On Wed, 24 Oct 2018, Mikulas Patocka wrote:
> > >
> > > > Hi
> > > >
> > > > I have a Sapphire Pulse RX 570 ITX graphics card.
> > > >
> > > > On Linux, I get errors "amdgpu: [powerplay] failed to send message
> > > > 148 ret is 0" and the system is stuck for several seconds when they
> > > > happen. The card works, except for these errors and occasional delays.
> > >
> > > I've found that PP_PCIE_DPM_MASK causes these errors. If I turn this bit
> > > off in amdgpu.ppfeaturemask, there are no more errors (and turning it
> > > off also fixes hibernation problems).
> > >
> > > Should it be turned off automatically in response to these errors?
> >
> > What platform are you running on? Are you running in a VM? The
> > driver accesses pci config space on the bridge to determine the pcie
> > gen and lane caps of the platform to determine what clocks and lanes
> > are valid. See amdgpu_device_get_pcie_info(). It would be good to
> > figure out why this is not working on your platform.
> >
> > Alex
>
> It's not a VM. It's an old motherboard with dual socket F. It has HT2000
> north bridge and HT1000 south bridge. It has two PCIe-v1 8-lane slots.
>
> I've found the bug - pcie_get_speed_cap incorrectly tests the lnkcap
> variable against values that are not bit-masks, so that the PCIe port is
> incorrectly reported as 8 GT/s capable. When I fix these tests, the errors
> are gone.
>
> Mikulas
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: amdgpu: [powerplay] failed to send message 148 ret is 0
On Mon, 29 Oct 2018, Alex Deucher wrote:

> On Thu, Oct 25, 2018 at 4:46 PM Mikulas Patocka wrote:
> >
> > On Wed, 24 Oct 2018, Mikulas Patocka wrote:
> >
> > > Hi
> > >
> > > I have a Sapphire Pulse RX 570 ITX graphics card.
> > >
> > > On Linux, I get errors "amdgpu: [powerplay] failed to send message
> > > 148 ret is 0" and the system is stuck for several seconds when they
> > > happen. The card works, except for these errors and occasional delays.
> >
> > I've found that PP_PCIE_DPM_MASK causes these errors. If I turn this bit
> > off in amdgpu.ppfeaturemask, there are no more errors (and turning it
> > off also fixes hibernation problems).
> >
> > Should it be turned off automatically in response to these errors?
>
> What platform are you running on? Are you running in a VM? The
> driver accesses pci config space on the bridge to determine the pcie
> gen and lane caps of the platform to determine what clocks and lanes
> are valid. See amdgpu_device_get_pcie_info(). It would be good to
> figure out why this is not working on your platform.
>
> Alex

It's not a VM. It's an old motherboard with dual socket F. It has HT2000
north bridge and HT1000 south bridge. It has two PCIe-v1 8-lane slots.

I've found the bug - pcie_get_speed_cap incorrectly tests the lnkcap
variable against values that are not bit-masks, so that the PCIe port is
incorrectly reported as 8 GT/s capable. When I fix these tests, the errors
are gone.

Mikulas
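To illustrate the kind of bug described in this message, here is a minimal user-space sketch in C. It is not the kernel's pcie_get_speed_cap code: the constants are redefined locally (their values are assumed to match the Supported Link Speeds encoding in the kernel's pci_regs.h), and both helper functions are hypothetical names made up for this example. Speeds are returned in tenths of GT/s.

```c
#include <stdint.h>

/* Local copies for illustration; values assumed to match the
 * "Supported Link Speeds" field encoding of the Link Capabilities register. */
#define LNKCAP_SLS        0x0000000fu /* mask covering the whole field */
#define LNKCAP_SLS_2_5GB  0x00000001u /* field value, not a bit mask */
#define LNKCAP_SLS_5_0GB  0x00000002u /* field value, not a bit mask */
#define LNKCAP_SLS_8_0GB  0x00000003u /* field value, not a bit mask */

/* Hypothetical helper showing the faulty pattern: ANDing against a field
 * value. 0x1 & 0x3 is non-zero, so a 2.5 GT/s port matches the 8 GT/s test. */
static int speed_cap_bitwise(uint32_t lnkcap)
{
    if (lnkcap & LNKCAP_SLS_8_0GB)
        return 80;
    if (lnkcap & LNKCAP_SLS_5_0GB)
        return 50;
    if (lnkcap & LNKCAP_SLS_2_5GB)
        return 25;
    return 0;
}

/* Hypothetical helper showing one possible fix: isolate the field with the
 * real mask, then compare the result for equality with each value. */
static int speed_cap_by_value(uint32_t lnkcap)
{
    switch (lnkcap & LNKCAP_SLS) {
    case LNKCAP_SLS_8_0GB:
        return 80;
    case LNKCAP_SLS_5_0GB:
        return 50;
    case LNKCAP_SLS_2_5GB:
        return 25;
    default:
        return 0;
    }
}
```

For a port whose speed field holds LNKCAP_SLS_2_5GB, speed_cap_bitwise() reports 80 while speed_cap_by_value() reports 25, which is consistent with a PCIe-v1 slot being treated as 8 GT/s capable.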
Re: amdgpu: [powerplay] failed to send message 148 ret is 0
On Thu, Oct 25, 2018 at 4:46 PM Mikulas Patocka wrote:
>
> On Wed, 24 Oct 2018, Mikulas Patocka wrote:
>
> > Hi
> >
> > I have a Sapphire Pulse RX 570 ITX graphics card.
> >
> > On Linux, I get errors "amdgpu: [powerplay] failed to send message
> > 148 ret is 0" and the system is stuck for several seconds when they
> > happen. The card works, except for these errors and occasional delays.
>
> I've found that PP_PCIE_DPM_MASK causes these errors. If I turn this bit
> off in amdgpu.ppfeaturemask, there are no more errors (and turning it
> off also fixes hibernation problems).
>
> Should it be turned off automatically in response to these errors?

What platform are you running on? Are you running in a VM? The
driver accesses pci config space on the bridge to determine the pcie
gen and lane caps of the platform to determine what clocks and lanes
are valid. See amdgpu_device_get_pcie_info(). It would be good to
figure out why this is not working on your platform.

Alex
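As a rough illustration of what reading the "pcie gen and lane caps" from the bridge involves, the sketch below decodes the two relevant fields of a PCIe Link Capabilities dword in plain C. The field positions (speed in bits 3:0, maximum width in bits 9:4) follow the PCIe specification as I understand it. The function names are hypothetical, and this is not the code in amdgpu_device_get_pcie_info().

```c
#include <stdint.h>

/* Field masks for the Link Capabilities register (local definitions). */
#define LNKCAP_SLS  0x0000000fu /* Supported Link Speeds, bits 3:0 */
#define LNKCAP_MLW  0x000003f0u /* Maximum Link Width, bits 9:4 */

/* Hypothetical helper: speed code (1 = 2.5 GT/s, 2 = 5 GT/s, 3 = 8 GT/s). */
static unsigned lnkcap_speed_code(uint32_t lnkcap)
{
    return lnkcap & LNKCAP_SLS;
}

/* Hypothetical helper: maximum number of lanes the port supports. */
static unsigned lnkcap_max_width(uint32_t lnkcap)
{
    return (lnkcap & LNKCAP_MLW) >> 4;
}
```

If the value 3026c81 printed on the "probing mlw" line of the original report is taken at face value as the raw register contents (an assumption, since other numbers in that log look truncated), lnkcap_speed_code(0x3026c81) gives 1 and lnkcap_max_width(0x3026c81) gives 8, i.e. a 2.5 GT/s, 8-lane port.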
Re: amdgpu: [powerplay] failed to send message 148 ret is 0
On Wed, 24 Oct 2018, Mikulas Patocka wrote:

> Hi
>
> I have a Sapphire Pulse RX 570 ITX graphics card.
>
> On Linux, I get errors "amdgpu: [powerplay] failed to send message
> 148 ret is 0" and the system is stuck for several seconds when they
> happen. The card works, except for these errors and occasional delays.

I've found that PP_PCIE_DPM_MASK causes these errors. If I turn this bit
off in amdgpu.ppfeaturemask, there are no more errors (and turning it
off also fixes hibernation problems).

Should it be turned off automatically in response to these errors?

Mikulas
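For anyone wanting to try the same workaround, the arithmetic for clearing that one bit can be sketched in C as follows. The value 0x4 for PP_PCIE_DPM_MASK is an assumption based on the driver's feature-mask enum and should be checked against amd_shared.h in your kernel tree. The helper name is hypothetical.

```c
#include <stdint.h>

/* Assumed value of the PCIe DPM feature bit; verify in your kernel source. */
#define PP_PCIE_DPM_MASK 0x4u

/* Hypothetical helper: return the feature mask with PCIe DPM disabled and
 * every other feature bit left unchanged. */
static uint32_t without_pcie_dpm(uint32_t ppfeaturemask)
{
    return ppfeaturemask & ~PP_PCIE_DPM_MASK;
}
```

Starting from an illustrative mask of 0xffffffff (your driver's default may differ), the result is 0xfffffffb, which would then be passed as amdgpu.ppfeaturemask=0xfffffffb.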
amdgpu: [powerplay] failed to send message 148 ret is 0
Hi

I have a Sapphire Pulse RX 570 ITX graphics card.

On Linux, I get errors "amdgpu: [powerplay] failed to send message 148 ret
is 0" and the system is stuck for several seconds when they happen. The
card works, except for these errors and occasional delays.

Do you have an idea what could cause these errors or how to debug them?

There's nothing to bisect because all the kernels that I tried (back to
4.9) show these errors. I've also tried a kernel from branch
"origin/amd-staging-drm-next" from amdgpu git, but it has even more of
these errors than 4.18.16.

I tried newer firmware from
git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git,
but it didn't help.

Some users suggest that BIOS upgrade may help with this, but there's no
BIOS for this card on the Sapphire website.

Mikulas

[9.371716] [drm] amdgpu kernel modesetting enabled.
[9.372068] [drm] initializing kernel modesetting (POLARIS10 0x1002:0x67DF 0x1DA2:0xE343 0xEF).
[9.372126] [drm] register mmio base: 0xFF5C
[9.372158] [drm] register mmio size: 262144
[9.372194] [drm] probing mlw for device 1166:132 = 3026c81
[9.372228] [drm] add ip block number 0
[9.372260] [drm] add ip block number 1
[9.372292] [drm] add ip block number 2
[9.372324] [drm] add ip block number 3
[9.372356] [drm] add ip block number 4
[9.372387] [drm] add ip block number 5
[9.372419] [drm] add ip block number 6
[9.372452] [drm] add ip block number 7
[9.372483] [drm] add ip block number 8
[9.372530] [drm] UVD is enabled in VM mode
[9.372561] [drm] UVD ENC is enabled in VM mode
[9.372594] [drm] VCE enabled in VM mode
[9.372807] amdgpu :07:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0x
[9.373681] ATOM BIOS: 113-D00034-L01
[9.373751] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[9.373848] amdgpu :07:00.0: VRAM: 4096M 0x00F4 - 0x00F4 (4096M used)
[9.373894] amdgpu :07:00.0: GTT: 256M 0x - 0x0FFF
[9.373941] [drm] Detected VRAM RAM=4096M, BAR=256M
[9.373974] [drm] RAM width 256bits GDDR5
[9.374090] [TTM] Zone kernel: Available graphics memory: 66051588 kiB
[9.374124] [TTM] Zone dma32: Available graphics memory: 2097152 kiB
[9.374158] [TTM] Initializing pool allocator
[9.374193] [TTM] Initializing DMA pool allocator
[9.374258] [drm] amdgpu: 4096M of VRAM memory ready
[9.374291] [drm] amdgpu: 4096M of GTT memory ready.
[9.374331] [drm] GART: num cpu pages 65536, num gpu pages 65536
[9.374419] [drm] PCIE GART of 256M enabled (table at 0x00F40090).
[9.374616] [drm] Chained IB support enabled!
[9.376667] [drm] Found UVD firmware Version: 1.130 Family ID: 16
[9.379218] [drm] Found VCE firmware Version: 53.26 Binary ID: 3
[9.433581] [drm] DM_PPLIB: values for Engine clock
[9.433618] [drm] DM_PPLIB: 3
[9.433649] [drm] DM_PPLIB: 58800
[9.433679] [drm] DM_PPLIB: 95200
[9.433710] [drm] DM_PPLIB: 104100
[9.433740] [drm] DM_PPLIB: 110600
[9.433771] [drm] DM_PPLIB: 116800
[9.433801] [drm] DM_PPLIB: 120900
[9.433831] [drm] DM_PPLIB: 124400
[9.433862] [drm] DM_PPLIB: Validation clocks:
[9.433894] [drm] DM_PPLIB:engine_max_clock: 124400
[9.433926] [drm] DM_PPLIB:memory_max_clock: 15
[9.433958] [drm] DM_PPLIB:level : 8
[9.433990] [drm] DM_PPLIB: values for Memory clock
[9.434026] [drm] DM_PPLIB: 3
[9.434056] [drm] DM_PPLIB: 10
[9.434087] [drm] DM_PPLIB: 15
[9.434117] [drm] DM_PPLIB: Validation clocks:
[9.434148] [drm] DM_PPLIB:engine_max_clock: 124400
[9.434180] [drm] DM_PPLIB:memory_max_clock: 15
[9.434212] [drm] DM_PPLIB:level : 8
[9.434662] [drm] Display Core initialized with v3.1.44!
[9.447631] [drm] SADs count is: -2, don't need to read it
[9.447676] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[9.447710] [drm] Driver supports precise vblank timestamp query.
[9.471140] random: crng init done
[9.471202] random: 7 urandom warning(s) missed due to ratelimiting
[9.496908] [drm] UVD and UVD ENC initialized successfully.
[9.607867] [drm] VCE initialized successfully.
[9.609791] [drm] fb mappable at 0xC0E28000
[9.609825] [drm] vram apper at 0xC000
[9.609856] [drm] size 8294400
[9.609887] [drm] fb depth is 24
[9.609917] [drm] pitch is 7680
[9.610027] fbcon: amdgpudrmfb (fb0) is primary device
[9.650493] Console: switching to colour frame buffer device 240x67
[9.667224] amdgpu :07:00.0: fb0: amdgpudrmfb frame buffer device
[ 10.083684] amdgpu: [powerplay] failed to send message 148 ret is 0
[ 10.904841] amdgpu: [powerplay] last message was failed ret is 0
[ 11.315428