[Bug 204609] amdgpu: powerplay failed send message
https://bugzilla.kernel.org/show_bug.cgi?id=204609

--- Comment #10 from w...@hllmnn.de ---
Created attachment 290911
  --> https://bugzilla.kernel.org/attachment.cgi?id=290911&action=edit
journalctl output

Adding journalctl output

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
[Bug 204609] amdgpu: powerplay failed send message
https://bugzilla.kernel.org/show_bug.cgi?id=204609

w...@hllmnn.de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |w...@hllmnn.de

--- Comment #9 from w...@hllmnn.de ---
I can reproduce this bug with kernel 5.7.0 from Debian bullseye (5.7.0-2-amd64)
when using a dual monitor setup. If I boot with only one monitor connected and
connect the second one later, everything seems to work fine. Both monitors are
connected via DisplayPort, but it does not change anything if I connect one of
them via HDMI.

The significant errors in dmesg read:

Aug 15 17:15:03 dino kernel: failed send message: TransferTableSmu2Dram (18) param: 0x0006 response 0xffc2
Aug 15 17:15:03 dino kernel: Failed to export SMU metrics table!
Aug 15 17:15:05 dino kernel: Msg issuing pre-check failed and SMU may be not in the right state!
Aug 15 17:15:05 dino kernel: [drm:amdgpu_dpm_enable_uvd [amdgpu]] *ERROR* Dpm enable uvd failed, ret = -62.
Aug 15 17:15:06 dino kernel: amdgpu :09:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vcn_enc0 (-110).
Aug 15 17:15:07 dino kernel: amdgpu :09:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vcn_enc1 (-110).
Aug 15 17:15:08 dino kernel: Msg issuing pre-check failed and SMU may be not in the right state!
Aug 15 17:15:08 dino kernel: Failed to export SMU metrics table!
Aug 15 17:15:11 dino kernel: Msg issuing pre-check failed and SMU may be not in the right state!
Aug 15 17:15:11 dino kernel: [drm:jpeg_v2_0_set_powergating_state [amdgpu]] *ERROR* Dpm enable jpeg failed, ret = -62.
Aug 15 17:15:11 dino kernel: [drm:process_one_work] *ERROR* ib ring test failed (-110).

I am using the firmware files from this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=49e9ea898f870ae09b91ccd3dd1c45d520fcb0c3
(commit date: 2020-08-07)
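The negative numbers in the errors above (-62, -110) look like negated Linux errno values. As a rough aid for reading such logs, here is a minimal Python sketch that pulls them out of a line and names them. The small lookup table is hard-coded from the generic Linux errno numbering, and the helper name `decode_return_codes` is made up for this note; it is not part of any amdgpu tooling.

```python
import re

# Linux errno values that appear in this thread (generic Linux numbering).
LINUX_ERRNO = {22: "EINVAL", 62: "ETIME", 110: "ETIMEDOUT"}

# A negative integer that is not glued to a word, digit or dot,
# e.g. "ret = -62." or "(-110).".
_CODE = re.compile(r"(?<![\w.])-(\d+)\b")


def decode_return_codes(line):
    """Return (code, name) pairs for every negative return code in a log line."""
    return [(-int(n), LINUX_ERRNO.get(int(n), "unknown")) for n in _CODE.findall(line)]


sample = [
    "[drm:amdgpu_dpm_enable_uvd [amdgpu]] *ERROR* Dpm enable uvd failed, ret = -62.",
    "[drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vcn_enc0 (-110).",
]
for line in sample:
    print(decode_return_codes(line))
```

Read this way, both kinds of failure in the log are timeouts (ETIME and ETIMEDOUT), which would fit a firmware that has stopped answering rather than one that rejects the request.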
[Bug 204609] amdgpu: powerplay failed send message
https://bugzilla.kernel.org/show_bug.cgi?id=204609

--- Comment #8 from Mikhail Tuchkov (tuchkov.mikh...@gmail.com) ---
Created attachment 290075
  --> https://bugzilla.kernel.org/attachment.cgi?id=290075&action=edit
dmesg
[Bug 204609] amdgpu: powerplay failed send message
https://bugzilla.kernel.org/show_bug.cgi?id=204609

Mikhail Tuchkov (tuchkov.mikh...@gmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tuchkov.mikh...@gmail.com

--- Comment #7 from Mikhail Tuchkov (tuchkov.mikh...@gmail.com) ---
Same hardware (Sapphire RX 5700 and X570) and similar issues with the GPU.
Attaching dmesg output.
[Bug 204609] amdgpu: powerplay failed send message
https://bugzilla.kernel.org/show_bug.cgi?id=204609

Janpieter Sollie (janpieter.sol...@edpnet.be) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |janpieter.sol...@edpnet.be

--- Comment #6 from Janpieter Sollie (janpieter.sol...@edpnet.be) ---
I hope this is the right place to add some comments:
Using 5.5.14, when the GPU runs at full speed, it seems to overheat after some
time. I'd expect powerplay to take care of this critical temperature and slow
down the GPU. Instead, the GPU overheats and a system reset is necessary.
The log is flooded with:

Apr 11 06:45:54 linuxserver kernel: amdgpu :05:00.0: GPU reset begin!
Apr 11 06:45:54 linuxserver kernel: [drm] Timeout wait for RLC serdes 0,0
Apr 11 06:45:56 linuxserver last message repeated 4 times
Apr 11 06:45:57 linuxserver kernel: amdgpu: [powerplay]
Apr 11 06:45:57 linuxserver kernel: last message was failed ret is 65535
Apr 11 06:45:57 linuxserver kernel: amdgpu: [powerplay]
Apr 11 06:45:57 linuxserver kernel: failed to send message 133 ret is 65535
Apr 11 06:45:57 linuxserver kernel: amdgpu: [powerplay]
Apr 11 06:45:57 linuxserver kernel: last message was failed ret is 65535
Apr 11 06:45:57 linuxserver kernel: amdgpu: [powerplay]
Apr 11 06:45:57 linuxserver kernel: failed to send message 5e ret is 65535
Apr 11 06:45:57 linuxserver kernel: amdgpu: [powerplay]
Apr 11 06:45:57 linuxserver kernel: last message was failed ret is 65535
Apr 11 06:45:57 linuxserver kernel: amdgpu: [powerplay]
Apr 11 06:45:57 linuxserver kernel: failed to send message 145 ret is 65535
Apr 11 06:45:57 linuxserver kernel: amdgpu: [powerplay]
Apr 11 06:45:57 linuxserver kernel: last message was failed ret is 65535
Apr 11 06:45:57 linuxserver kernel: amdgpu: [powerplay]
Apr 11 06:45:57 linuxserver kernel: failed to send message 146 ret is 65535
Apr 11 06:45:57 linuxserver kernel: amdgpu: [powerplay]
Apr 11 06:45:57 linuxserver kernel: last message was failed ret is 65535
Apr 11 06:45:57 linuxserver kernel: amdgpu: [powerplay]
Apr 11 06:45:57 linuxserver kernel: failed to send message 148 ret is 65535
Apr 11 06:45:57 linuxserver kernel: amdgpu: [powerplay]
Apr 11 06:45:57 linuxserver kernel: last message was failed ret is 65535
Apr 11 06:45:57 linuxserver kernel: amdgpu: [powerplay]
Apr 11 06:45:57 linuxserver kernel: failed to send message 145 ret is 65535
Apr 11 06:45:57 linuxserver kernel: amdgpu: [powerplay]
Apr 11 06:45:57 linuxserver kernel: last message was failed ret is 65535
Apr 11 06:45:57 linuxserver kernel: amdgpu: [powerplay]
Apr 11 06:45:57 linuxserver kernel: failed to send message 146 ret is 65535
Apr 11 06:45:57 linuxserver kernel: amdgpu: [powerplay]
Apr 11 06:45:57 linuxserver kernel: last message was failed ret is 65535
Apr 11 06:45:57 linuxserver kernel: amdgpu: [powerplay]
Apr 11 06:45:57 linuxserver kernel: failed to send message 16a ret is 65535
Apr 11 06:45:57 linuxserver kernel: amdgpu: [powerplay]
Apr 11 06:45:57 linuxserver kernel: last message was failed ret is 65535
Apr 11 06:45:57 linuxserver kernel: amdgpu: [powerplay]
Apr 11 06:45:57 linuxserver kernel: failed to send message 186 ret is 65535
Apr 11 06:45:57 linuxserver kernel: amdgpu: [powerplay]
Apr 11 06:45:57 linuxserver kernel: last message was failed ret is 65535
Apr 11 06:45:57 linuxserver kernel: amdgpu: [powerplay]
Apr 11 06:45:57 linuxserver kernel: failed to send message 54 ret is 65535
Apr 11 06:45:57 linuxserver kernel: amdgpu: [powerplay]
Apr 11 06:45:57 linuxserver kernel: last message was failed ret is 65535
Apr 11 06:45:57 linuxserver kernel: amdgpu: [powerplay]
Apr 11 06:45:57 linuxserver kernel: failed to send message 13d ret is 65535
Apr 11 06:45:57 linuxserver kernel: amdgpu: [powerplay]

Then after some time, the kernel realizes it did something wrong, but it seems
too late:

[ 509.939497] amdgpu: [powerplay] Failed to force to switch arbf0!
[ 509.939497] amdgpu: [powerplay] [disable_dpm_tasks] Failed to disable DPM!
[ 509.939513] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block failed -22
[ 510.203234] amdgpu :05:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[ 510.203254] [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[ 510.313006] AMD-Vi: Completion-Wait loop timed out
[ 510.730195] cp is busy, skip halt cp
[ 510.744990] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=05:00.0 address=0x5f38cdcf0]
[ 510.993797] rlc is busy, skip halt rlc
[ 511.312957] AMD-Vi: Completion-Wait loop timed out
[ 511.522695] AMD-Vi: Completion-Wait loop timed out
[ 511.742668] AMD-Vi: Completion-Wait loop timed out
[ 511.746867] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=05:00.0 address=0x5f38cdd20]
[ 512.748750] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=05:00
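To get an overview of a flood like the one above, the failed message IDs can be tallied with a short Python sketch. It assumes the IDs are printed in hexadecimal (values such as 5e, 16a and 13d suggest so) and that "ret is" is decimal; the function name `tally_failed_messages` is hypothetical. Note that 65535 is 0xffff, i.e. every bit of a 16-bit value set.

```python
import re
from collections import Counter

# Matches e.g. "failed to send message 16a ret is 65535".
_FAILED = re.compile(r"failed to send message ([0-9a-fA-F]+) ret is (\d+)")


def tally_failed_messages(lines):
    """Count failed powerplay messages per ID and collect the distinct return values."""
    counts = Counter()
    rets = set()
    for line in lines:
        match = _FAILED.search(line)
        if match:
            counts[int(match.group(1), 16)] += 1  # assumption: ID is hexadecimal
            rets.add(int(match.group(2)))
    return counts, rets


log = [
    "Apr 11 06:45:57 linuxserver kernel: failed to send message 133 ret is 65535",
    "Apr 11 06:45:57 linuxserver kernel: last message was failed ret is 65535",
    "Apr 11 06:45:57 linuxserver kernel: failed to send message 145 ret is 65535",
    "Apr 11 06:45:57 linuxserver kernel: failed to send message 145 ret is 65535",
]
counts, rets = tally_failed_messages(log)
for msg_id, n in sorted(counts.items()):
    print(f"message 0x{msg_id:x}: {n} failure(s)")
print("return values:", [hex(r) for r in sorted(rets)])
```

If every failure reports the same 0xffff, that hints at a device that no longer responds at all, rather than at individual messages being rejected.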
[Bug 204609] amdgpu: powerplay failed send message
https://bugzilla.kernel.org/show_bug.cgi?id=204609

Błażej Szczygieł (spa...@wp.pl) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |spa...@wp.pl

--- Comment #5 from Błażej Szczygieł (spa...@wp.pl) ---
Created attachment 286329
  --> https://bugzilla.kernel.org/attachment.cgi?id=286329&action=edit
kernel5.4 powerplay add mutex

Also the same as: https://gitlab.freedesktop.org/drm/amd/issues/900

On kernel 5.4.3, run `watch -n 0.1 cat
/sys/class/drm/card0/device/hwmon/hwmon*/fan1_input` a few times in parallel
and you'll get a lot of warnings and finally a system freeze.

On kernel 5.5-rc2, powerplay warnings and finally a system freeze can happen
when sensors are read while the screen mode is changed (screen resolution
change, etc.).

In my opinion, a lot of powerplay issues are caused by missing mutexes between
kernel and user space. I prepared a very simple kernel 5.4 patch which allows
me to use a fan monitor, but there are probably still some missing mutexes.
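The parallel `watch` reproduction described in this comment can be approximated from a single script. The following Python sketch starts several threads that all read the same sysfs file in a tight loop and counts how many reads fail. The helper name `hammer` and its defaults are invented for illustration; on an affected kernel, running it against the real fan1_input path may freeze the machine, just like the original reproduction.

```python
import glob
import threading

# Path pattern taken from the comment above.
FAN_GLOB = "/sys/class/drm/card0/device/hwmon/hwmon*/fan1_input"


def hammer(paths, readers=8, iterations=100):
    """Read every path from several threads at once; return (ok, errors) counts."""
    ok = 0
    errors = 0
    lock = threading.Lock()  # protects the counters only, not the reads

    def worker():
        nonlocal ok, errors
        for _ in range(iterations):
            for path in paths:
                try:
                    with open(path) as handle:
                        handle.read()
                    succeeded = True
                except OSError:
                    succeeded = False
                with lock:
                    if succeeded:
                        ok += 1
                    else:
                        errors += 1

    threads = [threading.Thread(target=worker) for _ in range(readers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return ok, errors


if __name__ == "__main__":
    print(hammer(glob.glob(FAN_GLOB)))
```

The reads are deliberately left unserialized, since concurrent access from user space is the condition the comment suspects of triggering the problem.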
[Bug 204609] amdgpu: powerplay failed send message
https://bugzilla.kernel.org/show_bug.cgi?id=204609

Jon (j...@moozaad.co.uk) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |j...@moozaad.co.uk

--- Comment #4 from Jon (j...@moozaad.co.uk) ---
Likely the same as this:
https://gitlab.freedesktop.org/drm/amd/issues/929
Multiple DisplayPort connections.

If you have flickering too (not sure if it's all related), there's a dpm
override you can try:
`echo "low" > /sys/class/drm/card0/device/power_dpm_force_performance_level`
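For anyone who wants to script the override from this comment, here is a small Python sketch that does the same as the `echo` command but remembers the previous value so it can be restored. The function name `force_performance_level` is made up for this note; the list of accepted levels is what the amdgpu documentation lists as far as I know, so treat it as an assumption. Writing to the real sysfs file needs root.

```python
from pathlib import Path

# sysfs file from the comment above.
LEVEL_PATH = Path("/sys/class/drm/card0/device/power_dpm_force_performance_level")

# Assumption: levels accepted by amdgpu, per its kernel documentation.
KNOWN_LEVELS = {
    "auto", "low", "high", "manual",
    "profile_standard", "profile_min_sclk", "profile_min_mclk",
    "profile_peak", "profile_exit",
}


def force_performance_level(level, path=LEVEL_PATH):
    """Write a new DPM performance level and return the previous one."""
    if level not in KNOWN_LEVELS:
        raise ValueError(f"unknown performance level: {level!r}")
    path = Path(path)
    previous = path.read_text().strip()
    path.write_text(level + "\n")
    return previous
```

A typical use would be `old = force_performance_level("low")` while testing, followed by `force_performance_level(old)` to go back to the earlier setting.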
[Bug 204609] amdgpu: powerplay failed send message
https://bugzilla.kernel.org/show_bug.cgi?id=204609

--- Comment #3 from Guido Winkelmann (guido-kern-b...@unknownsite.de) ---
Created attachment 285867
  --> https://bugzilla.kernel.org/attachment.cgi?id=285867&action=edit
dmesg output from another system
[Bug 204609] amdgpu: powerplay failed send message
https://bugzilla.kernel.org/show_bug.cgi?id=204609

Guido Winkelmann (guido-kern-b...@unknownsite.de) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |guido-kern-bugs@unknownsite
                   |                            |.de

--- Comment #2 from Guido Winkelmann (guido-kern-b...@unknownsite.de) ---
I have very similar problems with kernel 5.4-rc7. In my case it's a Sapphire
5700XT, and I am using Gentoo instead of Ubuntu; however, the problems start
well before any userspace code is loaded.

In my case, the problems cause long delays on system boot, once before loading
init and once while X is loading. As a very rough estimate, I think the delays
amount to about 2 minutes. Once X is loaded, I can no longer see any delays.

All sensors on the board are completely non-functional. Here is a sample output
from "sensors":

==
amdgpu-pci-0a00
Adapter: PCI adapter
vddgfx: +0.75 V
ERROR: Can't get value of subfeature fan1_input: I/O error
fan1: N/A (min = 0 RPM, max = 3200 RPM)
ERROR: Can't get value of subfeature temp1_input: I/O error
edge: N/A (crit = +118.0°C, hyst = -273.1°C)
          (emerg = +99.0°C)
ERROR: Can't get value of subfeature temp2_input: I/O error
junction: N/A (crit = +99.0°C, hyst = -273.1°C)
              (emerg = +99.0°C)
ERROR: Can't get value of subfeature temp3_input: I/O error
mem: N/A (crit = +99.0°C, hyst = -273.1°C)
         (emerg = +99.0°C)
ERROR: Can't get value of subfeature power1_average: I/O error
power1: N/A (cap = 195.00 W)

k10temp-pci-00c3
Adapter: PCI adapter
Tdie: +31.5°C (high = +70.0°C)
Tctl: +41.5°C
==

I did not experience any of these problems with 5.3.10.

I have three monitors connected to this board, all of them via DisplayPort.

I am attaching another dmesg log.
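The "I/O error" lines from `sensors` above can also be checked without lm-sensors by reading the hwmon files directly. This Python sketch walks one hwmon directory and reports, per file, either the raw integer value or the errno name of the failure (EIO would correspond to the I/O error shown above). The function name `probe_inputs` is hypothetical, and the directory is assumed to be something like /sys/class/drm/card0/device/hwmon/hwmon*, the location mentioned elsewhere in this thread.

```python
import errno
import os


def probe_inputs(hwmon_dir):
    """Map each *_input / *_average file to its integer value or an error string."""
    results = {}
    for name in sorted(os.listdir(hwmon_dir)):
        if not (name.endswith("_input") or name.endswith("_average")):
            continue
        try:
            with open(os.path.join(hwmon_dir, name)) as handle:
                results[name] = int(handle.read().strip())
        except OSError as exc:
            # e.g. "error: EIO" when the driver cannot talk to the SMU
            results[name] = "error: " + errno.errorcode.get(exc.errno, "unknown")
        except ValueError:
            results[name] = "error: unparsable"
    return results
```

Running it once per boot stage (before and after X starts, for instance) would show whether the sensors ever recover.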
[Bug 204609] amdgpu: powerplay failed send message
https://bugzilla.kernel.org/show_bug.cgi?id=204609

Witold Baryluk (bary...@smp.if.uj.edu.pl) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bary...@smp.if.uj.edu.pl

--- Comment #1 from Witold Baryluk (bary...@smp.if.uj.edu.pl) ---
Do you use two displays by any chance?