[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #172 from line...@xcpp.org --- I had dpm=2 as a module option. GPU initialization failure does not occur without dpm=2 -- You are receiving this mail because: You are the assignee for the bug.___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 Alex Deucher changed:
What | Removed | Added
Attachment #146026 mime type | text/x-log | text/plain
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #171 from line...@xcpp.org ---
Created attachment 146026 --> https://bugs.freedesktop.org/attachment.cgi?id=146026&action=edit
5.4.0-arch1-1 GPU initialization fails

With kernel version 5.4.0-arch1-1 the GPU can flat out no longer be initialized. My system is now completely unusable with the current kernel. Does this specifically mean anything?

[ 15.575361] amdgpu: [powerplay] smu driver if version = 0x0013, smu fw if version = 0x0012, smu fw version = 0x00282d00 (40.45.0)
[ 15.575362] amdgpu: [powerplay] SMU driver if version not matched
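For anyone comparing these lines across kernels, here is a minimal Python sketch that pulls the three version fields out of the powerplay line quoted in comment #171 and checks whether the two interface versions agree. The function name `parse_smu_versions` is hypothetical, and the byte-wise split of the firmware version is an assumption inferred only from the dotted string `(40.45.0)` printed in that same line, not taken from the driver source.

```python
import re

# Log line copied from comment #171.
LINE = (
    "[ 15.575361] amdgpu: [powerplay] smu driver if version = 0x0013, "
    "smu fw if version = 0x0012, smu fw version = 0x00282d00 (40.45.0)"
)

PATTERN = re.compile(
    r"smu driver if version = (0x[0-9a-fA-F]+), "
    r"smu fw if version = (0x[0-9a-fA-F]+), "
    r"smu fw version = (0x[0-9a-fA-F]+)"
)


def parse_smu_versions(line):
    """Return the interface and firmware versions found in one log line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    driver_if, fw_if, fw = (int(group, 16) for group in match.groups())
    return {
        "driver_if": driver_if,
        "fw_if": fw_if,
        # Assumption: one byte each for major, minor and debug revision.
        "fw_version": ((fw >> 16) & 0xFF, (fw >> 8) & 0xFF, fw & 0xFF),
        "matched": driver_if == fw_if,
    }


info = parse_smu_versions(LINE)
print(info)
```

For the line above, the driver interface version (0x13) and the firmware interface version (0x12) differ, which is what the following "SMU driver if version not matched" message reports.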
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #170 from Peter Hercek ---
Maybe this helps since there is a stack trace. The GUI stopped responding, so I shut the machine down over ssh. There was a kernel crash during the shutdown on 5.3.6-arch1-1-ARCH even with amdgpu.dpm=0, which is the option that is supposed to work. This kernel has both the patch and amdgpu.dpm=0.

Nov 04 17:38:58 phnm kernel: [ cut here ]
Nov 04 17:38:58 phnm kernel: WARNING: CPU: 6 PID: 640 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:5804 amdgpu_dm_atomic_commit_tail.cold+0x82/0xed [amdgpu]
Nov 04 17:38:58 phnm kernel: Modules linked in: fuse xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter tun bridge cfg80211 rfkill 8021q garp mrp stp llc intel_rapl_msr intel_rapl_common amdgpu x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel hid_microsoft radeon mousedev input_leds joydev ff_memless kvm gpu_sched snd_hda_codec_realtek snd_hda_codec_generic i2c_algo_bit irqbypass ledtrig_audio ttm crct10dif_pclmul snd_hda_intel crc32_pclmul hid_generic ghash_clmulni_intel cdc_acm drm_kms_helper snd_hda_codec aesni_intel usbhid iTCO_wdt iTCO_vendor_support snd_hda_core wmi_bmof aes_x86_64 hid crypto_simd cryptd mxm_wmi snd_hwdep glue_helper drm intel_cstate snd_pcm agpgart r8169 syscopyarea intel_uncore sysfillrect realtek sysimgblt snd_timer pcspkr i2c_i801 fb_sys_fops e1000e intel_rapl_perf
Nov 04 17:38:58 phnm kernel: mei_me snd libphy mei soundcore lpc_ich wmi evdev mac_hid sg ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 crc32c_intel firewire_ohci xhci_pci xhci_hcd firewire_core ehci_pci crc_itu_t ehci_hcd sr_mod cdrom sd_mod ahci libahci libata scsi_mod
Nov 04 17:38:58 phnm kernel: CPU: 6 PID: 640 Comm: Xorg Not tainted 5.3.6-arch1-1-ARCH #1
Nov 04 17:38:58 phnm kernel: Hardware name: System manufacturer System Product Name/P9X79, BIOS 4502 10/15/2013
Nov 04 17:38:58 phnm kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail.cold+0x82/0xed [amdgpu]
Nov 04 17:38:58 phnm kernel: Code: c7 c7 08 1e db c0 e8 0f 59 a0 db 0f 0b 41 83 7c 24 08 00 0f 85 92 ff f1 ff e9 ad ff f1 ff 48 c7 c7 08 1e db c0 e8 f0 58 a0 db <0f> 0b e9 32 f5 f1 ff 48 8b 85 00 fd ff ff 4c 89 f2 48 c7 c6 0d 0f
Nov 04 17:38:58 phnm kernel: RSP: 0018:a98c410475a0 EFLAGS: 00010046
Nov 04 17:38:58 phnm kernel: RAX: 0024 RBX: 894125e06000 RCX:
Nov 04 17:38:58 phnm kernel: RDX: RSI: 0003 RDI:
Nov 04 17:38:58 phnm kernel: RBP: a98c410478c0 R08: 16b622fb648e R09: 9deb3254
Nov 04 17:38:58 phnm kernel: R10: 0616 R11: 0001d890 R12: 0286
Nov 04 17:38:58 phnm kernel: R13: 8940f30b0400 R14: 894129c2 R15: 894075ba6a00
Nov 04 17:38:58 phnm kernel: FS: 7fbf9c35c500() GS:89413fb8() knlGS:
Nov 04 17:38:58 phnm kernel: CS: 0010 DS: ES: CR0: 80050033
Nov 04 17:38:58 phnm kernel: CR2: 559991d31420 CR3: 00082a644002 CR4: 000606e0
Nov 04 17:38:58 phnm kernel: Call Trace:
Nov 04 17:38:58 phnm kernel: ? commit_tail+0x3c/0x70 [drm_kms_helper]
Nov 04 17:38:58 phnm kernel: commit_tail+0x3c/0x70 [drm_kms_helper]
Nov 04 17:38:58 phnm kernel: drm_atomic_helper_commit+0x108/0x110 [drm_kms_helper]
Nov 04 17:38:58 phnm kernel: drm_client_modeset_commit_atomic+0x1e8/0x200 [drm]
Nov 04 17:38:58 phnm kernel: drm_client_modeset_commit_force+0x50/0x150 [drm]
Nov 04 17:38:58 phnm kernel: drm_fb_helper_pan_display+0xc2/0x200 [drm_kms_helper]
Nov 04 17:38:58 phnm kernel: fb_pan_display+0x83/0x100
Nov 04 17:38:58 phnm kernel: fb_set_var+0x1e8/0x3d0
Nov 04 17:38:58 phnm kernel: fbcon_blank+0x1dd/0x290
Nov 04 17:38:58 phnm kernel: do_unblank_screen+0x98/0x130
Nov 04 17:38:58 phnm kernel: vt_ioctl+0xeff/0x1290
Nov 04 17:38:58 phnm kernel: tty_ioctl+0x37b/0x900
Nov 04 17:38:58 phnm kernel: ? preempt_count_add+0x68/0xa0
Nov 04 17:38:58 phnm kernel: do_vfs_ioctl+0x43d/0x6c0
Nov 04 17:38:58 phnm kernel: ? syscall_trace_enter+0x1f2/0x2e0
Nov 04 17:38:58 phnm kernel: ksys_ioctl+0x5e/0x90
Nov 04 17:38:58 phnm kernel: __x64_sys_ioctl+0x16/0x20
Nov 04 17:38:58 phnm kernel: do_syscall_64+0x5f/0x1c0
Nov 04 17:38:58 phnm kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov 04 17:38:58 phnm kernel: RIP: 0033:0x7fbf9d7b425b
Nov 04 17:38:58 phnm kernel: Code: 0f 1e fa 48 8b 05 25 9c 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 9b 0c 00 f7 d8 64 89 01 48
Nov 04 17:38:58 phnm kernel: RSP: 002b:7ffe21162798 EFLAGS: 0246 ORIG_RAX: 0010
Nov 04 17:38:58 phnm kernel: RAX: ffda RBX: 55d93ebf5180 RCX: 7fbf9d7b425b
Nov 04 17:38:58
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #169 from picar...@live.de --- I am using a Radeon VII with Arch Linux, a 1440p 144 Hz monitor, and a 4K 60 Hz monitor. I had crashes similar to the others here when running the 1440p monitor at 144 Hz; at 60 Hz it was stable. This behavior persisted from kernel 5.0 through 5.3 and only stopped when I started using kernel 5.4.0 (5.4.0-rc6-mainline right now). Now I can run it at 144 Hz without crashes. The driver still isn't working that well, as games seem very stuttery, but at least it doesn't crash anymore.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #168 from line...@xcpp.org ---
Created attachment 145784 --> https://bugs.freedesktop.org/attachment.cgi?id=145784&action=edit
5.3.7: Fence fallback timer expired on ring

Here is a freeze which went a bit differently. This time the system is frozen without any blinking and there are tons of messages like:

[ 2940.919451] [drm] Fence fallback timer expired on ring page1

This is on 5.3.7-arch1-1. (Also, I'm using only a single monitor connected through DP, as opposed to the others.)
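To see which rings are affected when there are "tons" of these messages, a rough Python sketch like the one below can tally them per ring from saved dmesg output. The helper name `count_fence_timeouts` is hypothetical, and the regular expression assumes the message always has exactly the shape of the single line quoted in comment #168.

```python
import re
from collections import Counter

# Assumed message shape, based only on the one line quoted in comment #168.
FENCE_RE = re.compile(
    r"\[\s*\d+\.\d+\] \[drm\] Fence fallback timer expired on ring (\S+)"
)


def count_fence_timeouts(dmesg_text):
    """Count 'Fence fallback timer expired' messages per ring name."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = FENCE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The first line is from comment #168; the second is an unrelated line from
# comment #129, included to show that other messages are ignored.
SAMPLE = (
    "[ 2940.919451] [drm] Fence fallback timer expired on ring page1\n"
    "[ 332.575747] [drm] schedsdma0 is not ready, skipping\n"
)

counts = count_fence_timeouts(SAMPLE)
print(dict(counts))
```

In practice the text would come from the attached dmesg log rather than from an inline sample.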
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #167 from Alex Deucher ---
(In reply to Peter Hercek from comment #166)
> I got the crash after 4 days of use. It looks the same as before:
> ring sdma0 timeout, gpu reset (allegedly successful), many skipped IBs, and
> failure to initialize parser for ever.

The parser error just means you need to restart your desktop environment. At the moment no desktop managers properly handle GPU resets (recreate their context and buffers), so you need to restart your desktop to get it back.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #166 from Peter Hercek --- I tried 5.3.6-arch1-1 on Arch Linux with 3 DP monitors. It should contain the patch, based on the comment from line...@xcpp.org. I got the crash after 4 days of use. It looks the same as before: ring sdma0 timeout, gpu reset (allegedly successful), many skipped IBs, and failure to initialize parser for ever. In my experience the error got worse with each new kernel; 5.3.6 improved it a lot, but it is still not fixed.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #165 from Tom B ---
I just tried 5.3.5 (which is the latest in the arch repo) and it's working fine for me. I do have an issue on Wayland. If the screen turns off, Wayland crashes and I have to hard reset. The log shows:

Oct 14 17:48:56 desktop kernel: amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!
Oct 14 17:49:02 desktop kernel: amdgpu: [powerplay] Failed to send message 0x26, response 0x0
Oct 14 17:49:02 desktop kernel: amdgpu: [powerplay] Failed to set soft min gfxclk !
Oct 14 17:49:02 desktop kernel: amdgpu: [powerplay] Failed to upload DPM Bootup Levels!

But this also shows on boot, so I'm not sure it's a problem, and it seems to be Wayland that segfaults, not an issue with amdgpu. I do still get `kernel: [drm] schedsdma0 is not ready, skipping` repeating forever in my journal.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #164 from line...@xcpp.org ---
(In reply to Tom B from comment #163)
> Gargoyle, linedot, can you confirm whether this crash is with both patches
> applied?
>
> I'm still on 5.3.1 patched and haven't had a single crash.

For 5.3.1 I built the kernel with the Arch build system, manually added lines to the PKGBUILD to apply the two patches, and saw them being applied in the log. For 5.3.6 I've checked that the patches are already applied.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #163 from Tom B --- Gargoyle, linedot, can you confirm whether this crash is with both patches applied? I'm still on 5.3.1 patched and haven't had a single crash.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #162 from line...@xcpp.org ---
Created attachment 145730 --> https://bugs.freedesktop.org/attachment.cgi?id=145730&action=edit
Freeze/Black screen/Crash on 5.3.6

Apologies, I have been on vacation and thus away from my main system. Attached is the dmesg log of another crash with kernel version 5.3.6. Here is a description of what the crash looked like:

1) Successfully booted up to login manager
2) Logged into a graphical session
3) Shortly after, the screen freezes
4) Screen flashes to black (~5-10 sec)
5) Screen flashes back to the frozen desktop (~5-10 sec)
6) Screen goes black (not off), no response to input, switching to tty doesn't work.

I was able to ssh into the machine from a laptop and get the dmesg output.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #161 from Gargoyle ---
Hi there. I've been trying to solve some lockups and pauses with my system and have just read this entire thread. The good news is that I am another Radeon VII owner having the same problems and I am willing to do whatever I can to help. My current situation is:

- I'm running dual 2560x1440@60Hz via DisplayPort.
- I am running the beta of Ubuntu 19.10 (Linux ryzen1910 5.3.0-18-generic #19-Ubuntu SMP Tue Oct 8 20:14:06 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux).
- I don't push the R:VII at all under Linux. I boot into Windows 10 to play games.
- I have disabled IOMMU in BIOS/EFI. With IOMMU enabled things are MUCH worse.
- My system is mostly stable. If the displays blank, sometimes after waking them I get the 15-30 second freeze, then the "amdgpu [powerplay] Failed..." messages, and then everything continues OK. I can semi-reliably recreate this by using the "xset dpms force off" command someone posted earlier. I've not managed to find any kind of pattern yet, but 8 out of 10 times running that command and then waking the system with a keypress/mouse click will cause the freeze.
- I use X11 and not Wayland. Not sure that is significant, but with Ubuntu 19.10 it seems Wayland is started temporarily and then stopped during boot / starting gdm.

If I enable IOMMU, my GDM login screen is completely corrupt. However, if I press enter (to select my user) and enter my password, my X11 GNOME session starts, although there are LOTS of pauses, warnings and errors all over the place in "journalctl -f".
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #160 from ReddestDream --- Well, today I had a hard freeze using more than one display with Radeon VII. Back to Radeon VII + iGPU . . . :(
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #159 from ReddestDream ---
Oh. Also,

cat /sys/kernel/debug/dri/0/amdgpu_pm_info

now seems to work on 5.3.4 with more than one monitor plugged in. It doesn't report nonsense values like 0 watts like it did before. :)
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #158 from ReddestDream ---
More good news. It seems that 5.3.4 does work for me and doesn't (at least immediately since I'm typing this from there right now) fall apart into a glitchy mess. I'm still not really sure of the complete stability of things tho because we do still see our old friend in dmesg:

amdgpu: [powerplay] Failed to send message 0x28, response 0x0
amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!

So, AFAICT, there's still something wrong. It's just more stable than it was before. But yeah. This is the first time since I've gotten this card that I've been able to boot to a DE w/o crashing and w/o disabling dpm. :)
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #157 from ReddestDream --- @Tom B. Well, some good news. Kernel 5.3.4 should have the patches for Radeon VII included now. I'll do some more tests on that ...
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #156 from Tom B --- This is strange because with a patched 5.3.1, I have perfect stability. An uptime of over a week and no issues. Are you saying that the issue comes back in 5.4? Hopefully not as Linux 5.4 + Mesa 19.3 looks to have a nice performance bump on the VII. With the patches, do you see the card boosting correctly? Do the wattage, voltage and clocks change under load? Asking an obvious question here, but is the crash temperature related? Maybe the patches increase power and overheat. If so, it might explain why I'm not affected as my card is water cooled.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #155 from ReddestDream --- So, I've done some tests with 5.4-rc1 and it seems like I'm getting similar results to line...@xcpp.org and sehell...@gmail.com. I'm using GNOME with Wayland (which works fine with only 1 display). Sometimes it works for a while. Sometimes I can't see the mouse cursor. Sometimes I get glitches all over the screen containing pieces and parts of previous framebuffers. But, I mean, it's better than 5.3 was, which was so bad I never could see anything and would get stuck on a black screen. At least on 5.4-rc1 I've been able to manually switch to a virtual console and reboot rather than force a reboot with the power button. Still hoping for some fix for this, but it's become less important to me as further improvements to GNOME and Mesa have made the Radeon VII + iGPU setup I've been using run smoother. I've also discovered further issues on Windows regarding the high memory clock when using multiple monitors with Radeon VII, and it's been affecting performance there too. I'm considering just sticking with 1 monitor only for this machine/card. lol
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #154 from line...@xcpp.org ---
Created attachment 145623 --> https://bugs.freedesktop.org/attachment.cgi?id=145623&action=edit
5.4.0-rc1 hangup

dmesg with 5.4.0-rc1. System freezes and becomes unresponsive to input like before.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #153 from ReddestDream --- Just FYI, it appears that kernel 5.3.2 does not have the Vega 20 fix commits that Alex Deucher mentioned.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #152 from ReddestDream --- Kernel 5.4-rc1, the first kernel version that includes the Vega 20 patches noted by Alex Deucher, is now out and in linux-mainline on Arch Linux AUR. :) I plan to do some testing of this version over the next few days, and it might be worth it for people who are still having issues to confirm on this version as well. Thanks!
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #151 from line...@xcpp.org ---
Created attachment 145583 --> https://bugs.freedesktop.org/attachment.cgi?id=145583&action=edit
5.3.1 patched, xorg crash

And here is a dmesg of just an X session crashing.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 line...@xcpp.org changed:
What | Removed | Added
Attachment #145581 is obsolete | 0 | 1

--- Comment #150 from line...@xcpp.org ---
Created attachment 145582 --> https://bugs.freedesktop.org/attachment.cgi?id=145582&action=edit
5.3.1 patched, wayland crash

Sorry, the file got messed up, here is the wayland crash.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 line...@xcpp.org changed:
What | Removed | Added
CC | | line...@xcpp.org

--- Comment #149 from line...@xcpp.org ---
Created attachment 145581 --> https://bugs.freedesktop.org/attachment.cgi?id=145581&action=edit
5.3.1 plus Alex's patches, kde wayland crash, then kde xorg crash

This issue is not fixed for me with Alex's patches. I use only a single monitor via DP, running a patched 5.3.1 kernel. Attached is a dmesg log: first a Wayland KDE session crashes; I then kill all user processes, restart sddm and start a KDE Xorg session, which later also crashes.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 Anthony Rabbito changed:
What | Removed | Added
Resolution | --- | FIXED
Status | NEW | RESOLVED

--- Comment #148 from Anthony Rabbito ---
Everyone's contribution is very much appreciated! I can finally go back to using my workstation. Alex, thank you.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #147 from ReddestDream ---
> Already merged to 5.4. I'll take a look at older kernels as well.

@Alex Deucher Thanks so much for all your help! :)
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #146 from Alex Deucher ---
(In reply to tom91136 from comment #145)
> @Alex any plans for the patches to be merged for 5.4 or even backported to
> 5.3 at some point?

Already merged to 5.4. I'll take a look at older kernels as well.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #145 from tom91...@gmail.com --- @Alex any plans for the patches to be merged for 5.4 or even backported to 5.3 at some point?
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #144 from sehell...@gmail.com --- I also think this is strange. Since yesterday, the monitors have turned off and on many times without any problems. Most likely the hang is connected with something else, but I don't know where to look.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #143 from Tom B --- I'm not sure how KDE handles monitor power behind the scenes but I have an uptime of 2 days now since applying the patches and with KDE I've let it turn off the monitors at least 6 or 7 times and suspend/resume 3 times without issue.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #142 from sehell...@gmail.com ---
(In reply to Alex Deucher from comment #141)
> (In reply to sehellion from comment #140)
> > Created attachment 145463 [details]
> > 5.3.1 with Alex's patches and dual monitors, crash
>
> That's not a crash, it's just a warning.

But the system hangs afterwards. Today it happened twice. When I try to resume work, the monitors turn on, then the secondary shows that there is no signal and the primary shows a black screen. But perhaps this is not related to this bug. I can connect via ssh and look at the logs when this happens, if necessary.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #141 from Alex Deucher ---
(In reply to sehellion from comment #140)
> Created attachment 145463 [details]
> 5.3.1 with Alex's patches and dual monitors, crash

That's not a crash, it's just a warning.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 Alex Deucher changed:
What | Removed | Added
Attachment #145463 mime type | text/x-log | text/plain
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 sehell...@gmail.com changed:
What | Removed | Added
Attachment #145461 is obsolete | 0 | 1

--- Comment #140 from sehell...@gmail.com ---
Created attachment 145463 --> https://bugs.freedesktop.org/attachment.cgi?id=145463&action=edit
5.3.1 with Alex's patches and dual monitors, crash
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #139 from sehell...@gmail.com ---
Today, when trying to wake up the monitors, the system crashed again.

WARNING: CPU: 4 PID: 32 at drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc_link_dp.c:1720 decide_link_settings+0xe0/0x2a0 [amdgpu]

The full dmesg log has been updated.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #138 from sehell...@gmail.com ---
Created attachment 145461 --> https://bugs.freedesktop.org/attachment.cgi?id=145461&action=edit
5.3.1 with Alex's patches and dual monitors
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #137 from sehell...@gmail.com ---
(In reply to Alex Deucher from comment #128)
> Do these patches help?
> https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes&id=c46e5df4ac898108da66a880c4e18f69c74f6c1b
> https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes&id=c02d6a161395dfc0c2fdabb9e976a229017288d8

Yes, these patches fix the problem.

amdgpu: [powerplay] Failed to send message 0x28, response 0x0
amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!
amdgpu :03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on sdma0 (-110).
amdgpu :03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on page0 (-110).
amdgpu: [powerplay] Failed to send message 0x26, response 0x0
amdgpu: [powerplay] Failed to set soft min gfxclk !
amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
amdgpu :03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on sdma1 (-110).
[drm:process_one_work] *ERROR* ib ring test failed (-110).

In general the system is stable.
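As a reading aid for the "IB test failed" lines in comment #137, the sketch below extracts the ring name and the numeric code from each line. The number in parentheses looks like a negated Linux errno, and 110 is ETIMEDOUT on Linux; treating it that way is an assumption about how the driver reports the error, and the helper name `failed_rings` is hypothetical.

```python
import re

# Assumed message shape, taken from the lines quoted in comment #137.
IB_FAIL_RE = re.compile(r"IB test failed on (\S+) \((-?\d+)\)")

# Only the one errno seen in this thread is mapped (Linux value).
KNOWN_ERRNOS = {110: "ETIMEDOUT"}


def failed_rings(log_text):
    """Return (ring, errno name) pairs for every 'IB test failed' line."""
    results = []
    for match in IB_FAIL_RE.finditer(log_text):
        ring = match.group(1)
        code = int(match.group(2))
        results.append((ring, KNOWN_ERRNOS.get(-code, "unknown")))
    return results


# Lines copied from comment #137.
SAMPLE = """\
amdgpu :03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on sdma0 (-110).
amdgpu :03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on page0 (-110).
amdgpu :03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on sdma1 (-110).
[drm:process_one_work] *ERROR* ib ring test failed (-110).
"""

rings = failed_rings(SAMPLE)
print(rings)
```

The final summary line ("ib ring test failed") names no ring, so the pattern deliberately skips it.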
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #133 from Anthony Rabbito ---
Created attachment 145459 --> https://bugs.freedesktop.org/attachment.cgi?id=145459&action=edit
dsmeg log with Alex's patches

Here's my dmesg with Alex's patches. Going to mess around and see what I can find.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #129 from Tom B --- Thank you Alex! That has fixed it! The card is now correctly setting its voltages and clocks. I applied the patch to 5.3.1 However, I've noticed a few very minor problems that are probably worth reporting. 1. I still get this in dmesg: [6.307005] amdgpu: [powerplay] Failed to send message 0x28, response 0x0 [6.307006] amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed! [9.225192] amdgpu :44:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on sdma0 (-110). [ 10.238621] amdgpu :44:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on page0 (-110). [ 10.532004] amdgpu: [powerplay] Failed to send message 0x26, response 0x0 [ 10.532005] amdgpu: [powerplay] Failed to set soft min gfxclk ! [ 10.532006] amdgpu: [powerplay] Failed to upload DPM Bootup Levels! Though this doesn't really matter, we were focussing our issue there earlier in the thread as it looked like `Set hard min uclk failed!` was the cause of the problem, obviously it isn't. 2. This repeats indefinitely in dmesg: [ 332.575747] [drm] schedsdma0 is not ready, skipping [ 332.582657] [drm] schedsdma0 is not ready, skipping [ 332.582864] [drm] schedsdma0 is not ready, skipping [ 332.708848] [drm] schedsdma0 is not ready, skipping [ 332.715975] [drm] schedsdma0 is not ready, skipping [ 332.716229] [drm] schedsdma0 is not ready, skipping [ 332.756987] [drm] schedsdma0 is not ready, skipping [ 332.763970] [drm] schedsdma0 is not ready, skipping [ 332.764169] [drm] schedsdma0 is not ready, skipping As you can see several dozens of times second this gets written to dmesg. This might be because the patches are intended to be used on 5.4? 3. The lowest wattage now seems to be 33w rather than 23w which means increased idle power usage and temps. This isn't really a problem but I thought it was worth mentioning and is a fair tradeoff for stability. 
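To put a number on the log spam described in comment #129, here is a minimal Python sketch. It is an illustration, not something posted in the thread: the function name `message_rates` is hypothetical, and it assumes dmesg lines of the form `[ seconds.micros] message` as shown in the excerpt.

```python
import re
from collections import Counter

# Matches dmesg lines such as "[ 332.575747] [drm] schedsdma0 is not ready, skipping".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def message_rates(lines):
    """Return {message: (count, messages_per_second)} for repeated messages."""
    counts = Counter()
    first_seen, last_seen = {}, {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a timestamped dmesg line
        timestamp, message = float(match.group(1)), match.group(2)
        counts[message] += 1
        first_seen.setdefault(message, timestamp)
        last_seen[message] = timestamp

    rates = {}
    for message, count in counts.items():
        span = last_seen[message] - first_seen[message]
        if count > 1 and span > 0:
            # (count - 1) intervals fit between the first and last occurrence.
            rates[message] = (count, (count - 1) / span)
    return rates


# The nine lines quoted in comment #129.
sample = [
    "[ 332.575747] [drm] schedsdma0 is not ready, skipping",
    "[ 332.582657] [drm] schedsdma0 is not ready, skipping",
    "[ 332.582864] [drm] schedsdma0 is not ready, skipping",
    "[ 332.708848] [drm] schedsdma0 is not ready, skipping",
    "[ 332.715975] [drm] schedsdma0 is not ready, skipping",
    "[ 332.716229] [drm] schedsdma0 is not ready, skipping",
    "[ 332.756987] [drm] schedsdma0 is not ready, skipping",
    "[ 332.763970] [drm] schedsdma0 is not ready, skipping",
    "[ 332.764169] [drm] schedsdma0 is not ready, skipping",
]

for message, (count, per_second) in message_rates(sample).items():
    print(f"{count} x {message!r}: about {per_second:.0f} per second")
```

Feeding it the output of `dmesg` split into lines should give a comparable figure for a longer capture.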
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #131 from Tom B --- In addition to my previous comment: the indefinitely repeating `[drm] schedsdma0 is not ready, skipping` message stops after a suspend/resume. Once the machine has resumed, the messages no longer appear, and the machine does suspend and resume correctly.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #132 from Anthony Rabbito ---
Created attachment 145458 --> https://bugs.freedesktop.org/attachment.cgi?id=145458&action=edit
linux-mainline5.3 dmesg without patches

Here's my current dmesg with two out of three monitors running without the patches Alex provided. I'm currently compiling the kernel with his patches to look at the differences and see if I can get my third monitor to boot up.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #134 from Anthony Rabbito --- Wow! All three of my monitors are working again at 2560x1440 @ 144Hz.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #135 from Adrian Brown --- @reddestdream Thanks. I don't think the active adapter is the problem as it works perfectly with my Vega 64. However I will try 18.04 and AMD's driver as suggested.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #136 from tom91...@gmail.com ---
Been following this thread for a while now, as I just got 3 4k 60Hz monitors connected to the 3 DP ports on my Radeon VII. I'm getting the exact same errors discussed in this report, with matching dmesg outputs. I've applied the patches to Fedora 31's 5.3.0-3 kernel and everything now works perfectly! Just a few notes:

* Idle power draw before the patch was 22W in lm_sensors; now it's reading 28W, which makes sense as the memory is now properly clocked. This also loosely matches @Tom B's results.
* I did not get the repeated `[drm] schedsdma0 is not ready, skipping` in dmesg. However, it is still possible to trigger a freeze by toggling dpms:

xset dpms force off

Resulting in:

[ 155.431068] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[ 155.431070] amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!
[ 161.334003] amdgpu: [powerplay] Failed to send message 0x26, response 0x0
[ 161.334004] amdgpu: [powerplay] Failed to set soft min gfxclk !
[ 161.334005] amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
[ 164.622060] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[ 164.622062] amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!

Previously, without the patch, the machine hung. With the patch, the displays freeze for a few seconds and then power off. Mouse movement correctly turns all screens back on and everything is back to normal.
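For anyone comparing dmesg captures between kernels, a small tally of which SMU message IDs fail can make the pattern easier to see. The following is a rough Python sketch under the assumption that the failures are logged exactly as `Failed to send message 0x.., response 0x..`, as in the excerpt from comment #136; the helper name `failed_smu_messages` is made up for illustration.

```python
import re
from collections import Counter

# Pattern taken from the powerplay lines quoted in the thread.
FAIL_RE = re.compile(
    r"Failed to send message (0x[0-9a-fA-F]+), response (0x[0-9a-fA-F]+)"
)


def failed_smu_messages(lines):
    """Count failed SMU messages, keyed by (message id, response code)."""
    counts = Counter()
    for line in lines:
        match = FAIL_RE.search(line)
        if match:
            key = (int(match.group(1), 16), int(match.group(2), 16))
            counts[key] += 1
    return counts


# Lines quoted in comment #136 after `xset dpms force off`.
sample = [
    "[ 155.431068] amdgpu: [powerplay] Failed to send message 0x28, response 0x0",
    "[ 155.431070] amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!",
    "[ 161.334003] amdgpu: [powerplay] Failed to send message 0x26, response 0x0",
    "[ 161.334004] amdgpu: [powerplay] Failed to set soft min gfxclk !",
    "[ 161.334005] amdgpu: [powerplay] Failed to upload DPM Bootup Levels!",
    "[ 164.622060] amdgpu: [powerplay] Failed to send message 0x28, response 0x0",
    "[ 164.622062] amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!",
]

for (msg_id, response), count in failed_smu_messages(sample).items():
    print(f"message {msg_id:#x} failed {count} time(s), response {response:#x}")
```

The sketch only counts occurrences; it makes no attempt to say what message 0x28 or 0x26 means beyond the log text that follows each failure.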
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #130 from Anthony Rabbito ---
(In reply to Alex Deucher from comment #128)
> Do these patches help?
> https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes&id=c46e5df4ac898108da66a880c4e18f69c74f6c1b
> https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes&id=c02d6a161395dfc0c2fdabb9e976a229017288d8

I will try to apply these patches in a few hours. Though I must say, in 5.3 things have been much better. Not perfect, and I haven't tried triple monitor yet, but definitely an improvement.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #128 from Alex Deucher ---
Do these patches help?
https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes&id=c46e5df4ac898108da66a880c4e18f69c74f6c1b
https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes&id=c02d6a161395dfc0c2fdabb9e976a229017288d8
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #127 from Alex Deucher ---
(In reply to Tom B from comment #15)
> Have been running 5.0 since release without issue but upgraded this morning
> and got crashes as described here within a few seconds of boot.

Can you bisect between 5.0 and 5.1 and see what commit caused the regression?
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #126 from ReddestDream --- @Adrian Brown Your Linux issue is potentially related to the active adapter. Have you tried w/o it? On Windows, the flickering on/around login, at least for me, has been mostly resolved by using the latest AMD driver + Windows 10 1903 and all the recent updates. There was a Windows update about a month ago that resolved a lot of flickering issues by fixing a bug in Windows's 10-bit color support. Also, if you are using Ubuntu, it might be worth downgrading to 18.04.3 so that you can use the Radeon Software for Linux Driver: https://www.amd.com/en/support/graphics/amd-radeon-2nd-generation-vega/amd-radeon-2nd-generation-vega/amd-radeon-vii Currently, I hear that using AMD's driver + a supported distro is the best way to get stability out of Radeon VII. And it's something I will probably end up trying myself if there's no resolution to the issues forthcoming with 5.4, which will be the new LTS.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #125 from Adrian Brown ---
I am also getting frequent crashes with a Radeon VII on Kubuntu 19.10 (kernel 5.0.0-29-generic). I see there is some discussion in this thread about it possibly being related to multiple monitors, but I don't think that's the case. I have a single monitor, but it is old, with only a dual link DVI connection. So I am using DisplayPort on the GPU, connected to an active adapter that converts DP to a dual link DVI connection (my monitor is a Dell 3007WFP running at 2560x1600).

I often get crashes soon after boot. They tend to happen in clusters, so it crashes a few times, then stays stable for a short time and then crashes again. I don't get these crashes on the same system when dual booted into Windows 10, so the hardware itself seems good.

One thing worth mentioning is that on Windows 10 I occasionally get a black screen and the monitor goes off for a couple of seconds. It then comes back to life. Apparently this is not uncommon, and the suspicion in the Windows community is that AMD drivers sometimes crash but Windows recovers (I never had this with my Vega 64, only with the Radeon VII). It most likely is a completely different issue of course, but I thought it worth mentioning.

Still hoping for a fix at some point. Also happy to help test any fix.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #124 from ReddestDream ---
Created attachment 145254 --> https://bugs.freedesktop.org/attachment.cgi?id=145254&action=edit
Dmesg 5.3-rc7 w/ Two monitors

This issue is still not fixed on 5.3-rc7. I guess we will probably have to wait until 5.4 (the next LTS) before more people take a look at this issue. :(
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #123 from ReddestDream ---
A few interesting fixes that touch vega20_hwmgr.c have rolled in from drm-fixes. The first is likely the most interesting for our issues, as it touches min/maxes (tho only the soft ones it seems). The other two are related to SMU versions.

https://github.com/torvalds/linux/commit/83e09d5bddbee749fc83063890244397896a1971
https://github.com/torvalds/linux/commit/21649c0b6b7899f4fa3099c46d3d027f60b107ec
https://github.com/torvalds/linux/commit/23b7f6c41d4717b1638eca47e09d7e99fc7b9fd9

I haven't tested them out yet, but it does give me some hope that someone is still looking at Vega 20/Radeon VII . . .
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #122 from ReddestDream ---
Tested 5.3-rc6. Still has the same issues. If anything it's actually worse, because I lose display completely when I use amdgpu.dpm=2 w/ Radeon VII multimonitor on 5.3-rc6, whereas on 5.2.9 I just got the same/similar errors as with the default.

I'm working on a kernel fork of 5.3-rc6 where I'm reverting various things and adding things in from Vega 10/12 and Navi to see if it helps. I haven't compiled and tested it yet, but since I know 5.3-rc6 itself compiles, boots, and demonstrates the issue, I guess it's a good base until 5.3 releases.

https://github.com/ReddestDream/linux

Any ideas anyone has are appreciated. For now I actually find that amdgpu.dpm=0 with both 4K monitors on Radeon VII allows for a much snappier generic desktop than my previous setup with AMD+iGPU. It's amazing how well this card runs 4K displays w/o any proper memory clock management at all. I'm sure the gaming performance would be pretty bad tho, but I have Windows for that for now . . .
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #121 from ReddestDream ---
Some observations:

1. Nothing at all seems to be up with cur_speed and cur_width. They get set several times in a row in both runs, but the values are all the same in both.

2. I can't really see anything up with msg/parameter either. When I compare them to each other nothing seems particularly wacky. And we also have an instance in my AMD+iGPU run where we see msg/parameter after "[drm] Initialized amdgpu", so the theory that all messages have to be sent before initialization is complete must be wrong.

Now the real question is whether we can decode what these msg/parameter values mean. But it looks more likely to me that vega20_hwmgr.c and vega20_ppt.c are just bugged somewhere (probably in the same way, since they seem to be alternate versions of each other) and that the rest of the amdgpu code is (relatively) fine.

I'm thinking we'll have to go through and knock out/debug pretty much everything in those files until we figure out where the breakage is. That's about 3000-4000 lines of code in each of those two files tho. So any thoughts anyone has about where we should start would be helpful. My focus will probably be on UCLK (since it seems to break first), SCLK (since it gets set to 0 MHz when there are multiple displays), DCEFCLK, and basically anything else that smells like it might control the memory clock and/or be affected by multiple monitors.

Thanks!
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #120 from ReddestDream ---
Created attachment 145159 --> https://bugs.freedesktop.org/attachment.cgi?id=145159&action=edit
DebugAMDiGPU

Also here is the AMD + iGPU one.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #119 from ReddestDream ---
Created attachment 145158 --> https://bugs.freedesktop.org/attachment.cgi?id=145158&action=edit
DebugAMD2Monitors

>I don't think I have time to try it today but if anyone is recompiling the
>code adding
>pr_err("msg: %d / parameter: %d\n", msg, parameter);
>to this function in smumgr.c would be a useful addition.

So, I've done just this. I also added a speed/width check to amdgpu_device_get_min_pci_speed_width in amdgpu_device.c to check the values of cur_speed and cur_width. I ran two checks with 5.2.9, one with two monitors on Radeon VII and another with my stable setup of one monitor each on Radeon VII and the Intel iGPU. Please find them attached.

Thanks so much for all your help!
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #118 from ReddestDream ---
So, this is a crazy idea, but ironically I think it might be getting closer to the truth. Tom B. attempted reverting ad51c46eec739c18be24178a30b47801b10e0357, which was known to cause some issue with an RX 580. He found that doing so fixed the multimonitor crash but locked the card to the lowest possible memory speed, which really isn't acceptable.

Perhaps our issue is connected to insufficient or improperly calculated PCIe bandwidth/speed. Speed mismatches can and will cause messages to not go through to the peripheral. It's also well-known that Radeon VII was originally a PCIe 4.0 card that AMD locked down to 3.0 speeds . . . What if, when using multiple monitors and/or higher clock speeds, Radeon VII uses more bandwidth than Linux expects, causing the loss of communication?

Something else I plan to investigate.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #116 from ReddestDream ---
Created attachment 145153 --> https://bugs.freedesktop.org/attachment.cgi?id=145153&action=edit
dmesgAMD2Monitors

I've been doing a few tests. I looked into and compiled 5.3-rc5 along with these patches, but nothing seemed to resolve our multimonitor issue. :/

https://phoronix.com/scan.php?page=news_item&px=AMDGPU-Multi-Monitor-vRAM-Clock

I've also gotten some dmesg output with 5.2.9 with amdgpu.dc_log=1 drm.debug=0x1e log_buf_len=2M. It turns out that amdgpu.dc_log=1 does nothing on this kernel, but I didn't know this when I ran the tests. The interesting added data appears to be coming from drm.debug=0x1e.

I have two (physically) identical LG 24UD58-B 4K60 monitors connected via DP. One test was done with both monitors connected to Radeon VII, and the other was done using my stable Intel+Radeon VII setup, where one monitor is connected to Radeon VII and the other is connected to the Intel iGPU (HD 630, also via DP at 4K60). These dmesg dumps were taken with all DMs/DEs/graphics disabled in order to limit interference. The system was booted to a text command line at native resolution.

Since 5.3 isn't changing anything, I plan to do a recompile of 5.2.9 (or 5.2.10 if it's out for Arch) with the smum_send_msg_to_smc_with_parameter patch suggested by Tom B.
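Since the useful extra output came from drm.debug=0x1e, it may help to see which debug categories that mask selects. The sketch below is illustrative only: the bit names are my reading of the DRM_UT_* categories in include/drm/drm_print.h for 5.x kernels and should be checked against the tree being built, and `decode_drm_debug` is a hypothetical helper name.

```python
# Assumed mapping of drm.debug bits to categories (DRM_UT_* in drm_print.h);
# verify against the kernel version in use before relying on it.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
    0x40: "STATE",
    0x80: "LEASE",
    0x100: "DP",
}


def decode_drm_debug(mask):
    """Return the category names whose bits are set in a drm.debug mask."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


print(decode_drm_debug(0x1E))
```

Under that assumed mapping, 0x1e enables the driver, KMS, PRIME and atomic categories but not the core one.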
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #115 from Tom B ---
I should have noted it earlier, but I had already tried reverting both "golden values" commits. I've no idea what they do, but reverting them didn't fix this crash.

One thing that would be insightful would be logging every call to smum_send_msg_to_smc_with_parameter and printing out message/parameter:

int smum_send_msg_to_smc_with_parameter(struct pp_hwmgr *hwmgr, uint16_t msg, uint32_t parameter) {

This would cause a very busy log, but we could see the last successful message that was sent and, with the same log in 5.0.13, see if there are any obvious differences. It might be that the previous message causes the invalid state, so knowing what that is could lead us towards the solution. I don't think I have time to try it today, but if anyone is recompiling the code, adding

pr_err("msg: %d / parameter: %d\n", msg, parameter);

to this function in smumgr.c would be a useful addition.

Also, if anyone wants to try re-compiling, here's a quick guide for Arch:

1. Get the kernel sources using asp as described here: https://wiki.archlinux.org/index.php/Kernel/Arch_Build_System and navigate to the created linux/repos/core-x86_64 directory.

2. You will need to run makepkg -s once to get it to download the sources.

3. You can set the kernel version in PKGBUILD, e.g. _srcver=5.2.7-arch1 or _srcver=5.0.13-arch1

4. If you want to revert one or more commits, put the reverts in the prepare() block before local src:

echo "$_kernelname" > localversion.20-pkgname
git revert db64a2f43c1bc22c5ff2d22606000b8c3587d0ec --no-edit
git revert f5e79735cab448981e245a41ee6cbebf0e334f61 --no-edit
local src

It will open your editor; if you don't want to use vi, set:

5. For making changes to the code you need to make a patch. Open the src/archlinux-linux directory. The files you're interested in are in drivers/gpu/drm/amd/powerplay, most likely hwmgr/vega20_hwmgr.c. Make your changes to the code. You can't just re-run makepkg, as it checks out the original version of the code. After making changes, navigate to the archlinux-linux directory and run

git diff > ../../vii.patch

6. Add your patch to the PKGBUILD source array:

source=(
  "$_srcname::git+https://git.archlinux.org/linux.git?signed#tag=v$_srcver"
  config         # the main kernel config file
  60-linux.hook  # pacman hook for depmod
  90-linux.hook  # pacman hook for initramfs regeneration
  linux.preset   # standard config files for mkinitcpio ramdisk
  vii.patch
)

7. I've been cheating with makepkg and getting it to skip hash checks, as otherwise you have to generate the sha256sums for each patch you create. This is an extra step that only slows down testing. To compile/install, run

makepkg -si --skipinteg

Because of the way makepkg works, it keeps the compiled code in the src directory. That means that although the first compile will take a few minutes, subsequent compiles will be a lot faster, as it'll probably only be recompiling vega20_hwmgr.c
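Once both kernels carry the pr_err line from comment #115, the two logs can be compared mechanically. What follows is a minimal Python sketch of that comparison, assuming the log lines contain exactly the `msg: %d / parameter: %d` text produced by that pr_err. The function names and the sample values are invented for illustration and are not taken from any real capture.

```python
import re

# Matches the text emitted by: pr_err("msg: %d / parameter: %d\n", msg, parameter);
MSG_RE = re.compile(r"msg: (-?\d+) / parameter: (-?\d+)")


def parse_smu_calls(lines):
    """Extract (msg, parameter) pairs in the order they were logged."""
    calls = []
    for line in lines:
        match = MSG_RE.search(line)
        if match:
            calls.append((int(match.group(1)), int(match.group(2))))
    return calls


def first_divergence(good, bad):
    """Return (index, good_call, bad_call) at the first difference, else None."""
    for index, (good_call, bad_call) in enumerate(zip(good, bad)):
        if good_call != bad_call:
            return index, good_call, bad_call
    if len(good) != len(bad):
        index = min(len(good), len(bad))
        good_call = good[index] if index < len(good) else None
        bad_call = bad[index] if index < len(bad) else None
        return index, good_call, bad_call
    return None


# Made-up sample lines standing in for a working and a crashing boot.
good_log = [
    "[ 5.100000] msg: 10 / parameter: 0",
    "[ 5.200000] msg: 40 / parameter: 1001",
    "[ 5.300000] msg: 38 / parameter: 300",
]
bad_log = [
    "[ 5.100000] msg: 10 / parameter: 0",
    "[ 5.200000] msg: 40 / parameter: 1001",
    "[ 5.300000] msg: 38 / parameter: 0",
]

print(first_divergence(parse_smu_calls(good_log), parse_smu_calls(bad_log)))
```

Note that the pr_err prints message IDs in decimal, so the 0x28 and 0x26 seen in the powerplay failure lines would show up as 40 and 38 in such a log.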
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #114 from ReddestDream ---
5. Tom B., it is probably worth getting a full dmesg with your two monitors in on a relatively new 5.2.x kernel using at least:

amdgpu.dc_log=1 drm.debug=0x1e log_buf_len=2M

And anything else you might think of. Just to try to get more debug info.

Thx!
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #113 from ReddestDream ---
4.
> Given that two different versions of the code produce the same result, my
> hunch is that the problem is B. The card is not in a state where it's able to
> receive power changes.

Something to consider: In pretty much all the dmesg logs we see, amdgpu attempts to reset the GPU, sometimes successfully, and yet it still can't properly message the GPU afterward, and we see the same sequence of failures starting with

"amdgpu: [powerplay] Failed to send message 0x28, response 0x0
amdgpu: [powerplay] [SetUclkToHightestDpmLevel] Set hard min uclk failed!"

Eventually we start to see:

"[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!"

This comes from:

https://github.com/torvalds/linux/commits/master/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c

I'm not sure what the -125 error code indicates. My guess is ECANCELED (Operation Cancelled), i.e. errno 125 negated.

https://github.com/torvalds/linux/blob/master/include/uapi/asm-generic/errno.h
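The guess about -125 can be checked against the errno header linked above. Here is a small Python sketch of that lookup; the two numeric values are copied from include/uapi/asm-generic/errno.h and hard-coded so the answer reflects the Linux meaning on any platform, and `decode_kernel_error` is a hypothetical helper name.

```python
import errno

# Values from include/uapi/asm-generic/errno.h (Linux).
LINUX_ERRNO = {
    110: "ETIMEDOUT",
    125: "ECANCELED",
}


def decode_kernel_error(code):
    """Map a negated kernel error code such as -125 to its errno name."""
    number = abs(code)
    # Fall back to the local platform's table for codes not listed above.
    return LINUX_ERRNO.get(number) or errno.errorcode.get(number, "unknown")


print(decode_kernel_error(-125))
print(decode_kernel_error(-110))
```

The same lookup covers the (-110) reported by the failing IB tests elsewhere in this thread, which corresponds to a timeout.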
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #112 from ReddestDream ---
More ideas:

3. Looking through the crash in sehellion's comment 45:

gfx_v9_0_ring_test_ring+0x19e/0x230 [amdgpu]
amdgpu_ring_test_helper+0x1e/0x90 [amdgpu]
gfx_v9_0_hw_fini+0x299/0x690 [amdgpu]
amdgpu_device_ip_suspend_phase2+0x6c/0xa0 [amdgpu]
amdgpu_device_ip_suspend+0x44/0x80 [amdgpu]
amdgpu_device_pre_asic_reset+0x1ef/0x204 [amdgpu]
amdgpu_device_gpu_recover+0x7b/0x7a3 [amdgpu]
amdgpu_job_timedout+0xfc/0x120 [amdgpu]

We see gfx_v9_0_ring_test and gfx_v9_0_hw_fini, which both come from:

https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c

There's a 5.1-rc1 commit in this file pertaining to a "wave ID mismatch" that could cause deadlocks.

https://github.com/torvalds/linux/commit/41cca166cc57e75e94d888595a428d23a3bf4e36

Along with updated "golden values" for Vega in 5.1-rc1:

https://github.com/torvalds/linux/commit/919a94d8101ebc29868940b580fe9e9811b7dc86
https://github.com/torvalds/linux/commit/f7b1844bacecca96dd8d813675e4d8adec02cd66
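When reading traces like the one in comment #112, it can be handy to pull the function names out and see which ones belong to a given source file prefix. The following Python sketch does that with a regular expression; it assumes frames are written as `name+0xOFFSET/0xSIZE [module]`, as in the trace quoted above, and `parse_frames` is an illustrative name only.

```python
import re

# One kernel stack frame: symbol+0xoffset/0xsize, optionally followed by [module].
FRAME_RE = re.compile(
    r"([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?:\s+\[(\w+)\])?"
)


def parse_frames(text):
    """Return a list of (symbol, offset, size, module) tuples, innermost first."""
    return [
        (m.group(1), int(m.group(2), 16), int(m.group(3), 16), m.group(4))
        for m in FRAME_RE.finditer(text)
    ]


# The trace quoted in comment #112.
trace = """
gfx_v9_0_ring_test_ring+0x19e/0x230 [amdgpu]
amdgpu_ring_test_helper+0x1e/0x90 [amdgpu]
gfx_v9_0_hw_fini+0x299/0x690 [amdgpu]
amdgpu_device_ip_suspend_phase2+0x6c/0xa0 [amdgpu]
amdgpu_device_ip_suspend+0x44/0x80 [amdgpu]
amdgpu_device_pre_asic_reset+0x1ef/0x204 [amdgpu]
amdgpu_device_gpu_recover+0x7b/0x7a3 [amdgpu]
amdgpu_job_timedout+0xfc/0x120 [amdgpu]
"""

frames = parse_frames(trace)
gfx_v9_symbols = [symbol for symbol, _, _, _ in frames if symbol.startswith("gfx_v9_0")]
print(gfx_v9_symbols)
```

Filtering by the `gfx_v9_0` prefix is only a naming heuristic for locating the likely source file; it is not a substitute for resolving symbols against the built kernel.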
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #111 from ReddestDream ---
A few other ideas to ponder:

1. Looking into DPM, I found this commit for 5.1-rc1 that looks interesting:

https://github.com/torvalds/linux/commit/7ca881a8651bdeffd99ba8e0010160f9bf60673e

Looks like it exposes a "ppfeatures" interface on Vega 10 and later GPUs, including some code for Vega 20.

2. I also found two interesting commits that pertain to "doorbell" register initialization on Vega 20. Also from 5.1-rc1. Might be related to setting up the GPU ASICs. I must admit I'm not exactly sure what these do . . .

https://github.com/torvalds/linux/commit/fd4855409f6ebe015406cd2b2ffa4fee4cd1f4a7
https://github.com/torvalds/linux/commit/828845b7c86c5338f6ca02f4b525718f31b2
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #110 from ReddestDream ---
> 1. The functions in vega20_ppt.c are used with this new patch so that answers
> my question from earlier, that's what this file is for and why it contains
> similar/identical functions.

I was hoping this was the case as the duplicated functions were confusing me too. Glad we got this figured out! :)

> I tried it, it didn't help the crashing issue and I was stuck at 30w. As soon
> as I started sddm the system froze. I've attached my dmesg from amdgpu.dpm=2
> boot. It doesn't fix the issue but it does help answer a few questions I had:

This is disappointing tho. I was hoping that setting amdgpu.dpm=2 would use the more "actively developed" path and that would fix the issue. :/

> Given that two different versions of the code produce the same result, my
> hunch is that the problem is B. The card is not in a state where it's able to
> receive power changes.

I tend to agree, but it's still not clear why or how the card ends up in a bad state when commands to it via smu_send_smc_msg_with_param seem to just suddenly stop working. And given the amount of same/similar functions in vega20_hwmgr.c and vega20_ppt.c, it's hard to rule out A entirely.

Since amdgpu.dpm=0 resolves the issue (albeit at the cost of being stuck at minimum clocks inherited from the VBIOS/GOP/UEFI/firmware), it seems that the card is starting out in a reasonable state and then being thrown into a bad state later by bad driver code. And that code is part of the DPM (Dynamic Power Management) system.

We are pretty confident that dpm_state.hard_min_level is stable the whole time, so that's probably not what's throwing the card into a bad state. But perhaps another value in the DPM table is . . .

It doesn't make intuitive sense that the soft min/max values would be problematic, since they are presumably "more flexible," but it's possible that they get calculated out of spec or something, and logging them should be possible in the same way dpm_state.hard_min_level was logged.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #109 from Tom B ---
Created attachment 145080 --> https://bugs.freedesktop.org/attachment.cgi?id=145080&action=edit
dmesg with amdgpu.dpm=2

> Tom B., did you try booting with amdgpu.dpm=1 or amdgpu.dpm=2 (default is
> generally -1 for automatic)? Seems like one of those might enable the new
> experimental SW SMU v11 feature on Vega20 . . .

Now that is interesting. dpm=-1 is the same as the default, and the default is 1 (enabled), so dpm=1 is what we've been using all along. But dpm=2 and the patch you linked to are interesting.

I tried it. It didn't help the crashing issue and I was stuck at 30 W. As soon as I started sddm the system froze. I've attached my dmesg from the amdgpu.dpm=2 boot. It doesn't fix the issue, but it does help answer a few questions I had:

1. The functions in vega20_ppt.c are used with this new patch, so that answers my question from earlier: that's what this file is for and why it contains similar/identical functions.

2. It explains the difference I found in comment 97: this commit https://github.com/torvalds/linux/commit/94ed6d0cfdb867be9bf05f03d682980bce5d0036 has the new else block for smu_display_configuration_change, which we now know is the software version of this function.

More importantly, though, knowing that enabling DPM causes the crash, this tells us either:

A) The bug is present in both versions of the vega20 code: vega20_hwmgr.c and vega20_ppt.c

or

B) The card reaches an invalid state before DPM is initialised, and the card is fine until it receives a DPM change.

Given that two different versions of the code produce the same result, my hunch is that the problem is B. The card is not in a state where it's able to receive power changes.
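Because the thread juggles several amdgpu.dpm values, a quick way to confirm which one a running system booted with may be useful. This Python sketch reads the module parameter from sysfs and labels it using the meanings discussed in the comments above. The sysfs path is an assumption (it is where readable module parameters normally appear), the description of value 2 simply repeats what this thread reports for Vega 20, and the names `describe_dpm` and `current_dpm` are hypothetical.

```python
from pathlib import Path

# Meanings as discussed in this thread; value 2 is described here only as the
# experimental software SMU path for Vega 20, per the comments above.
DPM_MEANINGS = {
    -1: "auto (driver default)",
    0: "DPM disabled",
    1: "DPM enabled",
    2: "software SMU path (Vega 20, experimental per this thread)",
}


def describe_dpm(value):
    """Return a human-readable label for an amdgpu.dpm value."""
    return DPM_MEANINGS.get(value, "unknown value")


def current_dpm(path="/sys/module/amdgpu/parameters/dpm"):
    """Read the active amdgpu.dpm value, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None  # module not loaded, file missing, or unexpected content


value = current_dpm()
if value is None:
    print("amdgpu dpm parameter not readable on this system")
else:
    print(f"amdgpu.dpm={value}: {describe_dpm(value)}")
```

On a machine without the amdgpu module loaded the sketch just reports that the parameter is not readable.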
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #108 from ReddestDream ---
> Booting with amdgpu.dpm=0 on 5.2.7 works.

Tom B., did you try booting with amdgpu.dpm=1 or amdgpu.dpm=2 (default is generally -1 for automatic)? Seems like one of those might enable the new experimental SW SMU v11 feature on Vega20 . . .

https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html
https://lists.freedesktop.org/archives/amd-gfx/2019-January/030788.html?print=anzwix
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #107 from ReddestDream ---
> Booting with amdgpu.dpm=0 on 5.2.7 works.
> It is a DPM issue of some kind so although my earlier tests showed that
> hard_min_level was set correctly, it still could be an issue elsewhere in the
> DPM table.

Great news! At least now we have a better place to investigate . . .
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #106 from Tom B --- Booting with amdgpu.dpm=0 on 5.2.7 works. Performance is poor and as expected I cannot get any information about power states because /sys/kernel/debug/dri/0/amdgpu_pm_info doesn't exist. I'm guessing it runs at minimum clocks as I get ~10-17fps in unigine-heaven instead of ~60-100. It is a DPM issue of some kind so although my earlier tests showed that hard_min_level was set correctly, it still could be an issue elsewhere in the DPM table.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674 --- Comment #105 from Tom B ---
> Also, I considered that both of my monitors have audio out support. I wonder
> if audio initialization might be the missing piece to the puzzle, the thing
> that interrupts/changes the state of the card and prevents
> smu_send_smc_msg_with_param from working where it did before. I know that in
> the past with previous AMD cards, display audio has been buggy.

I just tried setting amdgpu.audio=0 and it didn't help. That doesn't rule out audio entirely, though: the audio backend is probably still used as part of the connection to the monitor, and I'd imagine the option just prevents the card from appearing as an output device.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #104 from Tom B ---
I did get very similar crashing when I was running HDMI + DP at different refresh rates (see https://bugs.freedesktop.org/show_bug.cgi?id=110510). I switched to DP + DP because HDMI + DP wasn't stable; it could be related.

The tl;dr from that bug report, and this was on 5.0.9:

- HDMI alone at 60hz works but the screen flickers off every 3-5 minutes
- HDMI alone works at 59.9hz without any flickering
- HDMI 60hz + DP 60hz works, but the HDMI screen flickers off every 3-5 minutes
- HDMI 59.94hz + DP 60hz freezes the PC instantly.

Unfortunately my monitors don't support DisplayPort at 59.94hz so I couldn't test that combination, which I think would have worked. Still, it does tell us that these could be related and the issue could be syncing between the two displays.
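As a rough illustration of why a 60 Hz / 59.94 Hz pair could matter for syncing: two free-running displays at those rates drift apart by one whole frame roughly every 17 seconds. The C sketch below only does that arithmetic. frame_slip_period is a hypothetical helper name, not anything in amdgpu, and it says nothing about how the driver actually schedules clock switches.

```c
/* Hypothetical helper, not part of amdgpu: seconds until two displays
 * running at hz_a and hz_b drift apart by one whole frame.
 * Returns a negative value when the rates are equal (they never drift). */
static double frame_slip_period(double hz_a, double hz_b)
{
    double diff = (hz_a > hz_b) ? (hz_a - hz_b) : (hz_b - hz_a);

    if (diff == 0.0)
        return -1.0;
    return 1.0 / diff;
}
```

frame_slip_period(60.0, 59.94) comes out at about 16.7 seconds, so with mismatched rates the blanking intervals of the two heads realign only once in that period, while two heads at the same rate keep whatever alignment they start with.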
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #103 from Peter Hercek ---
I boot in BIOS mode and I'm still getting these errors, though they are rare in my case with the "better" kernels (around once a week).

Just a note: there were tearing errors in the Windows drivers for Radeon VII too. One of the reasons for it was different refresh rates on different monitors. They recommended setting all refresh rates to 60 Hz or a multiple of it until it is fixed. In my case that is not completely possible (one monitor supports 60 Hz, but the other two monitors support only 59.95 Hz), so I have a slight difference in the frequencies.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #102 from Tom B ---
> Grasping at straws a bit here, but it occurred to me that maybe Linux kernel
> testing on Radeon VII was done on an early VBIOS that didn't have full UEFI
> support yet. We know that AMD had to issue a VBIOS update for Radeon VII to
> fix UEFI support shortly after the launch. So maybe enabling the CSM/Legacy
> Support in the BIOS, which does impact early GPU initialization, might have
> some effect on the multimonitor problem? Something I plan to test, but I
> wanted to share the idea in case someone else has a chance first.

I had already tried that, unfortunately. I tried the following BIOS options:

- CSM on/off
- IOMMU on/off
- PCIE speed 16x/4x (the only options my motherboard allowed for some reason)

Having said that, I didn't try booting using grub in BIOS mode as I didn't want to change my partition table, so it's possible that although I had enabled CSM, it only provided legacy support and the system was still booting in UEFI mode.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #101 from ReddestDream ---
Grasping at straws a bit here, but it occurred to me that maybe Linux kernel testing on Radeon VII was done on an early VBIOS that didn't have full UEFI support yet. We know that AMD had to issue a VBIOS update for Radeon VII to fix UEFI support shortly after the launch. So maybe enabling the CSM/Legacy Support in the BIOS, which does impact early GPU initialization, might have some effect on the multimonitor problem? Something I plan to test, but I wanted to share the idea in case someone else has a chance first.

> This might not mean anything, but it could be another clue that initialization
> is happening before the card is really ready.

Also, I considered that both of my monitors have audio out support. I wonder if audio initialization might be the missing piece to the puzzle, the thing that interrupts/changes the state of the card and prevents smu_send_smc_msg_with_param from working where it did before. I know that in the past with previous AMD cards, display audio has been buggy . . .
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #100 from Tom B ---
I've been trying to work backwards to find the place where screens get initialised and eventually call vega20_pre_display_configuration_changed_task.

vega20_pre_display_configuration_changed_task is exported as pp_hwmgr_func::display_config_changed, which is called from hardwaremanager.c:phm_pre_display_configuration_changed.

phm_pre_display_configuration_changed is called from hwmgr.c:hwmgr_handle_task:

    switch (task_id) {
    case AMD_PP_TASK_DISPLAY_CONFIG_CHANGE:
        ret = phm_pre_display_configuration_changed(hwmgr);

pp_dpm_dispatch_tasks is exported as amd_pm_funcs::dispatch_tasks, which is called from amdgpu_dpm_dispatch_task, which is called in amdgpu_pm.c:

    void amdgpu_pm_compute_clocks(struct amdgpu_device *adev)
    {
        int i = 0;

        if (!adev->pm.dpm_enabled)
            return;

        if (adev->mode_info.num_crtc)
            amdgpu_display_bandwidth_update(adev);

        for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
            struct amdgpu_ring *ring = adev->rings[i];
            if (ring && ring->sched.ready)
                amdgpu_fence_wait_empty(ring);
        }

        if (is_support_sw_smu(adev)) {
            struct smu_context *smu = &adev->smu;
            struct smu_dpm_context *smu_dpm = &adev->smu.smu_dpm;
            mutex_lock(&(smu->mutex));
            smu_handle_task(&adev->smu,
                            smu_dpm->dpm_level,
                            AMD_PP_TASK_DISPLAY_CONFIG_CHANGE);
            mutex_unlock(&(smu->mutex));
        } else {
            if (adev->powerplay.pp_funcs->dispatch_tasks) {
                if (!amdgpu_device_has_dc_support(adev)) {
                    mutex_lock(&adev->pm.mutex);
                    amdgpu_dpm_get_active_displays(adev);
                    adev->pm.pm_display_cfg.num_display = adev->pm.dpm.new_active_crtc_count;
                    adev->pm.pm_display_cfg.vrefresh = amdgpu_dpm_get_vrefresh(adev);
                    adev->pm.pm_display_cfg.min_vblank_time = amdgpu_dpm_get_vblank_time(adev);
                    /* we have issues with mclk switching with refresh rates over 120 hz on the non-DC code. */
                    if (adev->pm.pm_display_cfg.vrefresh > 120)
                        adev->pm.pm_display_cfg.min_vblank_time = 0;
                    if (adev->powerplay.pp_funcs->display_configuration_change)
                        adev->powerplay.pp_funcs->display_configuration_change(
                            adev->powerplay.pp_handle,
                            &adev->pm.pm_display_cfg);
                    mutex_unlock(&adev->pm.mutex);
                }
                amdgpu_dpm_dispatch_task(adev, AMD_PP_TASK_DISPLAY_CONFIG_CHANGE, NULL);
            } else {
                mutex_lock(&adev->pm.mutex);
                amdgpu_dpm_get_active_displays(adev);
                amdgpu_dpm_change_power_state_locked(adev);
                mutex_unlock(&adev->pm.mutex);
            }
        }
    }

This is the only place I can see AMD_PP_TASK_DISPLAY_CONFIG_CHANGE being dispatched from, which eventually is where vega20_pre_display_configuration_changed_task gets called.

Presumably the code:

    for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
        struct amdgpu_ring *ring = adev->rings[i];
        if (ring && ring->sched.ready)
            amdgpu_fence_wait_empty(ring);
    }

is what generates

    [3.683718] amdgpu :44:00.0: ring gfx uses VM inv eng 0 on hub 0
    [3.683719] amdgpu :44:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
    [3.683720] amdgpu :44:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
    [3.683720] amdgpu :44:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
    [3.683721] amdgpu :44:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
    [3.683722] amdgpu :44:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
    [3.683722] amdgpu :44:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
    [3.683723] amdgpu :44:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
    [3.683724] amdgpu :44:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
    [3.683724] amdgpu :44:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
    [3.683725] amdgpu :44:00.0: ring sdma0 uses VM inv eng 0 on hub 1
    [3.683726] amdgpu :44:00.0: ring page0 uses VM inv eng 1 on hub 1
    [3.683726] amdgpu :44:00.0: ring sdma1 uses VM inv eng 4 on hub 1
    [3.683727] amdgpu :44:00.0: ring page1 uses VM inv eng 5 on hub 1
    [3.683728] amdgpu :44:00.0: ring uvd_0 uses VM inv eng 6 on hub 1
    [3.683728] amdgpu :44:00.0: ring uvd_enc_0.0 uses
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #99 from Tom B ---
Created attachment 145062
  --> https://bugs.freedesktop.org/attachment.cgi?id=145062&action=edit
a list of commits 5.0.13 - 5.1.0

Attached is a list of all amdgpu and powerplay commits from 5.0.13 - 5.1.0.

I have tried reverting the following, which looked like the most likely culprits:

- 919a94d8101ebc29868940b580fe9e9811b7dc86 drm/amdgpu: fix CPDMA hang in PRT mode for VEGA20
- f7b1844bacecca96dd8d813675e4d8adec02cd66 drm/amdgpu: Update gc golden setting for vega family
- d25689760b747287c6ca03cfe0729da63e0717f4 drm/amdgpu/display: Keep malloc ref to MST port (a change to the way DisplayPort connectors are handled; looked promising)
- db64a2f43c1bc22c5ff2d22606000b8c3587d0ec drm/amd/powerplay: fix possible hang with 3+ 4K monitors

I also looked at that last one in detail as it seems very close to this bug. Nothing in the code looks for 3+ monitors or even 4K. It only actually looks for > 1 monitor. Although it's based on disable_mclk_switching, I also tried forcing disable_fclk_switching to true and false; neither had any effect. The result is that mclk would be calculated based on screens but fclk would be forced on/off. It didn't help, but I can't help thinking that this commit is a little too close to this issue to be irrelevant.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #98 from Sylvain BERTRAND ---
> The code seems very similar to what we see in
> vega20_notify_smc_display_config_after_ps_adjustment near where we get the "
> [SetHardMinFreq] Set hard min uclk failed!" Maybe this
> smum_send_msg_to_smc_with_parameter get through where others fail because of
> the formatting or something?

It seems there is a patch from amd about smu v11 and this smc/smu command. I may be wrong though.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #97 from Tom B ---
I've been investigating this: https://github.com/torvalds/linux/commit/94ed6d0cfdb867be9bf05f03d682980bce5d0036

Because vega20 doesn't export display_configuration_change, it jumps to the newly added else block and calls smu_display_configuration_change. This didn't happen in 5.0.13. It's not the cause of this bug, as I commented it out and it still breaks. I'll also note that pp_display_cfg->display_count is correct at this point: it shows 2 for me with 2 screens connected.

But why doesn't vega20 export display_configuration_change? It has display_config_changed, and I can't find where that's called from, so I wonder if display_config_changed should be called at this point.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #96 from Tom B ---
Created attachment 145047
  --> https://bugs.freedesktop.org/attachment.cgi?id=145047&action=edit
logging anywhere the number of screens is set

Again, no closer to a fix but another thing to rule out. In addition to SMU_MSG_NumOfDisplays, PPSMC_MSG_NumOfDisplays is also used.

I put a debug message anywhere PPSMC_MSG_NumOfDisplays or SMU_MSG_NumOfDisplays is set and put else blocks in places where it may have been set:

    if ((data->water_marks_bitmap & WaterMarksExist) &&
        data->smu_features[GNLD_DPM_DCEFCLK].supported &&
        data->smu_features[GNLD_DPM_SOCCLK].supported) {
        pr_err("vega20_display_configuration_changed_task setting PPSMC_MSG_NumOfDisplays to %d\n", hwmgr->display_config->num_display);
        result = smum_send_msg_to_smc_with_parameter(hwmgr,
            PPSMC_MSG_NumOfDisplays,
            hwmgr->display_config->num_display);
    } else {
        pr_err("vega20_display_configuration_changed_task not setting PPSMC_MSG_NumOfDisplays\n");
    }

    return result;
    }

Here's what I found:

- The functions dealing with screens in vega20_ppt.c are never used (vega20_display_config_changed, vega20_pre_display_config_changed) and can be ignored for our further tests.

- The line:

      result = smum_send_msg_to_smc_with_parameter(hwmgr, PPSMC_MSG_NumOfDisplays, hwmgr->display_config->num_display);

  is never executed. It always triggers the else block, so PPSMC_MSG_NumOfDisplays is never set using num_display.

- The same thing happens in 5.0.13. When I saw the above result I had hoped that the problem was that smum_send_msg_to_smc_with_parameter(hwmgr, PPSMC_MSG_NumOfDisplays, hwmgr->display_config->num_display); was never called with the correct number of displays. Unfortunately the behaviour is the same on 5.0.13: PPSMC_MSG_NumOfDisplays is only ever set to zero in both versions of the kernel.

Unfortunately this doesn't get us any closer. The instruction is sent a lot more in 5.0.13 though.

5.0.13:

    [3.475471] amdgpu :44:00.0: ring vce1 uses VM inv eng 13 on hub 1
    [3.475472] amdgpu :44:00.0: ring vce2 uses VM inv eng 14 on hub 1
    [3.475508] amdgpu: [powerplay] vega20_display_configuration_changed_task not setting PPSMC_MSG_NumOfDisplays
    [3.794037] amdgpu: [powerplay] vega20_pre_display_configuration_changed_task setting PPSMC_MSG_NumOfDisplays to 0
    [3.800180] amdgpu: [powerplay] vega20_display_configuration_changed_task not setting PPSMC_MSG_NumOfDisplays
    [3.833502] amdgpu: [powerplay] vega20_pre_display_configuration_changed_task setting PPSMC_MSG_NumOfDisplays to 0
    [3.833647] amdgpu: [powerplay] vega20_display_configuration_changed_task not setting PPSMC_MSG_NumOfDisplays
    [4.153232] [drm] Initialized amdgpu 3.27.0 20150101 for :44:00.0 on minor 0
    [4.664044] amdgpu: [powerplay] vega20_pre_display_configuration_changed_task setting PPSMC_MSG_NumOfDisplays to 0

5.2.7:

    [3.711028] amdgpu :44:00.0: ring vce1 uses VM inv eng 13 on hub 1
    [3.711028] amdgpu :44:00.0: ring vce2 uses VM inv eng 14 on hub 1
    [4.086310] amdgpu: [powerplay] vega20_pre_display_configuration_changed_task setting PPSMC_MSG_NumOfDisplays to 0
    [4.385470] [drm] Initialized amdgpu 3.32.0 20150101 for :44:00.0 on minor 0
    [4.522398] amdgpu: [powerplay] Failed to send message 0x28, response 0x0

Notice that the display configuration tasks run 5 times between the ring lines and the initialization line in 5.0.13 (vega20_pre_display_configuration_changed_task twice and vega20_display_configuration_changed_task three times) and only once in 5.2.7. This might not mean anything, but it could be another clue that initialization is happening before the card is really ready.
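The guard that decides whether PPSMC_MSG_NumOfDisplays is sent with num_display can be looked at in isolation. Below is a minimal user-space C sketch of that three-part condition. The struct, the function name and the bit value of WATER_MARKS_EXIST are assumptions made for illustration and are not copied from the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit value, for illustration only. */
#define WATER_MARKS_EXIST 0x1u

/* Hypothetical flattened view of the three inputs the guard reads. */
struct guard_inputs {
    uint32_t water_marks_bitmap;
    bool dcefclk_supported;
    bool socclk_supported;
};

/* True only when all three conditions hold, i.e. when the
 * "setting PPSMC_MSG_NumOfDisplays" branch would be taken. */
static bool would_send_num_displays(const struct guard_inputs *in)
{
    return (in->water_marks_bitmap & WATER_MARKS_EXIST) &&
           in->dcefclk_supported &&
           in->socclk_supported;
}
```

Any one of the three operands being false is enough to land in the else block, so the "not setting" log line alone does not say which one it was. Logging the three values separately in the real function would narrow that down.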
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #95 from Tom B ---
So here's something interesting. In 5.0.13 there is no function vega20_display_config_changed. This function issues

    smu_send_smc_msg_with_param(smu, SMU_MSG_NumOfDisplays, 0);

In fact, in 5.0.13 there is no reference at all to SMU_MSG_NumOfDisplays anywhere in the amdgpu driver. Which means either the way the number of displays is configured has changed since 5.0.13, or 5.0.13 did it with a hardcoded value instead of a constant.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #94 from Tom B ---
Reverting d1a3e239a6016f2bb42a91696056e223982e8538 didn't fix it for me. But that commit may give some insight because it is related to uclk, which is the first error we get.

I also tried globally increasing usec_timeout as it's used in a few places (patch below). This makes the PC take about a minute to boot up, so clearly the GPU is in an invalid state before these timeouts are hit, and then each subsequent call to smum_send_msg_to_smc_with_parameter causes a delay because each call times out. Whatever happens puts the card into a state that it can't recover from.

The next step is to try to find where vega20_set_uclk_to_highest_dpm_level is called from and see what happens just before the call to this function.

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f4ac632a87b2..9b878c74b17e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2418,7 +2418,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
     adev->pdev = pdev;
     adev->flags = flags;
     adev->asic_type = flags & AMD_ASIC_MASK;
-    adev->usec_timeout = AMDGPU_MAX_USEC_TIMEOUT;
+    adev->usec_timeout = AMDGPU_MAX_USEC_TIMEOUT*10;
     if (amdgpu_emu_mode == 1)
         adev->usec_timeout *= 2;
     adev->gmc.gart_size = 512 * 1024 * 1024;
diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/hwmgr.c b/drivers/gpu/drm/amd/powerplay/hwmgr/hwmgr.c
index a7e8340baf90..a6b2bc4277ef 100644
--- a/drivers/gpu/drm/amd/powerplay/hwmgr/hwmgr.c
+++ b/drivers/gpu/drm/amd/powerplay/hwmgr/hwmgr.c
@@ -84,7 +84,7 @@ int hwmgr_early_init(struct pp_hwmgr *hwmgr)
     if (!hwmgr)
         return -EINVAL;
 
-    hwmgr->usec_timeout = AMD_MAX_USEC_TIMEOUT;
+    hwmgr->usec_timeout = AMD_MAX_USEC_TIMEOUT*10;
     hwmgr->pp_table_version = PP_TABLE_V1;
     hwmgr->dpm_level = AMD_DPM_FORCED_LEVEL_AUTO;
     hwmgr->request_dpm_level = AMD_DPM_FORCED_LEVEL_AUTO;
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #93 from Chris Hodapp ---
Note: It might be good for someone else to double-check my conclusion before too much stock is put into it. Scientific method and all that.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #92 from ReddestDream ---
> If you follow the callstack:

I've been thinking all this over. The only thing, unfortunately, that really sticks out at me still is how Chris Hodapp says that reverting this commit:

https://github.com/torvalds/linux/commit/d1a3e239a6016f2bb42a91696056e223982e8538#diff-0bc07842bc28283d64ffa6dd2ed716de

seems to improve things. Considering that we now know from Tom B.'s work that dpm_state.hard_min_level is apparently calculated correctly and stable the entire time, it doesn't make sense that reverting this commit could fix anything.

The code seems very similar to what we see in vega20_notify_smc_display_config_after_ps_adjustment near where we get the "[SetHardMinFreq] Set hard min uclk failed!" Maybe this smum_send_msg_to_smc_with_parameter gets through where others fail because of the formatting or something?

Thanks again Tom B. for all your testing. I'd like to do some tests of my own, but time's just not permitting for me ATM. Hoping to be more free next weekend. :/
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #91 from ReddestDream ---
> It returns 0 on success and -EIO on failure, which is then in turn returned
> from vega20_set_fclk_to_highest_dpm_level. Where did you see the check/retry on
> EINVAL? Perhaps -EIO should be -EINVAL?

I didn't find check/retry code. It was more just a thought that maybe we could keep vega20_set_uclk_to_highest_dpm_level from just returning despite the error and allowing further initialization to proceed. Even if it crashed, that might even be helpful, since it's not clear if it's the initialization (drm_dev_register) or something else that is silent in the logs that is changing something and causing vega20_set_uclk_to_highest_dpm_level to fail where we know it succeeded so many times before.

> I'm not sure this is helpful but I managed to somewhat test the race condition
> theory.

If there is a race, I'm not sure it's in the time the driver waits for the hardware registers to respond and/or the value to set. But it's still enlightening. At this point it seems more likely that something else we aren't seeing in the logs is breaking vega20_set_uclk_to_highest_dpm_level in the last moments (unlikely due to the dpm_state.hard_min_level value); it falls through, drm_dev_register runs and the initialization message prints. amdgpu doesn't consider the "[SetUclkToHightestDpmLevel] Set hard min uclk failed!" to be a significant enough error to stop initialization. But maybe it should . . .
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #90 from Tom B ---
I'm not sure this is helpful but I managed to somewhat test the race condition theory.

If you follow the callstack:

    vega20_set_fclk_to_highest_dpm_level
      -> smum_send_msg_to_smc_with_parameter
      -> vega20_send_msg_to_smc_with_parameter
      -> vega20_wait_for_response
      -> phm_wait_for_register_unequal

you find this code in smu_helper.c:

    int phm_wait_on_register(struct pp_hwmgr *hwmgr, uint32_t index,
                             uint32_t value, uint32_t mask)
    {
        uint32_t i;
        uint32_t cur_value;

        if (hwmgr == NULL || hwmgr->device == NULL) {
            pr_err("Invalid Hardware Manager!");
            return -EINVAL;
        }

        for (i = 0; i < hwmgr->usec_timeout; i++) {
            cur_value = cgs_read_register(hwmgr->device, index);
            if ((cur_value & mask) == (value & mask))
                break;
            udelay(1);
        }

        /* timeout means wrong logic*/
        if (i == hwmgr->usec_timeout)
            return -1;
        return 0;
    }

The timeout there is interesting. I increased it:

    for (i = 0; i < hwmgr->usec_timeout*10; i++) {
        cur_value = cgs_read_register(hwmgr->device, index);
        if ((cur_value & mask) == (value & mask))
            break;
        udelay(1);
    }

The PC takes significantly longer to boot (10 or so seconds when it's usually instant) and the error still occurs. So I'm not sure it's just a matter of waiting.
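The same poll-until-match pattern can be tried outside the kernel to see why a longer timeout changes nothing when the register never reaches the expected value. This is a user-space C sketch only: wait_on_register, stuck_register, slow_register and struct slow_reg are hypothetical names, the register reads are simulated, and the udelay(1) between polls is left out.

```c
#include <stdint.h>

/* Hypothetical stand-in for a register read (the driver code quoted in
 * the thread uses cgs_read_register). */
typedef uint32_t (*read_reg_fn)(void *ctx);

/* Poll until (reg & mask) == (value & mask) or max_polls reads are used.
 * Returns 0 on match and -1 on timeout; *polls_used gets the number of
 * non-matching reads performed before returning. */
static int wait_on_register(read_reg_fn read_reg, void *ctx,
                            uint32_t value, uint32_t mask,
                            uint32_t max_polls, uint32_t *polls_used)
{
    uint32_t i;

    for (i = 0; i < max_polls; i++) {
        if ((read_reg(ctx) & mask) == (value & mask))
            break;
        /* the kernel loop sleeps 1 us here; omitted in this sketch */
    }
    if (polls_used)
        *polls_used = i;
    return (i == max_polls) ? -1 : 0;
}

/* Simulated register that never changes, like a firmware that has
 * stopped answering. */
static uint32_t stuck_register(void *ctx)
{
    (void)ctx;
    return 0;
}

/* Simulated register that reads 1 only after ready_after reads. */
struct slow_reg {
    uint32_t reads;
    uint32_t ready_after;
};

static uint32_t slow_register(void *ctx)
{
    struct slow_reg *r = ctx;

    return (++r->reads > r->ready_after) ? 1u : 0u;
}
```

With slow_register, raising max_polls from 100 to 1000 turns a timeout into a success. With stuck_register the result is -1 at either setting and the only difference is that ten times as many polls are burned, which is the pattern described above: a much slower boot and the same error.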
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #89 from Tom B ---
> It should return -EINVAL instead. Maybe then it would reset and try again
> instead of just ignoring it and continuing with initialization anyway,
> leading to instability.

If you look at vega20_send_msg_to_smc_with_parameter:

    static int vega20_send_msg_to_smc_with_parameter(struct pp_hwmgr *hwmgr,
                                                     uint16_t msg, uint32_t parameter)
    {
        struct amdgpu_device *adev = hwmgr->adev;
        int ret = 0;

        vega20_wait_for_response(hwmgr);

        WREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90, 0);
        WREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_82, parameter);

        vega20_send_msg_to_smc_without_waiting(hwmgr, msg);

        ret = vega20_wait_for_response(hwmgr);
        if (ret != PPSMC_Result_OK)
            pr_err("Failed to send message 0x%x, response 0x%x\n", msg, ret);

        return (ret == PPSMC_Result_OK) ? 0 : -EIO;
    }

It returns 0 on success and -EIO on failure, which is then in turn returned from vega20_set_fclk_to_highest_dpm_level. Where did you see the check/retry on EINVAL? Perhaps -EIO should be -EINVAL?
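On the retry question: an error code by itself does not cause a retry; some caller has to loop on it. Below is a user-space C sketch of what such a caller could look like, purely as an illustration of the idea. send_with_retry, flaky_send and struct flaky are hypothetical names, and nothing in this thread shows the driver doing this.

```c
#include <errno.h>

/* Hypothetical stand-in for a message sender that returns 0 on success
 * and -EIO on failure, as the function quoted above does. */
typedef int (*send_msg_fn)(void *ctx);

/* Call send() up to max_attempts times, retrying only while it returns
 * -EIO. Returns the last result, or -EINVAL if max_attempts <= 0. */
static int send_with_retry(send_msg_fn send, void *ctx,
                           int max_attempts, int *attempts_made)
{
    int ret = -EINVAL;
    int n = 0;

    while (n < max_attempts) {
        ret = send(ctx);
        n++;
        if (ret != -EIO)
            break;
    }
    if (attempts_made)
        *attempts_made = n;
    return ret;
}

/* Simulated sender that fails its first fail_first calls. */
struct flaky {
    int calls;
    int fail_first;
};

static int flaky_send(void *ctx)
{
    struct flaky *f = ctx;

    return (f->calls++ < f->fail_first) ? -EIO : 0;
}
```

A sender that fails twice and then succeeds gets through on the third attempt. One that keeps failing still ends with -EIO after the last attempt, so a retry loop like this would only help if the card can actually recover between attempts.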
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #88 from ReddestDream ---
> The question then becomes: Why doesn't the race condition happen with only one
> screen? Perhaps it's a matter of speed. With a single display, the driver
> detect the displays, read/parse the EDID data, initialize in time. But then
> that doesn't explain why the crash still occurs if you boot with one
> DisplayPort monitor and attach another after X is running.

I do suspect it's a matter of speed and complexity when you have more monitors. Also maybe the clock it tries to set (the value of hard_min_level) is different if you only have one monitor and somehow that takes more time (resetting it away from some default).

I do wonder if maybe in:

    "[SetUclkToHightestDpmLevel] Set hard min uclk failed!",
    return ret);

it should return -EINVAL instead. Maybe then it would reset and try again instead of just ignoring it and continuing with initialization anyway, leading to instability.

> One thing I've been trying to work out is the difference between vega20_ppt.c
> and vega20_hwmgr.c is, as they both contain slightly different or identical
> versions of the same functions. It looks like the functions in vega20_hwmgr.c
> take precedence but it's strange to see this duplication and both files are
> worked on in the commit history.

Hmm. That is interesting. I'll take a look.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #87 from Tom B ---
> Could be we've got a race condition between the powerplay setup and amdgpu
> handing off the card to drm_dev_register to advertise it for normal use.

The question then becomes: Why doesn't the race condition happen with only one screen? Perhaps it's a matter of speed. With a single display, the driver can detect the displays, read/parse the EDID data and initialize in time. But then that doesn't explain why the crash still occurs if you boot with one DisplayPort monitor and attach another after X is running.

One thing I've been trying to work out is what the difference between vega20_ppt.c and vega20_hwmgr.c is, as they both contain slightly different or identical versions of the same functions. It looks like the functions in vega20_hwmgr.c take precedence, but it's strange to see this duplication, and both files are worked on in the commit history. Take a look at vega20_set_uclk_to_highest_dpm_level and vega20_apply_clocks_adjust_rules in both for examples.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #86 from ReddestDream ---
> In addition to that, vega20_set_fclk_to_highest_dpm_level is called several
> times before the card is initialized and even on 5.2.7 works. Something
> happens during or just before the initialization stage that stops
> smum_send_msg_to_smc_with_parameter accepting 1001 as a valid value, as it
> does until that point.

Could be we've got a race condition between the powerplay setup and amdgpu handing off the card to drm_dev_register to advertise it for normal use.

drm_dev_register is responsible for the "[drm] Initialized" message:

https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/drm_drv.c#L994

And it seems like amdgpu calls it here:

https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L1054

Odd that it's doing this if powerplay still has more work to do. And that might be why vega20_set_uclk_to_highest_dpm_level fails that last time.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #85 from Tom B ---
> Yeah. I've had boots where I have my 2 4K DP monitors in and I don't get
> powerplay error on boot. In fact, it can go a bit and seem stable.

In addition to that, vega20_set_fclk_to_highest_dpm_level is called several times before the card is initialized and even on 5.2.7 works. Something happens during or just before the initialization stage that stops smum_send_msg_to_smc_with_parameter accepting 1001 as a valid value, as it does until that point.

I think you're right about BACO. It was worth looking at, but I applied a quick hack to ensure it's disabled:

    int vega20_baco_set_state(struct pp_hwmgr *hwmgr, enum BACO_STATE state)
    {
        return 0;
    }

    int vega20_baco_get_capability(struct pp_hwmgr *hwmgr, bool *cap)
    {
        *cap = false;
        return 0;
    }

No difference: I still get the errors and the wrong wattage, so unless BACO is somehow on by default and only turned off in the proper version of this code, we can rule it out.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #84 from ReddestDream ---
> Need to figure out exactly what is generating the line "[drm] Initialized
> amdgpu 3.27.0 20150101 for :44:00.0 on minor 0."

That "Initialized amdgpu" message seems to be coming from here:

https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/drm_drv.c#L994
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #83 from ReddestDream ---
> Here's what I found: The value of hard_min_level is 1001 in both 5.0.13 and
> 5.2.7 so the issue is not the value from the dpm table. The dpm table is
> probably correct.

Fantastic! Glad you tested this. I had suspected the hard_min_level was bogus and that's why it was failing: the card was rejecting the bogus value. Glad to know that's not the case.

> However, what is interesting is that it doesn't always fail.

Yeah. I've had boots where I have my 2 4K DP monitors in and I don't get powerplay error on boot. In fact, it can go a bit and seem stable. But then the powerplay errors suddenly (not related to some high load on the card) start showing up again and the graphics become unstable. Similarly, others have reported that on hotplugging a second monitor after boot, the powerplay errors will start showing up. So, maybe there is a timing problem involved with sending the message. It's generally a question of when rather than if it's going to fail.

> 1. vega20_set_fclk_to_highest_dpm_level is called twice between the "ring
> vce2" line and "Initialized"

Is it always called twice? Even on 5.2.7? Because it looks like it might get called two times right before "Initialized" on 5.0.13 but then only once on 5.2.7 before "Initialized" kicks in. Maybe "Initialized" is interrupting on 5.2.7 but not on 5.0.13. It's possible that initialization of the card is messing up values that powerplay needs to read off the card or making the card unavailable for receiving messages or something . . .

> So initialization is happening between (and possibly a result of) sending the
> message and getting the response

Yeah. Something is definitely happening while vega20_set_uclk_to_highest_dpm_level is running . . . Not 100% sure that's really problematic tho . . . But it could be an atomicity issue.

Need to figure out exactly what is generating the line "[drm] Initialized amdgpu 3.27.0 20150101 for :44:00.0 on minor 0." Looks like it's coming from the drm core rather than amdgpu specifically.

> I'm going to see if I can disable/revert BACO entirely to at least rule it
> out.

I thought BACO was reverted for Vega 20 here:

https://github.com/torvalds/linux/commit/7db329e57b90ddebcb58fc88eedbb3082d22a957#diff-8a4d25be8ad5d9c3ff27bb54b678dab2

Your commit seems to have been introduced in 5.2-rc1, not 5.1.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #82 from Tom B ---
In addition, I will note that the file vega20_baco.c was added in 5.1.

details: https://www.phoronix.com/scan.php?page=news_item=AMD-Vega-12-BACO
commit: https://github.com/torvalds/linux/commit/0c5ccf14f50431d0196b96025c878ae9f45676a9#diff-c2d82e6f1326b5b4e0a09c9cb42cbcc2

This seems like quite a large change, and it requires a special "workaround" for Vega 20. Unfortunately, it is part of a large restructure of the driver code, so I cannot just revert that single commit.

I mention this because part of the problem I am seeing is with the wrong wattage. I wonder whether BACO wrongly tries to turn off a part of the card that is required for a secondary monitor and as such puts the card in an invalid state.

I'm going to see if I can disable/revert BACO entirely to at least rule it out.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #81 from Tom B ---
Created attachment 145038
  --> https://bugs.freedesktop.org/attachment.cgi?id=145038=edit
5.2.7 dmesg with hard_min_level logged

As mentioned in the previous post, I started logging the value of hard_min_level. I hadn't realised that vega20_set_uclk_to_highest_dpm_level would be called so many times.

Here's what I found: The value of hard_min_level is 1001 in both 5.0.13 and 5.2.7 so the issue is not the value from the dpm table. The dpm table is probably correct. Something prevents smum_send_msg_to_smc_with_parameter from accepting the value.

However, what is interesting is that it doesn't always fail.

[4.082105] amdgpu: [powerplay] hard_min_level: 1001
[4.372684] [drm] Initialized amdgpu 3.32.0 20150101 for :44:00.0 on minor 0
[4.517204] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[4.517205] amdgpu: [powerplay] [SetUclkToHightestDpmLevel] Set hard min uclk failed!

Each hard_min_level line in the log is from vega20_set_uclk_to_highest_dpm_level and there are multiple calls to it, which don't fail, before the card is initialised.

This is from 5.2.7:

[3.698907] amdgpu :44:00.0: ring vce2 uses VM inv eng 14 on hub 1
[4.082105] amdgpu: [powerplay] hard_min_level: 1001
[4.372684] [drm] Initialized amdgpu 3.32.0 20150101 for :44:00.0 on minor 0
[4.517204] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[4.517205] amdgpu: [powerplay] [SetUclkToHightestDpmLevel] Set hard min uclk failed!
[5.361482] amdgpu: [powerplay] Failed to send message 0x28, response 0x0

And the same from 5.0.13:

[3.352380] amdgpu :44:00.0: ring vce2 uses VM inv eng 14 on hub 1
[3.722422] amdgpu: [powerplay] hard_min_level: 1001
[3.766269] amdgpu: [powerplay] hard_min_level: 1001
[4.029679] [drm] Initialized amdgpu 3.27.0 20150101 for :44:00.0 on minor 0

There are a couple of things here:

1. vega20_set_fclk_to_highest_dpm_level is called twice between the "ring vce2" line and "Initialized"

2. My patched code looks like this:

    pr_err("hard_min_level: %d\n", dpm_table->dpm_state.hard_min_level);
    PP_ASSERT_WITH_CODE(!(ret = smum_send_msg_to_smc_with_parameter(hwmgr,
            PPSMC_MSG_SetHardMinByFreq,
            (PPCLK_UCLK << 16 ) | dpm_table->dpm_state.hard_min_level)),
            "[SetUclkToHightestDpmLevel] Set hard min uclk failed!",
            return ret);

Yet the log shows:

- My debug line
- Initialized amdgpu 3.32.0 20150101 for :44:00.0 on minor 0
- [SetUclkToHightestDpmLevel] Set hard min uclk failed!

So initialization is happening between (and possibly a result of) sending the message and getting the response.
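The ordering argument in comment #81 can be checked mechanically on a saved dmesg. The following is only a minimal sketch: it assumes the log lines look like the excerpts quoted in that comment (timestamp in square brackets, then the message), and the helper names `parse` and `summarize` are made up for illustration. It counts how many hard_min_level debug lines appear before "Initialized amdgpu" and whether the first "Failed to send message" comes after it.

```python
import re

# Matches the dmesg form quoted in this thread, e.g. "[4.082105] message".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def parse(log_text):
    """Return (timestamp, message) tuples for every line that looks like dmesg."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            events.append((float(match.group(1)), match.group(2)))
    return events


def summarize(log_text):
    """Summarize where the SMC failures sit relative to the 'Initialized' line."""
    events = parse(log_text)
    init_ts = next((t for t, msg in events if "Initialized amdgpu" in msg), None)
    sends = [t for t, msg in events if "hard_min_level" in msg]
    fails = [t for t, msg in events if "Failed to send message" in msg]
    return {
        # hard_min_level debug lines logged before the drm core reports init
        "sends_before_init": sum(
            1 for t in sends if init_ts is not None and t < init_ts
        ),
        # True only if there is a failure and the earliest one follows init
        "first_failure_after_init": (
            bool(fails) and init_ts is not None and min(fails) > init_ts
        ),
        "failures": len(fails),
    }


# Excerpts copied from comment #81.
LOG_5_2_7 = """\
[3.698907] amdgpu :44:00.0: ring vce2 uses VM inv eng 14 on hub 1
[4.082105] amdgpu: [powerplay] hard_min_level: 1001
[4.372684] [drm] Initialized amdgpu 3.32.0 20150101 for :44:00.0 on minor 0
[4.517204] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[4.517205] amdgpu: [powerplay] [SetUclkToHightestDpmLevel] Set hard min uclk failed!
[5.361482] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
"""

LOG_5_0_13 = """\
[3.352380] amdgpu :44:00.0: ring vce2 uses VM inv eng 14 on hub 1
[3.722422] amdgpu: [powerplay] hard_min_level: 1001
[3.766269] amdgpu: [powerplay] hard_min_level: 1001
[4.029679] [drm] Initialized amdgpu 3.27.0 20150101 for :44:00.0 on minor 0
"""

if __name__ == "__main__":
    print("5.2.7 :", summarize(LOG_5_2_7))
    print("5.0.13:", summarize(LOG_5_0_13))
```

On these two excerpts the sketch reports one debug line before "Initialized" on 5.2.7 and two on 5.0.13, which matches the observation in the comment. It says nothing about why the ordering differs.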
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #80 from Tom B ---
> I tried something like that before but a huge portion of the commits in that
> range won't build kernels that can boot (at least on my system). I ended up
> resorting to trying reverting individual vega20-affecting commits out of
> 5.1. See my results far above in the thread (though someone else willing to
> spend more time doing a deeper analysis of the code could probably take my
> approach much further).

That's why my focus has been finding places in the code where something different happens based on the number of displays. Though this may be a futile avenue of exploration, as it could just be an issue of additional memory bandwidth requirements, or even something that should be done differently with 2 displays but isn't.

> It does make me wonder if it's worth testing like 2 simple 1080p 60 Hz
> displays. Maybe that wouldn't trigger this issue. Not that that would really
> be of use to me. But it might help distinguish between just monitor detect
> generally being broken and "high monitor load" being broken . . .

This would be an interesting test but I think 1080p 60 Hz monitors with DisplayPort are fairly uncommon and I don't have any to test with. My guess is that anyone with a Radeon VII, a high-end card with 16 GB VRAM, is likely to have a high-end display, which could equally explain why there are no reports here of people running 1080p 60 Hz displays.

My next test is going to be logging dpm_table->dpm_state.hard_min_level on line 3354 (just before it's sent to the smc) on both 5.0.13 and 5.2.7 to see if the same hard_min_level value is sent to the smc on both kernels. This will at least let us know whether something is incorrectly setting hard_min_level or something is preventing the smc from accepting the value. My hunch from my previous tests is that it's the latter but I'll try it and report back.

I know nothing about driver development so I have no idea how this stuff should work; I can only compare the differences between 5.0.13 and later kernels.

Anyway, thanks everyone for your input. Any information, even on things that you tried and didn't work, is valuable as it can help us narrow down the problem.
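For reference when reading the logged values: per the patched snippet in comment #81, the parameter handed to smum_send_msg_to_smc_with_parameter is `(PPCLK_UCLK << 16) | hard_min_level`, i.e. the clock ID in the upper bits and the level in the lower 16. Below is a minimal sketch of that packing. The numeric value used for `PPCLK_UCLK` is a placeholder (the real one comes from the driver headers and is not stated in this thread), the 16-bit range check is an assumption implied by the shift, and both function names are hypothetical.

```python
# Placeholder only: the real value of PPCLK_UCLK is defined in the driver's
# SMU interface header and is not given anywhere in this thread.
PPCLK_UCLK = 5


def pack_hard_min(clock_id, hard_min_level):
    """Pack a clock ID and a level the way the quoted snippet does:
    (clock_id << 16) | hard_min_level."""
    # Assumption: the level must fit in the low 16 bits, otherwise it would
    # overlap the clock ID after the shift.
    if not 0 <= hard_min_level <= 0xFFFF:
        raise ValueError("hard_min_level does not fit in 16 bits")
    return (clock_id << 16) | hard_min_level


def unpack_hard_min(param):
    """Split a packed parameter back into (clock_id, hard_min_level)."""
    return param >> 16, param & 0xFFFF


if __name__ == "__main__":
    # 1001 is the hard_min_level logged on both 5.0.13 and 5.2.7.
    param = pack_hard_min(PPCLK_UCLK, 1001)
    print(hex(param), unpack_hard_min(param))
```

Since 1001 fits comfortably in 16 bits, the packing itself cannot corrupt the value, which is consistent with the hunch that the smc is rejecting a correct request rather than receiving a wrong one.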
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #79 from ReddestDream ---
> I tried something like that before but a huge portion of the commits in that
> range won't build kernels that can boot (at least on my system).

It's interesting that you found d1a3e239a6016f2bb42a91696056e223982e8538 to improve the issue:

https://github.com/torvalds/linux/commit/d1a3e239a6016f2bb42a91696056e223982e8538#diff-0bc07842bc28283d64ffa6dd2ed716de

From Tom B.'s and my review of the code, it seems very likely that somehow a failure to set a hard minimum properly is at the heart of the issue.

> This brings me to the second thing: When looking through the commits, I
> noticed that there were multiple commits that claim to prevent or reduce
> crashing in high-resolution situations (one references 5k displays, another
> references 3+ 4k displays).

Yeah. I have 2 4K displays as well. But I don't think it should really be straining the card. These commits are probably overzealous for Radeon VII. Rather, it could be that at least part of the issue, especially the excessive power draw at idle, is just due to these commits artificially setting minimums very high. In fact, that could be why it's stable at all with just one monitor, since the code to set the minimums up is only being triggered when there are more monitors connected.

I'd suspect a boot-time configuration issue too, but others have reported instability even when the monitors are hotplugged later on. So, it seems like maybe the monitor detect might at least partially be okay, but the follow-through with raising the clock minimums is broken. I suspect the issue is in the code calculating the minimum to set, so the driver gets stuck trying to send incomplete/incorrect values to the card.

https://bbs.archlinux.org/viewtopic.php?id=247733

It does make me wonder if it's worth testing like 2 simple 1080p 60 Hz displays. Maybe that wouldn't trigger this issue. Not that that would really be of use to me. But it might help distinguish between just monitor detect generally being broken and "high monitor load" being broken . . .
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #78 from Chris Hodapp ---
> I don't see anywhere else to go but bisection from 5.0.13 to 5.1. That should
> at least find something . . .

I tried something like that before but a huge portion of the commits in that range won't build kernels that can boot (at least on my system). I ended up resorting to reverting individual vega20-affecting commits out of 5.1. See my results far above in the thread (though someone else willing to spend more time doing a deeper analysis of the code could probably take my approach much further).
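Commits that don't produce a bootable kernel do not have to end a bisection: git has `git bisect skip` for revisions that cannot be tested. Below is a minimal sketch of how each bisection step could be classified, assuming one saved dmesg per boot attempt and using the "Failed to send message" line quoted elsewhere in this thread as the failure marker. The function names and the "several boots per step" convention are made up for illustration, not an established workflow.

```python
# Failure marker taken from the dmesg lines quoted in this thread.
FAILURE_MARKER = "[powerplay] Failed to send message"


def bisect_verdict(boot_logs):
    """Classify one bisection step.

    boot_logs holds one entry per boot attempt of the kernel under test:
    the dmesg text if it booted, or None if it did not boot at all.
    """
    usable = [log for log in boot_logs if log is not None]
    if not usable:
        # Nothing booted, so the commit cannot be judged either way.
        return "skip"
    if any(FAILURE_MARKER in log for log in usable):
        # A single failing boot is enough to call the commit bad.
        return "bad"
    return "good"


def bisect_command(boot_logs):
    """Return the git command matching the verdict for this step."""
    return "git bisect " + bisect_verdict(boot_logs)


if __name__ == "__main__":
    failing = "[4.517204] amdgpu: [powerplay] Failed to send message 0x28, response 0x0"
    clean = "[4.029679] [drm] Initialized amdgpu 3.27.0 20150101 for :44:00.0 on minor 0"
    print(bisect_command([None, None]))     # kernel never booted
    print(bisect_command([clean, failing]))
    print(bisect_command([clean, clean]))
```

Because the errors are intermittent, a "good" verdict from a handful of clean boots is weak evidence, so it is probably worth booting each candidate several times before marking it good.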
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #77 from ReddestDream ---
> I guess, you are good for a bisection if you have a "working" kernel.

The thing is, based on everything here, I'm not convinced that 5.0.13 has 0 issues. Only that it seems to have fewer issues.

But yeah. I don't see anywhere else to go but bisection from 5.0.13 to 5.1. That should at least find something . . .
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #76 from Sylvain BERTRAND ---
> Unfortunately, it does look like going through and slowly disabling features
> and/or bisecting might be the only way to find how this issue got started. At
> least if we could narrow it down, we might be in better shape. :/

I guess you are good for a bisection if you have a "working" kernel.
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #75 from ReddestDream ---
> Here's some additional investigation.
> [SetUclkToHightestDpmLevel] Set hard min uclk failed! Appears as one of the
> first errors in dmesg. This is from vega20_hwmgr.c:3354 and triggered by:

I agree that [SetUclkToHightestDpmLevel] is probably the key to all this as it always seems to be the first thing that fails after dysregulation occurs. The "Failed to send message 0x28, response 0x0" errors show that the driver is sending wrong or at least wrongly timed commands to the GPU that eventually cascade into complete failure.

> Again, it didn't help. I will note that this code is identical in 5.0.13

I have also been unable to find changed code since 5.0 that could be directly connected to display detect/init/enumeration issues on Radeon VII/Vega 20. This is why I've come to suspect the error is triggered indirectly, in a way that will probably not be obvious, and by code that was likely flawed from the beginning of Radeon VII/Vega 20 support.

This is also why I was hopeful that 5.3-rc2 would fix this issue since it has commits that do seem to affect display detection on AMD GPUs. Alas, it did not. :(

> If the GPU did not crash with dpm disabled as a whole, the proper way to
> proceed would be to start from there and step by step add dpm features and
> see when it starts crashing. It's not a small task since dpm code paths may
> be scattered all over the code.

Unfortunately, it does look like going through and slowly disabling features and/or bisecting might be the only way to find how this issue got started. At least if we could narrow it down, we might be in better shape. :/

I must admit I don't have much experience with graphics drivers, and when I tell other people about this issue, they immediately want to blame X or Mesa until I explain that I can get these errors w/o starting any graphics at all. lol.

In any case, I really appreciate your testing, Tom B. And any advice you might have on debugging, Sylvain BERTRAND, is greatly appreciated. :)
[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #74 from Sylvain BERTRAND ---
Forcing the memory clock and voltage is not enough: the dc[en]x memory requests should also be given the highest priority in the arbiter block. I don't recall how that interacts with the dc[en]x watermarks, but they should be "disabled" or "maxed out". Basically, whatever the 3D/compute/(vcn|vce/uvd) load, the dc[en]x should always come first (due to the realtime nature of display data transmission to monitors).

Oh, and of course, the smu/smc should not manage the dc[en]x. Very probably, there are some smc/smu commands to do that.

If the GPU did not crash with dpm disabled as a whole, the proper way to proceed would be to start from there and step by step add dpm features and see when it starts crashing. It's not a small task since dpm code paths may be scattered all over the code.