[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-11-26 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #172 from line...@xcpp.org ---
I had dpm=2 as a module option. GPU initialization failure does not occur
without dpm=2
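
For anyone comparing setups, the dpm value the loaded driver ended up with can be read back from sysfs. This is only a minimal sketch: it assumes the parameters of the loaded module are exposed under /sys/module/amdgpu/parameters, and the helper name is made up for illustration.

```python
from pathlib import Path

# Assumed sysfs location; it only exists while the amdgpu module is loaded.
AMDGPU_PARAMS = Path("/sys/module/amdgpu/parameters")


def read_module_param(name, params_dir=AMDGPU_PARAMS):
    """Return the parameter value as a string, or None if it is not exposed."""
    try:
        return (Path(params_dir) / name).read_text().strip()
    except OSError:
        # Module not loaded, parameter not exported, or no permission.
        return None


if __name__ == "__main__":
    print("amdgpu.dpm =", read_module_param("dpm"))
```

Attaching the printed value to a report makes it easier to tell which comments were written with a non-default dpm setting.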

-- 
You are receiving this mail because:
You are the assignee for the bug.
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-11-26 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

Alex Deucher  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #146026|text/x-log                  |text/plain
          mime type|                            |

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-11-26 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #171 from line...@xcpp.org ---
Created attachment 146026
  --> https://bugs.freedesktop.org/attachment.cgi?id=146026&action=edit
5.4.0-arch1-1 GPU initialization fails

With kernel version 5.4.0-arch1-1 the GPU can flat out no longer be
initialized.

My system is now completely unusable with the current kernel.

Does this specifically mean anything?
[   15.575361] amdgpu: [powerplay] smu driver if version = 0x0013, smu fw
if version = 0x0012, smu fw version = 0x00282d00 (40.45.0)
[   15.575362] amdgpu: [powerplay] SMU driver if version not matched
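
Read literally, the two lines say the driver expects SMU interface version 0x13 while the firmware on the card reports 0x12. A small sketch for pulling those numbers out of a dmesg excerpt follows. The function name is hypothetical, and the byte layout used to decode the firmware version is an assumption that happens to reproduce the "(40.45.0)" printed in the log.

```python
import re

# Matches the powerplay version line once wrapped lines are joined with spaces.
SMU_LINE = re.compile(
    r"smu driver if version = (0x[0-9a-fA-F]+), "
    r"smu fw if version = (0x[0-9a-fA-F]+), "
    r"smu fw version = (0x[0-9a-fA-F]+)"
)


def parse_smu_versions(text):
    """Return the interface versions and decoded firmware version, or None."""
    flat = " ".join(text.split())  # undo mail line wrapping
    m = SMU_LINE.search(flat)
    if m is None:
        return None
    driver_if, fw_if, fw = (int(g, 16) for g in m.groups())
    # Assumed layout: one byte each for major, minor, patch.
    version = ((fw >> 16) & 0xFF, (fw >> 8) & 0xFF, fw & 0xFF)
    return {
        "driver_if": driver_if,
        "fw_if": fw_if,
        "fw_version": version,
        "matched": driver_if == fw_if,
    }
```

Run against the excerpt above, this reports driver interface 0x13, firmware interface 0x12 and a mismatch, which is exactly what the second log line states.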

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-11-10 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #170 from Peter Hercek  ---
Maybe this helps since there is a stack trace. The GUI stopped responding, so I
shut the machine down over ssh. The kernel crashed during the shutdown on
5.3.6-arch1-1-ARCH even with amdgpu.dpm=0, which is the option that is supposed
to work. This kernel has both the patch and amdgpu.dpm=0.

Nov 04 17:38:58 phnm kernel: ------------[ cut here ]------------
Nov 04 17:38:58 phnm kernel: WARNING: CPU: 6 PID: 640 at
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:5804
amdgpu_dm_atomic_commit_tail.cold+0x82/0xed [amdgpu]
Nov 04 17:38:58 phnm kernel: Modules linked in: fuse xt_CHECKSUM xt_MASQUERADE
xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat
iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
libcrc32c ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter tun
bridge cfg80211 rfkill 8021q garp mrp stp llc intel_rapl_msr intel_rapl_common
amdgpu x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel hid_microsoft
radeon mousedev input_leds joydev ff_memless kvm gpu_sched
snd_hda_codec_realtek snd_hda_codec_generic i2c_algo_bit irqbypass
ledtrig_audio ttm crct10dif_pclmul snd_hda_intel crc32_pclmul hid_generic
ghash_clmulni_intel cdc_acm drm_kms_helper snd_hda_codec aesni_intel usbhid
iTCO_wdt iTCO_vendor_support snd_hda_core wmi_bmof aes_x86_64 hid crypto_simd
cryptd mxm_wmi snd_hwdep glue_helper drm intel_cstate snd_pcm agpgart r8169
syscopyarea intel_uncore sysfillrect realtek sysimgblt snd_timer pcspkr
i2c_i801 fb_sys_fops e1000e intel_rapl_perf
Nov 04 17:38:58 phnm kernel:  mei_me snd libphy mei soundcore lpc_ich wmi evdev
mac_hid sg ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2
crc32c_intel firewire_ohci xhci_pci xhci_hcd firewire_core ehci_pci crc_itu_t
ehci_hcd sr_mod cdrom sd_mod ahci libahci libata scsi_mod
Nov 04 17:38:58 phnm kernel: CPU: 6 PID: 640 Comm: Xorg Not tainted
5.3.6-arch1-1-ARCH #1
Nov 04 17:38:58 phnm kernel: Hardware name: System manufacturer System Product
Name/P9X79, BIOS 4502 10/15/2013
Nov 04 17:38:58 phnm kernel: RIP:
0010:amdgpu_dm_atomic_commit_tail.cold+0x82/0xed [amdgpu]
Nov 04 17:38:58 phnm kernel: Code: c7 c7 08 1e db c0 e8 0f 59 a0 db 0f 0b 41 83
7c 24 08 00 0f 85 92 ff f1 ff e9 ad ff f1 ff 48 c7 c7 08 1e db c0 e8 f0 58 a0
db <0f> 0b e9 32 f5 f1 ff 48 8b 85 00 fd ff ff 4c 89 f2 48 c7 c6 0d 0f
Nov 04 17:38:58 phnm kernel: RSP: 0018:a98c410475a0 EFLAGS: 00010046
Nov 04 17:38:58 phnm kernel: RAX: 0024 RBX: 894125e06000 RCX:

Nov 04 17:38:58 phnm kernel: RDX:  RSI: 0003 RDI:

Nov 04 17:38:58 phnm kernel: RBP: a98c410478c0 R08: 16b622fb648e R09:
9deb3254
Nov 04 17:38:58 phnm kernel: R10: 0616 R11: 0001d890 R12:
0286
Nov 04 17:38:58 phnm kernel: R13: 8940f30b0400 R14: 894129c2 R15:
894075ba6a00
Nov 04 17:38:58 phnm kernel: FS:  7fbf9c35c500()
GS:89413fb8() knlGS:
Nov 04 17:38:58 phnm kernel: CS:  0010 DS:  ES:  CR0: 80050033
Nov 04 17:38:58 phnm kernel: CR2: 559991d31420 CR3: 00082a644002 CR4:
000606e0
Nov 04 17:38:58 phnm kernel: Call Trace:
Nov 04 17:38:58 phnm kernel:  ? commit_tail+0x3c/0x70 [drm_kms_helper]
Nov 04 17:38:58 phnm kernel:  commit_tail+0x3c/0x70 [drm_kms_helper]
Nov 04 17:38:58 phnm kernel:  drm_atomic_helper_commit+0x108/0x110
[drm_kms_helper]
Nov 04 17:38:58 phnm kernel:  drm_client_modeset_commit_atomic+0x1e8/0x200
[drm]
Nov 04 17:38:58 phnm kernel:  drm_client_modeset_commit_force+0x50/0x150 [drm]
Nov 04 17:38:58 phnm kernel:  drm_fb_helper_pan_display+0xc2/0x200
[drm_kms_helper]
Nov 04 17:38:58 phnm kernel:  fb_pan_display+0x83/0x100
Nov 04 17:38:58 phnm kernel:  fb_set_var+0x1e8/0x3d0
Nov 04 17:38:58 phnm kernel:  fbcon_blank+0x1dd/0x290
Nov 04 17:38:58 phnm kernel:  do_unblank_screen+0x98/0x130
Nov 04 17:38:58 phnm kernel:  vt_ioctl+0xeff/0x1290
Nov 04 17:38:58 phnm kernel:  tty_ioctl+0x37b/0x900
Nov 04 17:38:58 phnm kernel:  ? preempt_count_add+0x68/0xa0
Nov 04 17:38:58 phnm kernel:  do_vfs_ioctl+0x43d/0x6c0
Nov 04 17:38:58 phnm kernel:  ? syscall_trace_enter+0x1f2/0x2e0
Nov 04 17:38:58 phnm kernel:  ksys_ioctl+0x5e/0x90
Nov 04 17:38:58 phnm kernel:  __x64_sys_ioctl+0x16/0x20
Nov 04 17:38:58 phnm kernel:  do_syscall_64+0x5f/0x1c0
Nov 04 17:38:58 phnm kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov 04 17:38:58 phnm kernel: RIP: 0033:0x7fbf9d7b425b
Nov 04 17:38:58 phnm kernel: Code: 0f 1e fa 48 8b 05 25 9c 0c 00 64 c7 00 26 00
00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 9b 0c 00 f7 d8 64 89 01 48
Nov 04 17:38:58 phnm kernel: RSP: 002b:7ffe21162798 EFLAGS: 0246
ORIG_RAX: 0010
Nov 04 17:38:58 phnm kernel: RAX: ffda RBX: 55d93ebf5180 RCX:
7fbf9d7b425b
Nov 04 17:38:58 

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-11-10 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #169 from picar...@live.de ---
I am using a Radeon VII with Arch Linux, a 1440p144hz and a 4K60Hz monitor, and
I had similar crashes to the others here if I tried running the 1440p144hz
monitor at 144hz, at 60hz it was stable. This behavior stayed all the way from
kernel 5.0 up to 5.3, and only stopped when I started using kernel 5.4.0
(5.4.0-rc6-mainline right now). Now I can run it at 144hz without crashes.

The driver still isn't working that well, as games seem very stuttery, but at
least it doesn't crash anymore.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-10-21 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #168 from line...@xcpp.org ---
Created attachment 145784
  --> https://bugs.freedesktop.org/attachment.cgi?id=145784&action=edit
5.3.7: Fence fallback timer expired on ring 

Here is a freeze which went a bit differently. 
This time the system is frozen without any blinking and there are tons of
messages like:

[ 2940.919451] [drm] Fence fallback timer expired on ring page1

This is on 5.3.7-arch1-1

(Also I'm using only one single monitor connected through DP, as opposed to the
others)
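
When there are "tons" of these messages, a per-ring count is easier to compare between reports than the raw log. A minimal sketch, assuming the message format shown above; the function name is only illustrative.

```python
import re
from collections import Counter

FENCE_RE = re.compile(r"\[drm\] Fence fallback timer expired on ring (\S+)")


def count_fence_fallbacks(lines):
    """Count 'Fence fallback timer expired' messages per ring name."""
    counts = Counter()
    for line in lines:
        m = FENCE_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

It can be fed the lines of a saved dmesg file, for example `count_fence_fallbacks(open("dmesg.txt"))`, where the file name is just a placeholder.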

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-10-20 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #167 from Alex Deucher  ---
(In reply to Peter Hercek from comment #166)
> I got the crash after 4 days of use. It looks the same as before:
> ring sdma0 timeout, gpu reset (allegedly successful), many skipped IBs, and
> failure to initialize parser for ever.

The parser error just means you need to restart your desktop environment.  At
the moment no desktop managers properly handle GPU resets (recreate their
context and buffers) so you need to restart your desktop to get it back.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-10-19 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #166 from Peter Hercek  ---
I tried 5.3.6-arch1-1 on Arch Linux with 3 DP monitors. It should contain the
patch, based on the comment from line...@xcpp.org.

I got the crash after 4 days of use. It looks the same as before:
ring sdma0 timeout, gpu reset (allegedly successful), many skipped IBs, and
failure to initialize parser for ever.

The situation looked like this from my experience: with each new kernel the
error got worse and worse; 5.3.6 improved it a lot, but it is still not fixed.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-10-14 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #165 from Tom B  ---
I just tried 5.3.5 (which is the latest in the arch repo) and it's working fine
for me.

I do have an issue on Wayland. If the screen turns off, Wayland crashes and I
have to hard reset. The log shows 

Oct 14 17:48:56 desktop kernel: amdgpu: [powerplay] [SetHardMinFreq] Set hard
min uclk failed!
Oct 14 17:49:02 desktop kernel: amdgpu: [powerplay] Failed to send message
0x26, response 0x0
Oct 14 17:49:02 desktop kernel: amdgpu: [powerplay] Failed to set soft min
gfxclk !
Oct 14 17:49:02 desktop kernel: amdgpu: [powerplay] Failed to upload DPM Bootup
Levels!


But, this also shows on boot so I'm not sure it's a problem and it seems to be
wayland that segfaults, not an issue with amdgpu. 

I do still get `kernel: [drm] schedsdma0 is not ready, skipping` repeating
forever in my journal.
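
Since the same powerplay failures show up both at boot and around the crash, it may help to list which SMU messages fail in each case. A rough sketch, assuming the "Failed to send message" format quoted above; the helper name is hypothetical.

```python
import re

MSG_RE = re.compile(
    r"\[powerplay\] Failed to send message (0x[0-9a-fA-F]+), "
    r"response (0x[0-9a-fA-F]+)"
)


def failed_smu_messages(log_text):
    """Return (message_id, response) pairs for every failed SMU message."""
    flat = " ".join(log_text.split())  # rejoin lines wrapped by the mailer
    return [(int(msg, 16), int(resp, 16)) for msg, resp in MSG_RE.findall(flat)]
```

Comparing the list from a clean boot with the list from a crashing session shows whether the crash adds any message IDs that the boot log does not already contain.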

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-10-14 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #164 from line...@xcpp.org ---
(In reply to Tom B from comment #163)
> Gargoyle, linedot, can you confirm whether this crash is with both patches
> applied?
> 
> I'm still on 5.3.1 patched and haven't had a single crash.

For 5.3.1 I've built the kernel with the arch build system and manually added
lines to apply the two patches to PKGBUILD and also have seen them being
applied in the log.

For 5.3.6 I've checked that the patches are already applied.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-10-14 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #163 from Tom B  ---
Gargoyle, linedot, can you confirm whether this crash is with both patches
applied?

I'm still on 5.3.1 patched and haven't had a single crash.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-10-14 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #162 from line...@xcpp.org ---
Created attachment 145730
  --> https://bugs.freedesktop.org/attachment.cgi?id=145730&action=edit
Freeze/Black screen/Crash on 5.3.6

Apologies, I have been on vacation and thus away from my main System.

Attached is the dmesg log of another crash with kernel version 5.3.6. Here is a
description of what the crash looked like:
1) Successfully booted up to login manager
2) Logged into a graphical session
3) Shortly after, the screen freezes
4) Screen flashes to black (~5-10 sec)
5) Screen flashes back to the frozen desktop (~5-10 sec)
6) Screen goes black (not off), no response to input, switching to tty doesn't
work. I was able to ssh into the machine from a laptop and get the dmesg
output.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-10-12 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #161 from Gargoyle  ---
Hi there. I've been trying to solve some lockups and pauses with my system and
have just read this entire thread. 

The good news is that I am another Radeon VII owner having the same problems
and I am willing to do whatever I can to help.

My current situation is:-

- I'm running dual 2560x1440@60Hz via display port.

- I am running the beta of Ubuntu 19.10 (Linux ryzen1910 5.3.0-18-generic
#19-Ubuntu SMP Tue Oct 8 20:14:06 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux).

- I don't push the R:VII at all under Linux. I boot into Windows 10 to play
games.

- I have disabled IOMMU in BIOS/EFI. With IOMMU enabled things are MUCH worse.

- My system is mostly stable. If the displays blank, sometimes after waking
them I get the 15-30 second freeze. Then the "amdgpu [powerplay] Failed..."
messages and then everything continues ok. I can semi-reliably recreate this by
using the "xset dpms force off" command someone posted earlier. I've not
managed to find any kind of pattern yet, but 8 out of 10 times running that
command and then waking the system with a keypress/mouse click will cause the
freeze.

- I use X11 and not Wayland. Not sure that is significant, but with Ubuntu
19.10 it seems Wayland is started temporarily and then stopped during boot /
starting gdm. If I enable IOMMU my GDM login screen will be completely corrupt.
However, if I press enter (to select my user) and enter my password, my X11
gnome session starts. Although there are LOTS of pauses and warnings and errors
all over the place in "journalctl -f".
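
The "xset dpms force off" reproduction above could be turned into a repeatable loop, so that the "8 out of 10" figure can be compared across kernels. This is only a sketch: `froze` is a hypothetical callback that you would have to supply yourself (for example by checking dmesg for new powerplay failures), and the pause length is arbitrary.

```python
import subprocess
import time


def blank_displays(run=subprocess.run):
    """Force the displays off via xset; raises if xset is missing or fails."""
    run(["xset", "dpms", "force", "off"], check=True)


def repro_loop(attempts, froze, blank=blank_displays, wait=time.sleep, pause=10.0):
    """Blank the displays `attempts` times and count observed freezes.

    `froze` is a caller-supplied function returning True when a freeze was
    seen after the displays were woken up again.
    """
    freezes = 0
    for _ in range(attempts):
        blank()
        wait(pause)  # time to wake the displays with a keypress or click
        if froze():
            freezes += 1
    return freezes
```

The blank, wait and freeze-detection steps are passed in as parameters so the loop can be tried without an X session.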

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-10-10 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #160 from ReddestDream  ---
Well, today I had a hard freeze using more than one display with Radeon VII.
Back to Radeon VII + iGPU . . . :(

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-10-06 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #159 from ReddestDream  ---
Oh. Also,

cat /sys/kernel/debug/dri/0/amdgpu_pm_info

Now seems to work on 5.3.4 with more than one monitor in. It doesn't report
nonsense values like 0 watts like it did before. :)

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-10-06 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #158 from ReddestDream  ---
More good news. It seems that 5.3.4 does work for me and doesn't (at least
immediately since I'm typing this from there right now) fall apart into a
glitchy mess.

I'm still not really sure of the complete stability of things tho because we do
still see our old friend: "amdgpu: [powerplay] Failed to send message 0x28,
response 0x0, amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!"
in dmesg. So, AFAICT, there's still something wrong. It's just more stable than
it was before.

But yeah. This is the first time since I've gotten this card that I've been
able to boot to a DE w/o crashing and w/o disabling dpm. :)

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-10-06 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #157 from ReddestDream  ---
@Tom B. Well, some good news. Kernel 5.3.4 should have the patches for Radeon
VII included now. I'll do some more tests on that ...

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-10-06 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #156 from Tom B  ---
This is strange because with a patched 5.3.1, I have perfect stability. An
uptime of over a week and no issues. Are you saying that the issue comes back
in 5.4? Hopefully not as Linux 5.4 + Mesa 19.3 looks to have a nice performance
bump on the VII. 

With the patches, do you see the card boosting correctly? Do the wattage,
voltage and clocks change under load? Asking an obvious question here, but is
the crash temperature related? Maybe the patches increase power and overheat.
If so, it might explain why I'm not affected as my card is water cooled.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-10-04 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #155 from ReddestDream  ---
So, I've done some tests with 5.4-rc1 and it seems like I'm getting similar
results to line...@xcpp.org and sehell...@gmail.com. I'm using GNOME with
Wayland (which works fine with only 1 display). Sometimes it works for a while.
Sometimes I can't see the mouse cursor. Sometimes I get glitches all over the
screen containing pieces and parts of previous framebuffers. But, I mean, it's
better than 5.3 was, which was so bad I never could see anything and I would
get stuck on blackscreen. At least on 5.4-rc1 I've been able to manually switch
to a virtual console and reboot rather than force a reboot with the power
button.

Still hoping for some fix for this, but it's become less important to me as
further improvements to GNOME and MESA have made the Radeon VII + iGPU setup
I've been using run smoother. I've also discovered further issues on Windows
regarding the high memory clock when using multiple monitors with Radeon VII,
and it's been affecting performance there too. I'm considering just sticking
with 1 monitor only for this machine/card. lol

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-10-03 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #154 from line...@xcpp.org ---
Created attachment 145623
  --> https://bugs.freedesktop.org/attachment.cgi?id=145623&action=edit
5.4.0-rc1 hangup

dmesg with 5.4.0-rc1.

System freezes and becomes unresponsive to input like before

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-10-01 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #153 from ReddestDream  ---
Just FYI, it appears that kernel 5.3.2 does not have the Vega 20 fix commits
that Alex Deucher mentioned.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-30 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #152 from ReddestDream  ---
Kernel 5.4-rc1, the first kernel version that includes the Vega 20 patches
noted by Alex Deucher, is now out and in linux-mainline on Arch Linux AUR. :)

I plan to do some testing of this version over the next few days, and it might
be worth it for people who are still having issues to confirm on this version
as well. Thanks!

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-29 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #151 from line...@xcpp.org ---
Created attachment 145583
  --> https://bugs.freedesktop.org/attachment.cgi?id=145583&action=edit
5.3.1 patched, xorg crash

And here is a dmesg of just an X session crashing

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-29 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

line...@xcpp.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #145581|0                           |1
        is obsolete|                            |

--- Comment #150 from line...@xcpp.org ---
Created attachment 145582
  --> https://bugs.freedesktop.org/attachment.cgi?id=145582&action=edit
5.3.1 patched, wayland crash

Sorry, the file got messed up, here is the wayland crash

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-29 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

line...@xcpp.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |line...@xcpp.org

--- Comment #149 from line...@xcpp.org ---
Created attachment 145581
  --> https://bugs.freedesktop.org/attachment.cgi?id=145581&action=edit
5.3.1 plus Alex's patches, kde wayland crash, then kde xorg crash

This issue is not fixed for me with Alex's patches.

I use only a single monitor via DP. Running a patched 5.3.1 kernel. Attached is
a dmesg log: First a wayland KDE session crashes, I kill all user processes and
restart sddm and start a KDE Xorg session, which later also crashes.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-27 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

Anthony Rabbito  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #148 from Anthony Rabbito  ---
Everyone's contribution is very much appreciated ! I can finally go back to
using my workstation. Alex, thank you

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-27 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #147 from ReddestDream  ---
> Already merged to 5.4.  I'll take a look at older kernels as well.

@Alex Deucher Thanks so much for all your help! :)

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-27 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #146 from Alex Deucher  ---
(In reply to tom91136 from comment #145)
> @Alex any plans for the patches to be merged for 5.4 or even backported to
> 5.3 at some point?

Already merged to 5.4.  I'll take a look at older kernels as well.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-24 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #145 from tom91...@gmail.com ---
@Alex any plans for the patches to be merged for 5.4 or even backported to 5.3
at some point?

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-23 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #144 from sehell...@gmail.com ---
I also think this is strange. Since yesterday, the monitors have turned off and
on many times without any problems. Most likely it's connected with something
else, but I don't know where to look.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-23 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #143 from Tom B  ---
I'm not sure how KDE handles monitor power behind the scenes but I have an
uptime of 2 days now since applying the patches and with KDE I've let it turn
off the monitors at least 6 or 7 times and suspend/resume 3 times without
issue.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-23 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #142 from sehell...@gmail.com ---
(In reply to Alex Deucher from comment #141)
> (In reply to sehellion from comment #140)
> > Created attachment 145463 [details]
> > 5.3.1 with Alex's patches and dual monitors, crash
> 
> That's not a crash, it's just a warning.

But the system hangs afterwards. Today it happened twice. When I try to resume
work, the monitors turn on, then the secondary shows that there is no signal,
and the primary shows a black screen. But perhaps this is not related to this
bug. I can connect via ssh and collect logs when this happens, if necessary.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-23 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #141 from Alex Deucher  ---
(In reply to sehellion from comment #140)
> Created attachment 145463 [details]
> 5.3.1 with Alex's patches and dual monitors, crash

That's not a crash, it's just a warning.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-23 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

Alex Deucher  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #145463|text/x-log                  |text/plain
          mime type|                            |

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-22 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

sehell...@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #145461|0                           |1
        is obsolete|                            |

--- Comment #140 from sehell...@gmail.com ---
Created attachment 145463
  --> https://bugs.freedesktop.org/attachment.cgi?id=145463&action=edit
5.3.1 with Alex's patches and dual monitors, crash

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-22 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #139 from sehell...@gmail.com ---
Today, when trying to wake up the monitors, the system crashed again. 

WARNING: CPU: 4 PID: 32 at
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc_link_dp.c:1720
decide_link_settings+0xe0/0x2a0 [amdgpu]

The full dmesg log has been updated.
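
To compare WARNING sites between reports, it can be handy to reduce a line like the one above to file, line and function. A minimal sketch, assuming the usual "WARNING: CPU: ... PID: ... at file:line func+off/len [module]" shape seen in this thread; the function name is made up.

```python
import posixpath
import re

WARN_RE = re.compile(
    r"WARNING: CPU: (\d+) PID: (\d+) at (\S+):(\d+) "
    r"(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?"
)


def parse_warning(text):
    """Extract the location of a kernel WARNING, or return None."""
    flat = " ".join(text.split())  # rejoin lines wrapped by the mailer
    m = WARN_RE.search(flat)
    if m is None:
        return None
    cpu, pid, path, line, func, module = m.groups()
    return {
        "cpu": int(cpu),
        "pid": int(pid),
        "file": posixpath.normpath(path),  # collapses the "amdgpu/../" part
        "line": int(line),
        "function": func,
        "module": module,
    }
```

For the warning quoted above this yields drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c, line 1720, function decide_link_settings in module amdgpu.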

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-22 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #138 from sehell...@gmail.com ---
Created attachment 145461
  --> https://bugs.freedesktop.org/attachment.cgi?id=145461&action=edit
5.3.1 with Alex's patches and dual monitors

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-22 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #137 from sehell...@gmail.com ---
(In reply to Alex Deucher from comment #128)
> Do these patches help?
> https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes&id=c46e5df4ac898108da66a880c4e18f69c74f6c1b
> https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes&id=c02d6a161395dfc0c2fdabb9e976a229017288d8

Yes, these patches fix the problem. 

amdgpu: [powerplay] Failed to send message 0x28, response 0x0
amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!
amdgpu :03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed
on sdma0 (-110).
amdgpu :03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed
on page0 (-110).
amdgpu: [powerplay] Failed to send message 0x26, response 0x0
amdgpu: [powerplay] Failed to set soft min gfxclk !
amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
amdgpu :03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed
on sdma1 (-110).
[drm:process_one_work] *ERROR* ib ring test failed (-110).

In general, the system is stable.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-22 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #133 from Anthony Rabbito  ---
Created attachment 145459
  --> https://bugs.freedesktop.org/attachment.cgi?id=145459=edit
dsmeg log with Alex's patches

Here's my dmesg with Alex's patches. Going to mess around and see what I can
find.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-22 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #129 from Tom B  ---
Thank you Alex! That has fixed it! The card is now correctly setting its
voltages and clocks. I applied the patch to 5.3.1

However, I've noticed a few very minor problems that are probably worth
reporting.

1. I still get this in dmesg:


[6.307005] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[6.307006] amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!
[9.225192] amdgpu :44:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR*
IB test failed on sdma0 (-110).
[   10.238621] amdgpu :44:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR*
IB test failed on page0 (-110).
[   10.532004] amdgpu: [powerplay] Failed to send message 0x26, response 0x0
[   10.532005] amdgpu: [powerplay] Failed to set soft min gfxclk !
[   10.532006] amdgpu: [powerplay] Failed to upload DPM Bootup Levels!


This doesn't really matter, though. We were focusing on this error earlier in
the thread because it looked like `Set hard min uclk failed!` was the cause of
the problem, but evidently it isn't.

2. This repeats indefinitely in dmesg:

[  332.575747] [drm] schedsdma0 is not ready, skipping
[  332.582657] [drm] schedsdma0 is not ready, skipping
[  332.582864] [drm] schedsdma0 is not ready, skipping
[  332.708848] [drm] schedsdma0 is not ready, skipping
[  332.715975] [drm] schedsdma0 is not ready, skipping
[  332.716229] [drm] schedsdma0 is not ready, skipping
[  332.756987] [drm] schedsdma0 is not ready, skipping
[  332.763970] [drm] schedsdma0 is not ready, skipping
[  332.764169] [drm] schedsdma0 is not ready, skipping


As you can see, this gets written to dmesg several dozen times a second. This
might be because the patches are intended to be used on 5.4?

3. The lowest wattage now seems to be 33w rather than 23w which means increased
idle power usage and temps. This isn't really a problem but I thought it was
worth mentioning and is a fair tradeoff for stability.
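
To put a number on the repetition described in item 2, the timestamps in a
dmesg excerpt can be counted. This is a minimal sketch, not a tool used in this
thread: the function name message_rates is made up for illustration, and it
assumes the default dmesg format with a "[seconds.micros]" prefix.

```python
import re
from collections import defaultdict

# Matches lines such as "[  332.575747] [drm] schedsdma0 is not ready, skipping".
DMESG_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def message_rates(lines):
    """Return {message: (count, messages_per_second)} for timestamped dmesg lines."""
    stamps = defaultdict(list)
    for line in lines:
        match = DMESG_LINE.match(line.strip())
        if match:
            stamps[match.group(2)].append(float(match.group(1)))
    rates = {}
    for message, times in stamps.items():
        span = max(times) - min(times)
        # A single occurrence (or identical timestamps) has no measurable rate.
        rate = len(times) / span if span > 0 else 0.0
        rates[message] = (len(times), rate)
    return rates


# First four lines of the excerpt quoted above.
sample = [
    "[  332.575747] [drm] schedsdma0 is not ready, skipping",
    "[  332.582657] [drm] schedsdma0 is not ready, skipping",
    "[  332.582864] [drm] schedsdma0 is not ready, skipping",
    "[  332.708848] [drm] schedsdma0 is not ready, skipping",
]
for message, (count, rate) in message_rates(sample).items():
    print(f"{count:4d}x  {rate:6.1f}/s  {message}")
```

Feeding it the output of a full dmesg capture instead of the short sample would
show which messages dominate the log.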

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-22 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #131 from Tom B  ---
In addition to my previous comment: the indefinitely repeating
`[drm] schedsdma0 is not ready, skipping` message stops after a suspend/resume.
Once the machine has resumed the messages no longer appear, and suspend and
resume themselves work correctly.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-22 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #132 from Anthony Rabbito  ---
Created attachment 145458
  --> https://bugs.freedesktop.org/attachment.cgi?id=145458=edit
linux-mainline5.3 dmesg without patches

Here's my current dmesg with two out of three monitors running without the
patches Alex provided. I'm currently compiling the kernel with his patches to
look at the differences and see if I can get my third monitor to boot up.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-22 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #134 from Anthony Rabbito  ---
Wow! All three of my monitors are working again at 2560x1440 @ 144Hz.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-22 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #135 from Adrian Brown  ---
@reddestdream Thanks. I don't think the active adapter is the problem as it
works perfectly with my Vega 64. However I will try 18.04 and AMD's driver as
suggested.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-22 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #136 from tom91...@gmail.com ---
Been following this thread for a while now as I just got 3 4k 60Hz monitors
connected to the 3 DP ports on my Radeon VII. 
I'm getting the exact same errors discussed in this report with matching dmesg
outputs.

I've applied the patches to Fedora 31's 5.3.0-3 kernel and everything now works
perfectly!

Just a few notes:

* Idle power draw before patch was 22W in lm_sensors, now it's reading 28W,
makes sense as the memory is now properly clocked. This also loosely matches
@Tom B's results.

* I did not get the repeated `[drm] schedsdma0 is not ready, skipping` in
dmesg, however, it is still possible to trigger a freeze by toggling dpms:

xset dpms force off

Resulting in:

[  155.431068] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[  155.431070] amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!
[  161.334003] amdgpu: [powerplay] Failed to send message 0x26, response 0x0
[  161.334004] amdgpu: [powerplay] Failed to set soft min gfxclk !
[  161.334005] amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
[  164.622060] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[  164.622062] amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!


Previously, without the patch, the machine hung. With the patch, the displays
freeze for a few seconds and then power off. Mouse movement correctly turns all
screens back on and everything is back to normal.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-22 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #130 from Anthony Rabbito  ---
(In reply to Alex Deucher from comment #128)
> Do these patches help?
> https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes=c46e5df4ac898108da66a880c4e18f69c74f6c1b
> https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes=c02d6a161395dfc0c2fdabb9e976a229017288d8

I will try to apply these patches in a few hours. Though I must say that in 5.3
things have been much better. Not perfect, and I haven't tried triple monitor
yet, but definitely an improvement.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-20 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #128 from Alex Deucher  ---
Do these patches help?
https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes=c46e5df4ac898108da66a880c4e18f69c74f6c1b
https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes=c02d6a161395dfc0c2fdabb9e976a229017288d8

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-20 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #127 from Alex Deucher  ---
(In reply to Tom B from comment #15)
> Have been running 5.0 since release without issue but upgraded this morning
> and got crashes as described here within a few seconds of boot. 
>

Can you bisect between 5.0 and 5.1 and see what commit caused the regression?

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-18 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #126 from ReddestDream  ---
@Adrian Brown Your Linux issue is potentially related to the active adapter.
Have you tried w/o it?

On Windows, the flickering on/around login, at least for me, has been mostly
resolved by using the latest AMD driver + Windows 10 1903 and all the recent
updates. There was a Windows update about a month ago that resolved a lot of
flickering issues by fixing a bug in Windows's 10-bit color support.

Also, if you are using Ubuntu, it might be worth downgrading to 18.04.3 so that
you can use the Radeon Software for Linux Driver:

https://www.amd.com/en/support/graphics/amd-radeon-2nd-generation-vega/amd-radeon-2nd-generation-vega/amd-radeon-vii

Currently, I hear that using AMD's driver + a supported distro is the best way
to get stability out of Radeon VII. And it's something I will probably end up
trying myself if there's no resolution to the issues forthcoming with 5.4,
which will be the new LTS.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-18 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #125 from Adrian Brown  ---
I am also getting frequent crashes with a Radeon VII on Kubuntu 19.10 (kernel
5.0.0-29-generic). I see there is some discussion in this thread about it
possibly being related to multiple monitors. But I don't think that's the case.
I have a single monitor but it is old with only a dual link DVI connection. So
I am using displayport on the GPU but connected to an active adapter to convert
DP to a dual link DVI connection (my monitor is a Dell 3007WFP running at
2560x1600).

I often get crashes soon after boot. They tend to happen in clusters so it
crashes a few times, then stays stable for a short time and then crashes again.
I don't get these crashes on the same system when dual booted into Windows 10
so the hardware itself seems good. 

One thing worth mentioning is that on Windows 10 I occasionally get a black
screen and the monitor goes off for a couple of seconds. It then comes back to
life. Apparently this is not uncommon and the suspicion in the Windows
community is that AMD drivers sometimes crash but Windows recovers (I never had
this with my Vega 64, only with the Radeon VII). It most likely is a completely
different issue of course, but thought it worth mentioning.

Still hoping for a fix at some point. Also happy to help test any fix.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-09-03 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #124 from ReddestDream  ---
Created attachment 145254
  --> https://bugs.freedesktop.org/attachment.cgi?id=145254=edit
Dmesg 5.3-rc7 w/ Two monitors

This issue is still not fixed on 5.3-rc7. I guess we will probably have to wait
until 5.4 (the next LTS) before more people take a look at this issue. :(

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-30 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #123 from ReddestDream  ---
A few interesting fixes that touch vega20_hwmgr.c have rolled in from
drm-fixes:

The first is likely the most interesting for our issues, as it touches
min/maxes (tho only the soft ones it seems). The other two are related to SMU
versions.

https://github.com/torvalds/linux/commit/83e09d5bddbee749fc83063890244397896a1971

https://github.com/torvalds/linux/commit/21649c0b6b7899f4fa3099c46d3d027f60b107ec

https://github.com/torvalds/linux/commit/23b7f6c41d4717b1638eca47e09d7e99fc7b9fd9

I haven't tested them out yet, but it does give me some hope that someone is
still looking at Vega 20/Radeon VII . . .

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-27 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #122 from ReddestDream  ---
Tested 5.3-rc6. Still has the same issues. Only it's maybe actually worse
because I lose display completely when I use amdgpu.dpm=2 w/Radeon VII
multimonitor on 5.3-rc6, whereas on 5.2.9 I just got same/similar errors to
default.

I'm working on a kernel fork of 5.3-rc6 where I'm reverting various things and
adding things in from Vega 10/12 and Navi to see if it helps. Haven't compiled
and tested it yet but since I know 5.3-rc6 itself boots, compiles, and
demonstrates the issue I guess it's a good base until 5.3 releases.

https://github.com/ReddestDream/linux

Any ideas anyone has are appreciated.

For now I actually find that amdgpu.dpm=0 with both 4K monitors on Radeon VII
allows for much snappier generic desktop than my previous setup with AMD+iGPU.
It's amazing how well this card runs 4K displays w/o any proper memory clock
management at all. I'm sure the gaming performance would be pretty bad tho, but
I have Windows for that for now . . .

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-25 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #121 from ReddestDream  ---
Some observations:

1. Nothing at all seems to be up with cur_speed and cur_width. They get set
several times in a row in both runs, but the values are all the same in both.

2. I can't really see anything up with msg/parameter either. When I compare
them to each other nothing seems particularly wacky. And we also have an
instance in my AMD+iGPU run where we see msg/parameter after "[drm] Initialized
amdgpu", so the theory that all messages have to be sent before Initialization
is complete must be wrong.

Now the real question is if we can decode what these msg/parameter values mean.
But it looks more likely to me that vega20_hwmgr.c and vega20_ppt.c are just
bugged somewhere (probably in the same way since they seem to be alternate
versions of each other) and that the rest of the amdgpu code is (relatively)
fine.

I'm thinking we'll have to go through and knock out/debug pretty much
everything in those files until we figure out where the breakage is. That's
about 3000-4000 lines of code in each of those two files tho. So any thoughts
anyone has about where we should start would be helpful. My focus will probably
be on UCLK (since it seems to break first), SCLK (since it gets set to 0 MHz
when there's multiple displays), DCEFCLK, and basically anything else that
smells like it might control the memory clock and/or be affected by multiple
monitors.

Thanks!

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-25 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #120 from ReddestDream  ---
Created attachment 145159
  --> https://bugs.freedesktop.org/attachment.cgi?id=145159=edit
DebugAMDiGPU

Also here is the AMD + iGPU one.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-25 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #119 from ReddestDream  ---
Created attachment 145158
  --> https://bugs.freedesktop.org/attachment.cgi?id=145158=edit
DebugAMD2Monitors

>I don't think I have time to try it today but if anyone is recompiling the 
>code adding
>pr_err("msg: %d / parameter: %d\n", msg, parameter); 
>to this function in smumgr.c would be a useful addition.


So, I've done just this. I also added a speed/width check to
amdgpu_device_get_min_pci_speed_width in amdgpu_device.c to check the values of
cur_speed and cur_width.

I ran two checks with 5.2.9, one with two monitors on the Radeon VII and another
with my stable setup of one monitor each on the Radeon VII and the Intel iGPU.

Please find them attached.

Thanks so much for all your help!

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-25 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #118 from ReddestDream  ---
So, this is a crazy idea, but ironically I think it might be getting closer to
the truth.

Tom B. attempted reverting ad51c46eec739c18be24178a30b47801b10e0357, which was
known to cause some issue with an RX 580. He found that doing so fixed the
multimonitor crash but locked the card to the lowest possible memory speed,
which really isn't acceptable.

Perhaps our issue is connected to insufficient or improperly calculated
PCIe bandwidth/speed. Speed mismatches can and will cause messages to not go
through to the peripheral. It's also well-known that Radeon VII was originally
a PCIe 4.0 card that AMD locked down to the 3.0 speeds . . .

What if when using multiple monitors and/or higher clock speeds Radeon VII uses
more bandwidth than Linux expects, causing the loss of communication?

Something else I plan to investigate.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-25 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #116 from ReddestDream  ---
Created attachment 145153
  --> https://bugs.freedesktop.org/attachment.cgi?id=145153=edit
dmesgAMD2Monitors

I've been doing a few tests. I looked into and compiled 5.3-rc5 along with
these patches, but nothing seemed to resolve our multimonitor issue. :/

https://phoronix.com/scan.php?page=news_item=AMDGPU-Multi-Monitor-vRAM-Clock

I've also gotten some dmesg output with 5.2.9 with amdgpu.dc_log=1
drm.debug=0x1e log_buf_len=2M. Turns out that amdgpu.dc_log=1 does nothing on
this kernel, but I didn't know this when I ran the tests. The interesting added
data appears to be coming from drm.debug=0x1e.
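
For readers wondering what drm.debug=0x1e actually switches on, the value is a
bitmask of DRM debug categories. The sketch below decodes it; the bit
assignments are written from memory of the kernel's include/drm/drm_print.h for
kernels of this era, so treat the table as an assumption and check it against
your own tree. The helper name decode_drm_debug is invented for this example.

```python
# Category bits as recalled from include/drm/drm_print.h (assumption: verify
# against the kernel version you are actually running).
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
    0x40: "STATE",
    0x80: "LEASE",
    0x100: "DP",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories enabled by a drm.debug bitmask."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


print(decode_drm_debug(0x1e))  # ['DRIVER', 'KMS', 'PRIME', 'ATOMIC']
```

Under that table, 0x1e leaves out the CORE category (bit 0x01) and everything
from VBL upward.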

I have two (physically) identical LG 24UD58-B 4K60 monitors connected via DP.
One test was done with both monitors connected to Radeon VII, and the other was
done using my stable Intel+Radeon VII setup where one monitor is connected to
Radeon VII and the other is connected to the Intel iGPU (HD 630, also via DP at
4K60).

These dmesg dumps were taken with all DMs/DEs/Graphics disabled in order to
limit interference. The system was booted to a text commandline at native
resolution.

Since 5.3 isn't changing anything, I plan to do a recompile of 5.2.9 (or 5.2.10
if it's out for Arch) with the smum_send_msg_to_smc_with_parameter patch
suggested by Tom B.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-17 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #115 from Tom B  ---
I should have noted it earlier, but I had already tried reverting both "golden
values" commits. I've no idea what it does but it didn't fix this crash.

One thing that would be insightful would be logging every call to
smum_send_msg_to_smc_with_parameter and printing out message/parameter:

int smum_send_msg_to_smc_with_parameter(struct pp_hwmgr *hwmgr,
					uint16_t msg, uint32_t parameter)
{
	/* ... existing function body ... */
}

This would cause a very busy log but we could see the last successful message
that was sent and with the same log in 5.0.13 see if there are any obvious
differences. It might be that the previous message causes the invalid state so
knowing what that is could lead us towards the solution.

I don't think I have time to try it today but if anyone is recompiling the code
adding

pr_err("msg: %d / parameter: %d\n", msg, parameter); 

to this function in smumgr.c would be a useful addition.
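
Once a kernel with that pr_err line has produced a log, finding the last
messages sent before the first failure can be automated. This is a rough
sketch under two assumptions: the log lines contain the exact
"msg: N / parameter: N" text from the pr_err above, and the failure is reported
as "Failed to send message 0x..". The function name and the sample values are
invented for illustration and are not taken from any real log.

```python
import re

# "%d" can print a negative number when the high bit of the parameter is set.
MSG_LINE = re.compile(r"msg: (-?\d+) / parameter: (-?\d+)")
FAILURE = re.compile(r"Failed to send message (0x[0-9a-fA-F]+)")


def last_messages_before_failure(lines, keep=3):
    """Return (recent (msg, parameter) pairs, failing msg id) up to the first failure."""
    recent = []
    for line in lines:
        failure = FAILURE.search(line)
        if failure:
            return recent[-keep:], int(failure.group(1), 16)
        logged = MSG_LINE.search(line)
        if logged:
            recent.append((int(logged.group(1)), int(logged.group(2))))
    return recent[-keep:], None


# Invented sample lines, only to show the expected shape of the input.
sample = [
    "msg: 5 / parameter: 1",
    "msg: 7 / parameter: 0",
    "msg: 40 / parameter: 262144",
    "amdgpu: [powerplay] Failed to send message 0x28, response 0x0",
    "msg: 38 / parameter: 0",
]
recent, failed = last_messages_before_failure(sample, keep=2)
print(recent, hex(failed))
```

Note that the pr_err prints decimal while the failure line prints hex (0x28 is
40), and that, depending on where in the function the pr_err is placed, the
failing message itself may be the last pair logged. Keeping a few pairs rather
than one makes the message before it visible too.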

Also, for anyone who wants to try recompiling, here's a quick guide for Arch:

1. Get the kernel sources using asp as described here:
https://wiki.archlinux.org/index.php/Kernel/Arch_Build_System then navigate to
the created linux/repos/core-x86_64 directory.

2. You will need to run makepkg -s once to get it to download the sources.

3. You can set the kernel version in PKGBUILD: e.g. _srcver=5.2.7-arch1 or
_srcver=5.0.13-arch1

4. If you want to revert one or more commits put it in the prepare() block
before local src:

  echo "$_kernelname" > localversion.20-pkgname

  git revert db64a2f43c1bc22c5ff2d22606000b8c3587d0ec --no-edit
  git revert f5e79735cab448981e245a41ee6cbebf0e334f61 --no-edit

  local src

It will open your editor, if you don't want to use vi set:


5. For making changes to the code you need to make a patch. Open the
src/archlinux-linux directory. The files you're interested in are in
drivers/gpu/drm/amd/powerplay, most likely hwmgr/vega20_hwmgr.c. Make your
changes to the code. You can't just re-run makepkg as it checks out the original
version of the code. After making changes, navigate to the archlinux-linux
directory and run git diff > ../../vii.patch

6. Add your patch to PKGBUILD source: 

source=(
  "$_srcname::git+https://git.archlinux.org/linux.git?signed#tag=v$_srcver"
  config # the main kernel config file
  60-linux.hook  # pacman hook for depmod
  90-linux.hook  # pacman hook for initramfs regeneration
  linux.preset   # standard config files for mkinitcpio ramdisk
  vii.patch
)

7. I've been cheating with makepkg and getting it to skip hash checks as
otherwise you have to generate the sha256sums for each patch you create. This
is an extra step that only slows down testing. To compile/install run makepkg
-si --skipinteg

Because of the way makepkg works, it keeps the compiled code in the src
directory. That means that although the first compile will take a few minutes,
subsequent compiles will be a lot faster as it'll probably only be recompiling
vega20_hwmgr.c

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-16 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #114 from ReddestDream  ---
5. Tom B., it is probably worth getting a full dmesg with your two monitors
plugged in on a relatively new 5.2.x kernel using at least: amdgpu.dc_log=1
drm.debug=0x1e log_buf_len=2M

And anything else you might think of. Just to try to get more debug info. Thx!

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-16 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #113 from ReddestDream  ---
4. 

> Given that two different versions of the code produce the same result, my 
> hunch is that the problem is B. The card is not in a state where it's able to 
> receive power changes.

Something to consider: In pretty much all the dmesg logs we see, amdgpu
attempts to reset the GPU, sometimes successfully, and yet it still can't
properly message the GPU afterward and we see the same sequence of failures
starting with "amdgpu: [powerplay] Failed to send message 0x28, response 0x0
amdgpu: [powerplay] [SetUclkToHightestDpmLevel] Set hard min uclk failed!"

Eventually we start to see: "[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to
initialize parser -125!"

This comes from:

https://github.com/torvalds/linux/commits/master/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c

I'm not sure what the -125 error code indicates. My guess is ECANCELED
(Operation Cancelled) as the negated error code 125.

https://github.com/torvalds/linux/blob/master/include/uapi/asm-generic/errno.h
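
The lookup can be sketched as a small table. Only the two codes that appear in
this thread are included (-110 from the IB test failures and -125 from the
parser error), with values copied from asm-generic/errno.h. The function name
decode_kernel_error is made up for this example, and the table is deliberately
hard-coded because Python's own errno module follows the numbering of whatever
platform it runs on.

```python
# Values copied from include/uapi/asm-generic/errno.h (Linux numbering).
KERNEL_ERRNO = {
    110: ("ETIMEDOUT", "Connection timed out"),
    125: ("ECANCELED", "Operation Canceled"),
}


def decode_kernel_error(code):
    """Translate a negated kernel return code such as -125 into its errno name."""
    name, text = KERNEL_ERRNO.get(abs(code), ("UNKNOWN", "not in this table"))
    return f"{code}: {name} ({text})"


for code in (-110, -125):
    print(decode_kernel_error(code))
```

So -125 does map to ECANCELED under the generic numbering, and the -110 seen
in the IB ring test failures is ETIMEDOUT.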

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-16 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #112 from ReddestDream  ---
More ideas:

3. Looking through the crash in sehellion's comment 45:

gfx_v9_0_ring_test_ring+0x19e/0x230 [amdgpu]
amdgpu_ring_test_helper+0x1e/0x90 [amdgpu]
gfx_v9_0_hw_fini+0x299/0x690 [amdgpu]
amdgpu_device_ip_suspend_phase2+0x6c/0xa0 [amdgpu]
amdgpu_device_ip_suspend+0x44/0x80 [amdgpu]
amdgpu_device_pre_asic_reset+0x1ef/0x204 [amdgpu]
amdgpu_device_gpu_recover+0x7b/0x7a3 [amdgpu]
amdgpu_job_timedout+0xfc/0x120 [amdgpu]

We see gfx_v9_0_ring_test and gfx_v9_0_hw_fini which both come from:

https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c

There's a 5.1-rc1 commit in this file pertaining to a "wave ID mismatch" that
could cause deadlocks.

https://github.com/torvalds/linux/commit/41cca166cc57e75e94d888595a428d23a3bf4e36

Along with updated "golden values" for Vega in 5.1-rc1:

https://github.com/torvalds/linux/commit/919a94d8101ebc29868940b580fe9e9811b7dc86

https://github.com/torvalds/linux/commit/f7b1844bacecca96dd8d813675e4d8adec02cd66

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-16 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #111 from ReddestDream  ---
A few other ideas to ponder:

1. Looking into DPM, I found this commit for 5.1-rc1 that looks interesting:

https://github.com/torvalds/linux/commit/7ca881a8651bdeffd99ba8e0010160f9bf60673e

It looks like it exposes a "ppfeatures" interface on Vega 10 and later GPUs,
including some code for Vega 20.

2. I also found two interesting commits that pertain to "doorbell" register
initialization on Vega 20. Also from 5.1-rc1. Might be related to setting up
the GPU ASICs. I must admit I'm not exactly sure what these do . . .

https://github.com/torvalds/linux/commit/fd4855409f6ebe015406cd2b2ffa4fee4cd1f4a7

https://github.com/torvalds/linux/commit/828845b7c86c5338f6ca02f4b525718f31b2

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-16 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #110 from ReddestDream  ---
> 1. The functions in vega20_ppt.c are used with this new patch so that answers 
> my question from earlier, that's what this file is for and why it contains 
> similar/identical functions.

I was hoping this was the case as the duplicated functions were confusing me
too. Glad we got this figured out! :)

> I tried it, it didn't help the crashing issue and I was stuck at 30w. As soon 
> as I started sddm the system froze. I've attached my dmesg from amdgpu.dpm=2 
> boot. It doesn't fix the issue but it does help answer a few questions I had:

This is disappointing tho. I was hoping that setting amdgpu.dpm=2 would use the
more "actively developed" path and that would fix the issue. :/

> Given that two different versions of the code produce the same result, my 
> hunch is that the problem is B. The card is not in a state where it's able to 
> receive power changes.

I tend to agree, but it's still not clear why or how the card ends up in a bad
state when commands to it via smu_send_smc_msg_with_param seem to just suddenly
stop working. And given the amount of same/similar functions in vega20_hwmgr.c
and vega20_ppt.c it's hard to rule out A entirely.

Since amdgpu.dpm=0 resolves the issue (albeit at the cost of being stuck at
minimum clocks inherited from the VBIOS/GOP/UEFI/firmware), it seems that the
card is starting out in a reasonable state and then being thrown into a bad
state later by bad driver code. And that code is part of the DPM (Dynamic Power
Management) system. We are pretty confident that dpm_state.hard_min_level is
stable the whole time, so that's probably not what's throwing the card into a
bad state. But perhaps another value in the DPM table is . . . 

It doesn't make intuitive sense that the soft min/max values would be
problematic since they are presumably "more flexible," but it's possible that
they get calculated out of spec or something and logging them should be
possible like how dpm_state.hard_min_level was logged.

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-16 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #109 from Tom B  ---
Created attachment 145080
  --> https://bugs.freedesktop.org/attachment.cgi?id=145080=edit
dmesg with amdgpu.dpm=2

> Tom B., did you try booting with amdgpu.dpm=1 or amdgpu.dpm=2 (default is 
> generally -1 for automatic)? Seems like one of those might enable the new 
> experimental SW SMU v11 feature on Vega20 . . .

Now that is interesting. dpm=-1 is the same as the default, and the default is 1
(enabled), so dpm=1 is what we've been using all along. But dpm=2 and the patch
you linked to are interesting.

I tried it. It didn't help with the crashing issue and I was stuck at 30 W. As
soon as I started sddm the system froze. I've attached my dmesg from the
amdgpu.dpm=2 boot. It doesn't fix the issue but it does help answer a few
questions I had:


1. The functions in vega20_ppt.c are used with this new patch so that answers
my question from earlier, that's what this file is for and why it contains
similar/identical functions.

2. It explains the difference I found in comment 97: This commit
https://github.com/torvalds/linux/commit/94ed6d0cfdb867be9bf05f03d682980bce5d0036
has the new else block for smu_display_configuration_change which we now know
is the software version of this function.


More importantly, though, knowing that enabling DPM causes the crash, this
tells us either:

A) The bug is present in both versions of the vega20 code: vega20_hwmgr.c and
vega20_ppt.c, or

B) The card reaches an invalid state before DPM is initialised and the card is
fine until it receives a DPM change.

Given that two different versions of the code produce the same result, my hunch
is that the problem is B. The card is not in a state where it's able to receive
power changes.


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-16 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #108 from ReddestDream  ---
> Booting with amdgpu.dpm=0 on 5.2.7 works.

Tom B., did you try booting with amdgpu.dpm=1 or amdgpu.dpm=2 (default is
generally -1 for automatic)? Seems like one of those might enable the new
experimental SW SMU v11 feature on Vega20 . . .

https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html

https://lists.freedesktop.org/archives/amd-gfx/2019-January/030788.html?print=anzwix


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-16 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #107 from ReddestDream  ---
> Booting with amdgpu.dpm=0 on 5.2.7 works.

> It is a DPM issue of some kind so although my earlier tests showed that 
> hard_min_level was set correctly, it still could be an issue elsewhere in the 
> DPM table.

Great news! At least now we have a better place to investigate . . .


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-16 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #106 from Tom B  ---
Booting with amdgpu.dpm=0 on 5.2.7 works.

Performance is poor and as expected I cannot get any information about power
states because /sys/kernel/debug/dri/0/amdgpu_pm_info doesn't exist. I'm
guessing it runs at minimum clocks as I get ~10-17fps in unigine-heaven instead
of ~60-100. 

It is a DPM issue of some kind so although my earlier tests showed that
hard_min_level was set correctly, it still could be an issue elsewhere in the
DPM table.


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-16 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #105 from Tom B  ---
> Also, I considered that both of my monitors have audio out support. I wonder 
> if audio initialization might be the missing piece to the puzzle, the thing 
> that interrupts/changes the state of the card and prevents 
> smu_send_smc_msg_with_param from working where it did before. I know that in 
> the past with previous AMD cards, display audio has been buggy . 

I just tried setting amdgpu.audio=0 and it didn't help. That doesn't rule out
audio entirely, though: the audio backend is probably still used as part of the
connection to the monitor, and I'd imagine the option just prevents the card
from appearing as an output device.


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-16 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #104 from Tom B  ---
I did get very similar crashing when I was running HDMI + DP at different
refresh rates ( see https://bugs.freedesktop.org/show_bug.cgi?id=110510 ). I
switched to DP + DP because HDMI+DP wasn't stable, it could be related.

The tl;dr from that bug report (this was on 5.0.9):

- HDMI alone at 60hz works but the screen flickers off every 3-5 minutes
- HDMI alone works at 59.9hz without any flickering
- HDMI 60hz + DP 60hz works, but the HDMI screen flickers off every 3-5 minutes
- HDMI 59.94hz + DP 60hz freezes the PC instantly.

Unfortunately my monitors don't support DisplayPort at 59.94hz so I couldn't
test that combination, though I think it would have worked.

Still, it does tell us that these could be related and the issue could be
syncing between the two displays.
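
To put numbers on how far apart two such displays run, here is a small arithmetic sketch in C. It only computes frame periods and how long two free-running refresh rates take to drift apart by one full frame; it says nothing about whether that drift is what freezes the machine. The function names are made up for this example.

```c
#include <assert.h>

/* Frame period in milliseconds for a refresh rate given in Hz. */
double frame_period_ms(double hz)
{
    assert(hz > 0.0);
    return 1000.0 / hz;
}

/*
 * Seconds until two free-running displays have drifted apart by one
 * full frame, i.e. the time between moments where their vblanks line up.
 */
double realign_interval_s(double hz_a, double hz_b)
{
    double diff = hz_a > hz_b ? hz_a - hz_b : hz_b - hz_a;

    assert(diff > 0.0);
    return 1.0 / diff;
}
```

For 60 Hz and 59.94 Hz the frame periods come out at roughly 16.667 ms and 16.683 ms, and the vblanks line up again about every 16.7 seconds.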


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-16 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #103 from Peter Hercek  ---
I boot in BIOS mode and I'm still getting these errors. Though they are rare in
my case with the "better" kernels (around once a week).

Just a note: there were tearing errors in the Windows drivers for Radeon VII too.
One of the reasons was a different refresh rate on different monitors. They
recommended setting all refresh rates to 60 Hz or a multiple of it until it is
fixed. In my case that is not completely possible (one monitor supports 60 Hz,
but the other two monitors support only 59.95 Hz), so I have a slight difference
in the frequencies.


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-16 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #102 from Tom B  ---
> Grasping at straws a bit here, but it occurred to me that maybe Linux kernel 
> testing on Radeon VII was done on an early VBIOS that didn't have full UEFI 
> support yet. We know that AMD had to issue a VBIOS update for Radeon VII to 
> fix UEFI support shortly after the launch. So maybe enabling the CSM/Legacy 
> Support in the BIOS, which does impact early GPU initialization, might have 
> some effect on the multimonitor problem? Something I plan to test, but I 
> wanted to share the idea in case someone else has a chance first.

Unfortunately I had already tried that. I tried the following BIOS options:

CSM on/off
IOMMU on/off
PCIE speed 16x/4x (the only options my motherboard allowed for some reason)

Having said that, I didn't try booting using grub in BIOS mode as I didn't
want to change my partition table, so it's possible that although I had
enabled CSM, it was only legacy support and the system was still booting in
UEFI mode.


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-15 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #101 from ReddestDream  ---
Grasping at straws a bit here, but it occurred to me that maybe Linux kernel
testing on Radeon VII was done on an early VBIOS that didn't have full UEFI
support yet. We know that AMD had to issue a VBIOS update for Radeon VII to fix
UEFI support shortly after the launch. So maybe enabling the CSM/Legacy Support
in the BIOS, which does impact early GPU initialization, might have some effect
on the multimonitor problem? Something I plan to test, but I wanted to share
the idea in case someone else has a chance first.

>This might not mean anything, but it could be another clue that initialization 
>is happening before the card is really ready.

Also, I considered that both of my monitors have audio out support. I wonder if
audio initialization might be the missing piece to the puzzle, the thing that
interrupts/changes the state of the card and prevents
smu_send_smc_msg_with_param from working where it did before. I know that in
the past with previous AMD cards, display audio has been buggy . . .


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-14 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #100 from Tom B  ---
I've been trying to work backwards to find the place where screens get
initialised and eventually call vega20_pre_display_configuration_changed_task.

vega20_pre_display_configuration_changed_task is exported as
pp_hwmgr_func::display_config_changed

Which is called from hardwaremanager.c:phm_pre_display_configuration_changed

phm_pre_display_configuration_changed is called from
hwmgr.c:hwmgr_handle_task:

	switch (task_id) {
	case AMD_PP_TASK_DISPLAY_CONFIG_CHANGE:
		ret = phm_pre_display_configuration_changed(hwmgr);


hwmgr_handle_task is called from pp_dpm_dispatch_tasks, which is exported as
amd_pm_funcs::dispatch_tasks and called from amdgpu_dpm_dispatch_task, which
is called in amdgpu_pm.c:


void amdgpu_pm_compute_clocks(struct amdgpu_device *adev)
{
	int i = 0;

	if (!adev->pm.dpm_enabled)
		return;

	if (adev->mode_info.num_crtc)
		amdgpu_display_bandwidth_update(adev);

	for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
		struct amdgpu_ring *ring = adev->rings[i];
		if (ring && ring->sched.ready)
			amdgpu_fence_wait_empty(ring);
	}

	if (is_support_sw_smu(adev)) {
		struct smu_context *smu = &adev->smu;
		struct smu_dpm_context *smu_dpm = &adev->smu.smu_dpm;
		mutex_lock(&(smu->mutex));
		smu_handle_task(&adev->smu,
				smu_dpm->dpm_level,
				AMD_PP_TASK_DISPLAY_CONFIG_CHANGE);
		mutex_unlock(&(smu->mutex));
	} else {
		if (adev->powerplay.pp_funcs->dispatch_tasks) {
			if (!amdgpu_device_has_dc_support(adev)) {
				mutex_lock(&adev->pm.mutex);
				amdgpu_dpm_get_active_displays(adev);
				adev->pm.pm_display_cfg.num_display =
					adev->pm.dpm.new_active_crtc_count;
				adev->pm.pm_display_cfg.vrefresh =
					amdgpu_dpm_get_vrefresh(adev);
				adev->pm.pm_display_cfg.min_vblank_time =
					amdgpu_dpm_get_vblank_time(adev);
				/* we have issues with mclk switching with
				   refresh rates over 120 hz on the non-DC code. */
				if (adev->pm.pm_display_cfg.vrefresh > 120)
					adev->pm.pm_display_cfg.min_vblank_time = 0;
				if (adev->powerplay.pp_funcs->display_configuration_change)
					adev->powerplay.pp_funcs->display_configuration_change(
						adev->powerplay.pp_handle,
						&adev->pm.pm_display_cfg);
				mutex_unlock(&adev->pm.mutex);
			}
			amdgpu_dpm_dispatch_task(adev,
				AMD_PP_TASK_DISPLAY_CONFIG_CHANGE, NULL);
		} else {
			mutex_lock(&adev->pm.mutex);
			amdgpu_dpm_get_active_displays(adev);
			amdgpu_dpm_change_power_state_locked(adev);
			mutex_unlock(&adev->pm.mutex);
		}
	}
}


This is the only place I can see AMD_PP_TASK_DISPLAY_CONFIG_CHANGE being called
from, which eventually is where vega20_pre_display_configuration_changed_task
gets called.
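
The pattern behind this call chain (an ASIC-specific function "exported" through a struct of function pointers, then reached via a generic task dispatcher) can be shown with a stripped-down user-space model. Everything here is a hypothetical stand-in: the _model types, the task ids and the call counter are not the driver's real definitions, they only mirror the shape of the chain traced above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task ids; only the display-config one matters here. */
enum task_id_model {
    TASK_DISPLAY_CONFIG_CHANGE,
    TASK_OTHER
};

struct hwmgr_model;

/* Per-ASIC callbacks, in the style of a hwmgr function table. */
struct hwmgr_func_model {
    int (*display_config_changed)(struct hwmgr_model *hwmgr);
};

struct hwmgr_model {
    const struct hwmgr_func_model *funcs;
    int pre_display_calls; /* counts how often the callback ran */
};

/* Stand-in for the ASIC-specific implementation. */
static int asic_pre_display_configuration_changed_task(struct hwmgr_model *hwmgr)
{
    hwmgr->pre_display_calls++;
    return 0;
}

/* The "export": the ASIC function is stored in the table slot. */
static const struct hwmgr_func_model asic_funcs = {
    .display_config_changed = asic_pre_display_configuration_changed_task,
};

/* Generic layer: calls through the table if the ASIC filled the slot. */
static int pre_display_configuration_changed(struct hwmgr_model *hwmgr)
{
    if (hwmgr->funcs->display_config_changed == NULL)
        return 0;
    return hwmgr->funcs->display_config_changed(hwmgr);
}

/* Task dispatcher: only the display-config task reaches the callback. */
int handle_task(struct hwmgr_model *hwmgr, enum task_id_model id)
{
    assert(hwmgr != NULL && hwmgr->funcs != NULL);
    switch (id) {
    case TASK_DISPLAY_CONFIG_CHANGE:
        return pre_display_configuration_changed(hwmgr);
    default:
        return 0;
    }
}

void hwmgr_model_init(struct hwmgr_model *hwmgr)
{
    hwmgr->funcs = &asic_funcs;
    hwmgr->pre_display_calls = 0;
}
```

Because the ASIC function is only ever reached through the table slot, grepping for its name finds the assignment but not the callers, which is why the chain has to be traced by slot name and task id instead.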

Presumably the code:

	for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
		struct amdgpu_ring *ring = adev->rings[i];
		if (ring && ring->sched.ready)
			amdgpu_fence_wait_empty(ring);
	}



is what generates 


[3.683718] amdgpu :44:00.0: ring gfx uses VM inv eng 0 on hub 0
[3.683719] amdgpu :44:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[3.683720] amdgpu :44:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[3.683720] amdgpu :44:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[3.683721] amdgpu :44:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[3.683722] amdgpu :44:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[3.683722] amdgpu :44:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[3.683723] amdgpu :44:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[3.683724] amdgpu :44:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[3.683724] amdgpu :44:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[3.683725] amdgpu :44:00.0: ring sdma0 uses VM inv eng 0 on hub 1
[3.683726] amdgpu :44:00.0: ring page0 uses VM inv eng 1 on hub 1
[3.683726] amdgpu :44:00.0: ring sdma1 uses VM inv eng 4 on hub 1
[3.683727] amdgpu :44:00.0: ring page1 uses VM inv eng 5 on hub 1
[3.683728] amdgpu :44:00.0: ring uvd_0 uses VM inv eng 6 on hub 1
[3.683728] amdgpu :44:00.0: ring uvd_enc_0.0 uses 

[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-14 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #99 from Tom B  ---
Created attachment 145062
  --> https://bugs.freedesktop.org/attachment.cgi?id=145062=edit
a list of commits 5.0.13 - 5.1.0

Attached is a list of all amdgpu and powerplay commits from 5.0.13 - 5.1.0. 

I have tried reverting the following, which looked like the most likely culprits:

919a94d8101ebc29868940b580fe9e9811b7dc86 drm/amdgpu: fix CPDMA hang in PRT mode
for VEGA20

f7b1844bacecca96dd8d813675e4d8adec02cd66 drm/amdgpu: Update gc golden setting
for vega family

d25689760b747287c6ca03cfe0729da63e0717f4 drm/amdgpu/display: Keep malloc ref to
MST port -- a change to the way DisplayPort connectors are handled, looked
promising.

db64a2f43c1bc22c5ff2d22606000b8c3587d0ec drm/amd/powerplay: fix possible hang
with 3+ 4K monitors


I also looked at that last one in detail as it seems very close to this bug.
Nothing in the code looks for 3+ monitors or even 4k. It only actually looks
for > 1 monitor.

Although it's based on disable_mclk_switching, I also tried forcing
disable_fclk_switching to true and false; neither had any effect. The result is
that mclk would be calculated based on screens but fclk would be forced on/off.
It didn't help, but I can't help thinking that this commit is a little too close
to this issue to be irrelevant.


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-13 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #98 from Sylvain BERTRAND  ---
> The code seems very similar to what we see in
> vega20_notify_smc_display_config_after_ps_adjustment near where we get the
> "[SetHardMinFreq] Set hard min uclk failed!" Maybe this
> smum_send_msg_to_smc_with_parameter gets through where others fail because of
> the formatting or something?

It seems there is a patch from amd about smu v11 and this smc/smu command.
I may be wrong though.


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-13 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #97 from Tom B  ---
I've been investigating this:

https://github.com/torvalds/linux/commit/94ed6d0cfdb867be9bf05f03d682980bce5d0036

Because vega20 doesn't export display_configuration_change, it jumps to the
newly added else block and calls smu_display_configuration_change. This didn't
happen in 5.0.13. It's not the cause of this as I commented it out and it still
breaks. 
I'll also note that pp_display_cfg->display_count is correct at this point, it
shows 2 for me with 2 screens connected. But why doesn't vega20 export
display_configuration_change? It has display_config_changed and I can't find
where that's called from so I wonder if display_config_changed should be being
called at this point.


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-13 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #96 from Tom B  ---
Created attachment 145047
  --> https://bugs.freedesktop.org/attachment.cgi?id=145047=edit
logging anywhere the number of screens is set

Again, no closer to a fix but another thing to rule out. In addition to
SMU_MSG_NumOfDisplays, PPSMC_MSG_NumOfDisplays is also used.

I put a debug message anywhere PPSMC_MSG_NumOfDisplays or SMU_MSG_NumOfDisplays
is set and put else blocks in places where it may have been set:

	if ((data->water_marks_bitmap & WaterMarksExist) &&
	    data->smu_features[GNLD_DPM_DCEFCLK].supported &&
	    data->smu_features[GNLD_DPM_SOCCLK].supported) {

		pr_err("vega20_display_configuration_changed_task setting PPSMC_MSG_NumOfDisplays to %d\n",
		       hwmgr->display_config->num_display);

		result = smum_send_msg_to_smc_with_parameter(hwmgr,
				PPSMC_MSG_NumOfDisplays,
				hwmgr->display_config->num_display);
	} else {
		pr_err("vega20_display_configuration_changed_task not setting PPSMC_MSG_NumOfDisplays\n");
	}

	return result;
}


Here's what I found:

- The functions dealing with screens in vega20_ppt.c are never used (
vega20_display_config_changed, vega20_pre_display_config_changed) and can be
ignored for our further tests

- The line:

result = smum_send_msg_to_smc_with_parameter(hwmgr,
PPSMC_MSG_NumOfDisplays, hwmgr->display_config->num_display);

is never executed. It always triggers the else block, so PPSMC_MSG_NumOfDisplays
is never set using num_display.

- The same thing happens in 5.0.13. When I saw the above result I had hoped
that the problem was that smum_send_msg_to_smc_with_parameter(hwmgr,
PPSMC_MSG_NumOfDisplays, hwmgr->display_config->num_display); was never called
with the correct number of displays. Unfortunately the behaviour is the same on
5.0.13: PPSMC_MSG_NumOfDisplays is only ever set to zero in both versions of
the kernel.


Unfortunately this doesn't get us any closer.


The instruction is sent a lot more in 5.0.13 though. 

5.0.13:

[3.475471] amdgpu :44:00.0: ring vce1 uses VM inv eng 13 on hub 1
[3.475472] amdgpu :44:00.0: ring vce2 uses VM inv eng 14 on hub 1
[3.475508] amdgpu: [powerplay] vega20_display_configuration_changed_task not setting PPSMC_MSG_NumOfDisplays
[3.794037] amdgpu: [powerplay] vega20_pre_display_configuration_changed_task setting PPSMC_MSG_NumOfDisplays to 0
[3.800180] amdgpu: [powerplay] vega20_display_configuration_changed_task not setting PPSMC_MSG_NumOfDisplays
[3.833502] amdgpu: [powerplay] vega20_pre_display_configuration_changed_task setting PPSMC_MSG_NumOfDisplays to 0
[3.833647] amdgpu: [powerplay] vega20_display_configuration_changed_task not setting PPSMC_MSG_NumOfDisplays
[4.153232] [drm] Initialized amdgpu 3.27.0 20150101 for :44:00.0 on minor 0
[4.664044] amdgpu: [powerplay] vega20_pre_display_configuration_changed_task setting PPSMC_MSG_NumOfDisplays to 0


5.2.7
[3.711028] amdgpu :44:00.0: ring vce1 uses VM inv eng 13 on hub 1
[3.711028] amdgpu :44:00.0: ring vce2 uses VM inv eng 14 on hub 1
[4.086310] amdgpu: [powerplay] vega20_pre_display_configuration_changed_task setting PPSMC_MSG_NumOfDisplays to 0
[4.385470] [drm] Initialized amdgpu 3.32.0 20150101 for :44:00.0 on minor 0
[4.522398] amdgpu: [powerplay] Failed to send message 0x28, response 0x0

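Notice that in 5.0.13 the two tasks log five times between the ring lines and
the initialization line (vega20_pre_display_configuration_changed_task twice and
vega20_display_configuration_changed_task three times), but in 5.2.7 only
vega20_pre_display_configuration_changed_task logs, and only once.

This might not mean anything, but it could be another clue that initialization
is happening before the card is really ready.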
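
For anyone who wants to repeat this comparison on their own dmesg, the counting can be automated. The helper below is a sketch with a made-up name (count_before_marker); it counts how often a message appears before the first occurrence of a marker such as "[drm] Initialized amdgpu", and it works on the raw log text so line wrapping does not matter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count non-overlapping occurrences of `needle` in `log` that start
 * before the first occurrence of `marker`. If `marker` is absent, the
 * whole log is searched. An empty needle counts as zero matches.
 */
size_t count_before_marker(const char *log, const char *needle,
                           const char *marker)
{
    const char *end;
    const char *p;
    size_t needle_len;
    size_t count = 0;

    assert(log != NULL && needle != NULL && marker != NULL);
    needle_len = strlen(needle);
    if (needle_len == 0)
        return 0;

    end = strstr(log, marker);
    if (end == NULL)
        end = log + strlen(log);

    for (p = strstr(log, needle); p != NULL && p < end;
         p = strstr(p + needle_len, needle))
        count++;
    return count;
}
```

Calling it once per function name, with the text of each kernel's dmesg, gives the per-function counts for the window between driver load and the initialization message.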


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-13 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #95 from Tom B  ---
So here's something interesting. In 5.0.13 there is no function
vega20_display_config_changed.  This function issues
smu_send_smc_msg_with_param(smu, SMU_MSG_NumOfDisplays, 0);

In fact, in 5.0.13 there is no reference at all to SMU_MSG_NumOfDisplays
anywhere in the amdgpu driver. 

Which means the number of displays is configured differently in 5.0.13, or
done with a hardcoded value instead of a constant.


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-13 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #94 from Tom B  ---
Reverting d1a3e239a6016f2bb42a91696056e223982e8538 didn't fix it for me. But
that commit may give some insight because it is related to uclk which is the
first error we get.

I also tried globally increasing usec_timeout as it's used in a few places
(patch below). This makes the PC take about a minute to boot up, so clearly the
GPU is in an invalid state before these timeouts are hit and then each
subsequent call to smum_send_msg_to_smc_with_parameter causes a delay because
each call times out. Whatever happens puts the card into a state that it can't
recover from.

The next step is to try to find where vega20_set_uclk_to_highest_dpm_level is
called from and see what happens just before the call to this function.



diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f4ac632a87b2..9b878c74b17e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2418,7 +2418,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	adev->pdev = pdev;
 	adev->flags = flags;
 	adev->asic_type = flags & AMD_ASIC_MASK;
-	adev->usec_timeout = AMDGPU_MAX_USEC_TIMEOUT;
+	adev->usec_timeout = AMDGPU_MAX_USEC_TIMEOUT*10;
 	if (amdgpu_emu_mode == 1)
 		adev->usec_timeout *= 2;
 	adev->gmc.gart_size = 512 * 1024 * 1024;
diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/hwmgr.c b/drivers/gpu/drm/amd/powerplay/hwmgr/hwmgr.c
index a7e8340baf90..a6b2bc4277ef 100644
--- a/drivers/gpu/drm/amd/powerplay/hwmgr/hwmgr.c
+++ b/drivers/gpu/drm/amd/powerplay/hwmgr/hwmgr.c
@@ -84,7 +84,7 @@ int hwmgr_early_init(struct pp_hwmgr *hwmgr)
 	if (!hwmgr)
 		return -EINVAL;
 
-	hwmgr->usec_timeout = AMD_MAX_USEC_TIMEOUT;
+	hwmgr->usec_timeout = AMD_MAX_USEC_TIMEOUT*10;
 	hwmgr->pp_table_version = PP_TABLE_V1;
 	hwmgr->dpm_level = AMD_DPM_FORCED_LEVEL_AUTO;
 	hwmgr->request_dpm_level = AMD_DPM_FORCED_LEVEL_AUTO;


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-12 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #93 from Chris Hodapp  ---
Note: It might be good for someone else to double-check my conclusion before
too much stock is put into it. Scientific method and all that.


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-12 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #92 from ReddestDream  ---
>If you follow the callstack:

I've been thinking all this over. The only thing unfortunately that really
sticks out at me still is how Chris Hodapp says that reverting this commit:

https://github.com/torvalds/linux/commit/d1a3e239a6016f2bb42a91696056e223982e8538#diff-0bc07842bc28283d64ffa6dd2ed716de

Seems to improve things. Considering that we now know from Tom B.'s work that
dpm_state.hard_min_level is apparently calculated correctly and stable the
entire time, it doesn't make sense that reverting this commit could fix
anything. 

The code seems very similar to what we see in
vega20_notify_smc_display_config_after_ps_adjustment near where we get the
"[SetHardMinFreq] Set hard min uclk failed!" Maybe this
smum_send_msg_to_smc_with_parameter gets through where others fail because of
the formatting or something?

Thanks again Tom B. for all your testing. I'd like to do some tests of my own,
but time's just not permitting for me ATM. Hoping to be more free next weekend.
:/


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-12 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #91 from ReddestDream  ---
>It returns 0 on success and -EIO on failure, which is then in turn returned 
>from vega20_set_fclk_to_highest_dpm_level. Where did you see the check/retry on 
>EINVAL? Perhaps -EIO should be -EINVAL?

I didn't find check/retry code. It was more just a thought that maybe we could
keep vega20_set_uclk_to_highest_dpm_level from just returning despite the error
and allowing further initialization to proceed. Even if it crashed, that might
even be helpful since it's not clear if it's the initialization
(drm_dev_register) or something else that is silent in the logs that is
changing something and causing vega20_set_uclk_to_highest_dpm_level to fail
where we know it succeeded so many times before.

>I'm not sure this is helpful but I managed to somewhat test the race condition 
>theory.

If there is a race, I'm not sure it's in the time the driver waits for the
hardware registers to respond and/or the value to set. But it's still
enlightening.

At this point it seems more likely that something else we aren't seeing in the
logs is breaking vega20_set_uclk_to_highest_dpm_level in the last moments
(unlikely due to the dpm_state.hard_min_level value). It then falls through,
drm_dev_register runs, and the initialization message prints. amdgpu doesn't
consider the "[SetUclkToHightestDpmLevel] Set hard min uclk failed!" to be a
significant enough error to stop initialization. But maybe it should . . .
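
What "stop initialization on this error" would mean can be sketched outside the driver. The model below is hypothetical throughout: bring_up_strict, send_msg_fn and fake_send are invented names, and the only detail borrowed from the thread is the message id 0x28 from the "Failed to send message 0x28" log line and the -EIO return value. It runs a list of setup messages and returns at the first failure instead of logging it and continuing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for sending one message to the SMU. */
typedef int (*send_msg_fn)(unsigned int msg, unsigned int param);

/*
 * Run a sequence of bring-up messages and stop at the first failure,
 * returning its error code. *completed reports how many succeeded.
 */
int bring_up_strict(send_msg_fn send, const unsigned int *msgs,
                    const unsigned int *params, int n, int *completed)
{
    int i;
    int ret;

    assert(send != NULL && msgs != NULL && params != NULL);
    assert(completed != NULL);
    *completed = 0;
    for (i = 0; i < n; i++) {
        ret = send(msgs[i], params[i]);
        if (ret != 0)
            return ret; /* propagate, e.g. -EIO */
        (*completed)++;
    }
    return 0;
}

/* Test double: fails with -EIO for message 0x28, succeeds otherwise. */
int fake_send(unsigned int msg, unsigned int param)
{
    (void)param;
    return msg == 0x28 ? -EIO : 0;
}
```

Whether failing hard like this would be better than the current behaviour is exactly the open question; the sketch only shows that the error code is already available to act on.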


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-12 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #90 from Tom B  ---
I'm not sure this is helpful but I managed to somewhat test the race condition
theory.

If you follow the callstack:

vega20_set_fclk_to_highest_dpm_level -> smum_send_msg_to_smc_with_parameter ->
vega20_send_msg_to_smc_with_parameter -> vega20_wait_for_response ->
phm_wait_for_register_unequal you find this code in smu_helper.c:

int phm_wait_on_register(struct pp_hwmgr *hwmgr, uint32_t index,
			 uint32_t value, uint32_t mask)
{
	uint32_t i;
	uint32_t cur_value;

	if (hwmgr == NULL || hwmgr->device == NULL) {
		pr_err("Invalid Hardware Manager!");
		return -EINVAL;
	}

	for (i = 0; i < hwmgr->usec_timeout; i++) {
		cur_value = cgs_read_register(hwmgr->device, index);
		if ((cur_value & mask) == (value & mask))
			break;
		udelay(1);
	}

	/* timeout means wrong logic*/
	if (i == hwmgr->usec_timeout)
		return -1;
	return 0;
}


The timeout there is interesting. I increased it.


	for (i = 0; i < hwmgr->usec_timeout*10; i++) {
		cur_value = cgs_read_register(hwmgr->device, index);
		if ((cur_value & mask) == (value & mask))
			break;
		udelay(1);
	}


The PC takes significantly longer to boot (10 or so seconds when it's usually
instant) and the error still occurs. So I'm not sure it's just a matter of
waiting.


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-12 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #89 from Tom B  ---
> It should return -EINVAL instead. Maybe then it would reset and try again 
> instead of just ignoring it and continuing with initialization anyway, 
> leading to instability.

If you look at vega20_send_msg_to_smc_with_parameter: 

static int vega20_send_msg_to_smc_with_parameter(struct pp_hwmgr *hwmgr,
		uint16_t msg, uint32_t parameter)
{
	struct amdgpu_device *adev = hwmgr->adev;
	int ret = 0;

	vega20_wait_for_response(hwmgr);

	WREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90, 0);

	WREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_82, parameter);

	vega20_send_msg_to_smc_without_waiting(hwmgr, msg);

	ret = vega20_wait_for_response(hwmgr);
	if (ret != PPSMC_Result_OK)
		pr_err("Failed to send message 0x%x, response 0x%x\n", msg, ret);

	return (ret == PPSMC_Result_OK) ? 0 : -EIO;
}


It returns 0 on success and -EIO on failure, which is then in turn returned
from vega20_set_fclk_to_highest_dpm_level. Where did you see the check/retry on
EINVAL? Perhaps -EIO should be -EINVAL?


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-12 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #88 from ReddestDream  ---
>The question then becomes: Why doesn't the race condition happen with only one 
>screen? Perhaps it's a matter of speed. With a single display, the driver can 
>detect the display, read/parse the EDID data, and initialize in time. But then 
>that doesn't explain why the crash still occurs if you boot with one 
>DisplayPort monitor and attach another after X is running.

I do suspect it's a matter of speed and complexity when you have more monitors.
Also maybe the clock it tries to set (the value of hard_min_level) is different
if you only have one monitor and somehow that takes more time (resetting it
away from some default).

I do wonder if maybe in:

"[SetUclkToHightestDpmLevel] Set hard min uclk failed!",
return ret);

It should return -EINVAL instead. Maybe then it would reset and try again
instead of just ignoring it and continuing with initialization anyway, leading
to instability.

>One thing I've been trying to work out is what the difference between 
>vega20_ppt.c and vega20_hwmgr.c is, as they both contain slightly different or 
>identical versions of the same functions. It looks like the functions in 
>vega20_hwmgr.c take precedence but it's strange to see this duplication and 
>both files are worked on in the commit history.

Hmm. That is interesting. I'll take a look.


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-12 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #87 from Tom B  ---
> Could be we've got a race condition between the powerplay setup and amdgpu
handing off the card to drm_dev_register to advertise it for normal use.

The question then becomes: Why doesn't the race condition happen with only one
screen? Perhaps it's a matter of speed. With a single display, the driver can
detect the displays, read/parse the EDID data, and initialize in time. But then
that doesn't explain why the crash still occurs if you boot with one
DisplayPort monitor and attach another after X is running.

One thing I've been trying to work out is what the difference between
vega20_ppt.c and vega20_hwmgr.c is, as they both contain slightly different or
identical versions of the same functions. It looks like the functions in
vega20_hwmgr.c take precedence, but it's strange to see this duplication, and
both files are worked on in the commit history.

Take a look at vega20_set_uclk_to_highest_dpm_level and
vega20_apply_clocks_adjust_rules in both for examples.


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-12 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #86 from ReddestDream  ---
>In addition to that, vega20_set_fclk_to_highest_dpm_level is called several 
>times before the card is initialized and even on 5.2.7 works. Something 
>happens during or just before the initialization stage that stops 
>smum_send_msg_to_smc_with_parameter accepting 1001 as a valid value, as it 
>does until that point.

Could be we've got a race condition between the powerplay setup and amdgpu
handing off the card to drm_dev_register to advertise it for normal use.

drm_dev_register is responsible for the "[drm] Initialized" message:

https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/drm_drv.c#L994

And it seems like amdgpu calls it here:

https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L1054

Odd that it's doing this if powerplay still has more work to do. And that might
be why vega20_set_uclk_to_highest_dpm_level fails that last time.


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-12 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #85 from Tom B  ---

> Yeah. I've had boots where I have my 2 4K DP monitors in and I don't get 
> powerplay error on boot. In fact, it can go a bit and seem stable.

In addition to that, vega20_set_fclk_to_highest_dpm_level is called several
times before the card is initialized and even on 5.2.7 works. Something happens
during or just before the initialization stage that stops
smum_send_msg_to_smc_with_parameter accepting 1001 as a valid value, as it does
until that point.

I think you're right about BACO, it was worth looking at but I applied a quick
hack to ensure it's disabled:

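/* Quick hack: make every BACO state change a no-op. */
int vega20_baco_set_state(struct pp_hwmgr *hwmgr, enum BACO_STATE state)
{
        return 0;
}

/* Quick hack: always report that BACO is not supported. */
int vega20_baco_get_capability(struct pp_hwmgr *hwmgr, bool *cap)
{
        *cap = false;
        return 0;
}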

No difference, I still get the errors and wrong wattage so unless BACO is
somehow on by default and only turned off in the proper version of this code,
we can rule it out.


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-12 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #84 from ReddestDream  ---
>Need to figure out what exactly what is generating the line "[drm] Initialized 
>amdgpu 3.27.0 20150101 for :44:00.0 on minor 0."

That "Initialized amdgpu" message seems to be coming from here:

https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/drm_drv.c#L994


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-12 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #83 from ReddestDream  ---
> Here's what I found: The value of hard_min_level is 1001 in both 5.0.13 and 
> 5.2.7 so the issue is not the value from the dpm table. The dpm table is 
> probably correct. 

Fantastic! Glad you tested this. I had suspected the hard_min_level was bogus
and that's why it was failing. Card was rejecting the bogus value. Glad to know
that's not the case.

> However, what is interesting is that it doesn't always fail.

Yeah. I've had boots where I have my 2 4K DP monitors in and I don't get
powerplay error on boot. In fact, it can go a bit and seem stable. But then the
powerplay errors suddenly (not related to some high load on the card) start
showing up again and the graphics become unstable. Similarly others have
reported that on hotplugging a second monitor after boot, the powerplay errors
will start showing up.

So, maybe there is a timing problem involved with sending the message. It's
generally a question of when rather than if it's going to fail.

> 1. vega20_set_fclk_to_highest_dpm_level is called twice between the "ring 
> vce2" line and "Initialized"

Is it always called twice? Even on 5.2.7? Because it looks like it might get
called two times right before "Initialized" on 5.0.13 but then only once on
5.2.7 before "Initialized" kicks in. Maybe "Initialized" is interrupting on
5.2.7 but not on 5.0.13. It's possible that Initialization of the card is
messing up values that powerplay needs to read off the card or making the card
unavailable for receiving messages or something . . .

> So initialization is happening between (and possibly a result of) sending the 
> message and getting the response

Yeah. Something is definitely happening while
vega20_set_uclk_to_highest_dpm_level is running . . . Not 100% sure that's
really problematic tho . . .  But it could be an atomicity issue. Need to
figure out exactly what is generating the line "[drm] Initialized amdgpu
3.27.0 20150101 for :44:00.0 on minor 0." Looks like it's coming from the
drm core rather than amdgpu specifically.

> I'm going to see if I can disable/revert BACO entirely to at least rule it 
> out.

I thought BACO was reverted for Vega 20 here:

https://github.com/torvalds/linux/commit/7db329e57b90ddebcb58fc88eedbb3082d22a957#diff-8a4d25be8ad5d9c3ff27bb54b678dab2

Your commit seems to have been introduced in 5.2-rc1, not 5.1.


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-12 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #82 from Tom B  ---
In addition, I will note that the file vega20_baco.c has been added in 5.1 

details: https://www.phoronix.com/scan.php?page=news_item=AMD-Vega-12-BACO


commit:
https://github.com/torvalds/linux/commit/0c5ccf14f50431d0196b96025c878ae9f45676a9#diff-c2d82e6f1326b5b4e0a09c9cb42cbcc2
 


This seems like quite a large change, and it requires a special "workaround"
for Vega 20. Unfortunately, it is part of a larger code restructure in the
driver, so I cannot just revert that single commit.

I mention this because part of the problem I am seeing is with the wrong
wattage. I wonder whether BACO wrongly tries to turn off a part of the card
that is required for a secondary monitor and as such puts the card in an
invalid state.

I'm going to see if I can disable/revert BACO entirely to at least rule it out.


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-12 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #81 from Tom B  ---
Created attachment 145038
  --> https://bugs.freedesktop.org/attachment.cgi?id=145038=edit
5.2.7 dmesg with hard_min_level logged

As mentioned in the previous post, I started logging the value of
hard_min_level. I hadn't realised that vega20_set_uclk_to_highest_dpm_level
would be called so many times.

Here's what I found: The value of hard_min_level is 1001 in both 5.0.13 and
5.2.7 so the issue is not the value from the dpm table. The dpm table is
probably correct. Something prevents smum_send_msg_to_smc_with_parameter
accepting the value.
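
As a sanity check on the value itself, here is a minimal sketch of how the
parameter is packed, based on the `(PPCLK_UCLK << 16) | hard_min_level`
expression in the patched call quoted further down in this message. The
numeric clock ID used here (5) is an assumed placeholder, not taken from the
driver headers, and the 16-bit mask and the helper names are additions for
illustration only.

```c
#include <stdint.h>

/* Assumed placeholder; the real PPCLK_UCLK value lives in the SMU headers. */
enum { EXAMPLE_PPCLK_UCLK = 5 };

/* Clock ID goes in the upper 16 bits, the frequency in the lower 16 bits. */
uint32_t pack_hard_min_param(uint32_t clk_id, uint32_t freq)
{
    return (clk_id << 16) | (freq & 0xffffu);
}

/* Recover the clock ID from a packed parameter. */
uint32_t param_clk_id(uint32_t param)
{
    return param >> 16;
}

/* Recover the frequency from a packed parameter. */
uint32_t param_freq(uint32_t param)
{
    return param & 0xffffu;
}
```

With these assumptions, 1001 fits comfortably in the low 16 bits and survives
a round trip, which is consistent with the packed parameter not being
malformed on either kernel.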

However, what is interesting is that it doesn't always fail.


[4.082105] amdgpu: [powerplay] hard_min_level: 1001
[4.372684] [drm] Initialized amdgpu 3.32.0 20150101 for :44:00.0 on minor 0
[4.517204] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[4.517205] amdgpu: [powerplay] [SetUclkToHightestDpmLevel] Set hard min uclk failed!

Each hard_min_level line in the log is from
vega20_set_uclk_to_highest_dpm_level and there are multiple calls to it, which
don't fail, before the card is initialised.


This is from 5.2.7:

[3.698907] amdgpu :44:00.0: ring vce2 uses VM inv eng 14 on hub 1
[4.082105] amdgpu: [powerplay] hard_min_level: 1001
[4.372684] [drm] Initialized amdgpu 3.32.0 20150101 for :44:00.0 on minor 0
[4.517204] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[4.517205] amdgpu: [powerplay] [SetUclkToHightestDpmLevel] Set hard min uclk failed!
[5.361482] amdgpu: [powerplay] Failed to send message 0x28, response 0x0


And the same from 5.0.13:

[3.352380] amdgpu :44:00.0: ring vce2 uses VM inv eng 14 on hub 1
[3.722422] amdgpu: [powerplay] hard_min_level: 1001
[3.766269] amdgpu: [powerplay] hard_min_level: 1001
[4.029679] [drm] Initialized amdgpu 3.27.0 20150101 for :44:00.0 on minor 0


There are a couple of things here:

1. vega20_set_fclk_to_highest_dpm_level is called twice between the "ring vce2"
line and "Initialized"

2. My patched code looks like this:

    pr_err("hard_min_level: %d\n",
           dpm_table->dpm_state.hard_min_level);

    PP_ASSERT_WITH_CODE(!(ret = smum_send_msg_to_smc_with_parameter(hwmgr,
                    PPSMC_MSG_SetHardMinByFreq,
                    (PPCLK_UCLK << 16) | dpm_table->dpm_state.hard_min_level)),
                    "[SetUclkToHightestDpmLevel] Set hard min uclk failed!",
                    return ret);

Yet the log shows:

- My debug line 
- Initialized amdgpu 3.32.0 20150101 for :44:00.0 on minor 0
- [SetUclkToHightestDpmLevel] Set hard min uclk failed!

So initialization is happening between (and possibly a result of) sending the
message and getting the response.
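
The ordering argument above can be checked mechanically against a captured
dmesg. What follows is a minimal sketch in plain C, not driver code. The
helper init_between_send_and_failure is hypothetical, it assumes each log
record has been rejoined onto a single line, and it matches on the exact
strings that appear in the logs in this comment.

```c
#include <stdio.h>
#include <string.h>

/* Parse the leading "[seconds]" dmesg timestamp; returns 0 on success. */
static int parse_ts(const char *line, double *ts)
{
    return sscanf(line, " [%lf]", ts) == 1 ? 0 : -1;
}

/*
 * Find the first "Set hard min uclk failed" record and the last
 * "hard_min_level:" debug record before it. Returns 1 if an
 * "Initialized amdgpu" record lies between them, 0 if not, and -1 if
 * either record is missing. If gap is non-NULL and both timestamps
 * parse, *gap receives the time from the debug line to the failure.
 */
int init_between_send_and_failure(const char *const *lines, int n, double *gap)
{
    int fail = -1, dbg = -1, found_init = 0;
    double t_dbg = 0.0, t_fail = 0.0;

    for (int i = 0; i < n && fail < 0; i++)
        if (strstr(lines[i], "Set hard min uclk failed"))
            fail = i;
    if (fail < 0)
        return -1;

    for (int i = fail - 1; i >= 0 && dbg < 0; i--)
        if (strstr(lines[i], "hard_min_level:"))
            dbg = i;
    if (dbg < 0)
        return -1;

    for (int i = dbg + 1; i < fail; i++)
        if (strstr(lines[i], "Initialized amdgpu"))
            found_init = 1;

    if (gap && parse_ts(lines[dbg], &t_dbg) == 0 &&
        parse_ts(lines[fail], &t_fail) == 0)
        *gap = t_fail - t_dbg;

    return found_init;
}
```

Run over the 5.2.7 excerpt, this reports the "Initialized" record inside a
window of roughly 0.44 s between the debug line and the failure. It only
shows ordering in the log, not that one event causes the other.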


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-12 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #80 from Tom B  ---
> I tried something like that before but a huge portion of the commits in that 
> range won't build kernels that can boot (at least on my system). I ended up 
> resorting to trying reverting individual vega20-affecting  commits out of 
> 5.1. See my results far above in the thread (though someone else willing to 
> spend more time doing a deeper analysis of the code could probably take my 
> approach much further).

That's why my focus has been finding places in the code where something
different happens based on the number of displays. Though this may be a futile
avenue of exploration, as it could just be an issue of additional memory
bandwidth requirements or even something that should be done differently with
2 displays but isn't.

> It does make me wonder if it's worth testing like 2 simple 1080p 60 Hz 
> displays. Maybe that wouldn't trigger this issue. Not that that would really 
> be of use to me. But it might help distinguish between just monitor detect 
> generally being broken and "high monitor load" being broken . . .

This would be an interesting test but I think 1080p 60 Hz monitors with
DisplayPort are fairly uncommon and I don't have any to test with. My guess is
anyone with a Radeon VII, a high-end card with 16 GB VRAM, is likely to have a
high-end display, which could equally explain why there are no reports here of
people running 1080p 60 Hz displays.

My next test is going to be logging dpm_table->dpm_state.hard_min_level on line
3354 (just before it's sent to the smc) on both 5.0.13 and 5.2.7 to see if the
same hard_min_level value is sent to the smc on both kernels. This will at
least let us know whether it's something that's incorrectly setting
hard_min_level or something that prevents the smc accepting the value. My hunch
from my previous tests is that it's the latter but I'll try it and report back.

I know nothing about driver development so I have no idea how this stuff should
work, I can only compare the differences between 5.0.13 and later kernels.

Anyway, thanks everyone for your input. Any information, even on things that
you tried and didn't work, is valuable as it can help us narrow down the
problem.


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-11 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #79 from ReddestDream  ---
>I tried something like that before but a huge portion of the commits in that 
>range won't build kernels that can boot (at least on my system).

It's interesting that you found d1a3e239a6016f2bb42a91696056e223982e8538 to
improve the issue:

https://github.com/torvalds/linux/commit/d1a3e239a6016f2bb42a91696056e223982e8538#diff-0bc07842bc28283d64ffa6dd2ed716de

From Tom B.'s and my review of the code, it seems very likely that somehow a
failure to set a hard minimum properly is at the heart of the issue.

>This brings me to the second thing: When looking through the commits, I 
>noticed that there were multiple commits that claim to prevent or reduce 
>crashing in high-resolution situations (one references 5k displays, another 
>references 3+ 4k displays).

Yeah. I have 2 4K displays as well. But I don't think it should really be
straining the card. These commits are probably overzealous for Radeon VII.
Rather it could be that at least part of the issue, especially the excessive
power draw at idle, is just due to these commits artificially setting minimums
very high. In fact, that could be why it's stable at all with just one monitor,
since the code to set the minimums up is only being triggered when there are
more monitors connected.

I'd suspect a boottime configuration issue too, but others have reported
instability even when the monitors are hotplugged later on. So, it seems like
maybe the monitor detect might at least partially be okay, but the
follow-through with raising the clock minimums is broken. I suspect the issue
is in the code calculating the minimum to set, so the driver gets stuck trying
to send incomplete/incorrect values to the card.

https://bbs.archlinux.org/viewtopic.php?id=247733

It does make me wonder if it's worth testing like 2 simple 1080p 60 Hz
displays. Maybe that wouldn't trigger this issue. Not that that would really be
of use to me. But it might help distinguish between just monitor detect
generally being broken and "high monitor load" being broken . . .


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-11 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #78 from Chris Hodapp  ---
> I don't see anywhere else to go but bisection from 5.0.13 to 5.1. That should 
> at least find something . . .

I tried something like that before but a huge portion of the commits in that
range won't build kernels that can boot (at least on my system). I ended up
resorting to reverting individual vega20-affecting commits out of 5.1.
See my results far above in the thread (though someone else willing to spend
more time doing a deeper analysis of the code could probably take my approach
much further).


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-11 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #77 from ReddestDream  ---
>I guess, you are good for a bisection if you have a "working" kernel.

Thing is, based on everything here, I'm not convinced that 5.0.13 has zero
issues, only that it seems to have fewer. But yeah. I don't see anywhere else to
go but bisection from 5.0.13 to 5.1. That should at least find something . . .
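
For reference, the search a bisection performs can be sketched as below. This
is a generic binary search in plain C, not a kernel tool. Here first_bad_commit
and example_is_bad are hypothetical names, and the commit indices are made up.
It assumes the "good" end really is good and that the bug reproduces
deterministically, neither of which is certain here given that the failure
does not always occur and 5.0.13 may not be fully clean.

```c
/* Predicate type: returns nonzero if the kernel built at this index is bad. */
typedef int (*is_bad_fn)(int commit_index);

/* Hypothetical example: pretend the regression landed at index 137. */
int example_is_bad(int commit_index)
{
    return commit_index >= 137;
}

/*
 * Precondition: `good` is a known-good index, `bad` a known-bad index,
 * and good < bad. Halves the range until the two are adjacent, then
 * returns the first bad index.
 */
int first_bad_commit(int good, int bad, is_bad_fn is_bad)
{
    while (bad - good > 1) {
        int mid = good + (bad - good) / 2;

        if (is_bad(mid))
            bad = mid;
        else
            good = mid;
    }
    return bad;
}
```

Each test halves the remaining range, so about ten builds cover a thousand
commits. A single wrong "good" verdict on an intermittent bug sends the search
into the wrong half, so each step may need several boots before it is judged.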


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-11 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #76 from Sylvain BERTRAND  ---
> Unfortunately, it does look like going through and slowly disabling features
> and/or bisecting might be the only way to find how this issue got started. At
> least if we could narrow it down, we might be in better shape. :/

I guess, you are good for a bisection if you have a "working" kernel.


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-11 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #75 from ReddestDream  ---
>Here's some additional investigation.

>[SetUclkToHightestDpmLevel] Set hard min uclk failed! Appears as one of the 
>first errors in dmesg. This is from vega20_hwmgr.c:3354 and triggered by:

I agree that [SetUclkToHightestDpmLevel] is probably the key to all this as it
always seems to be the first thing that fails after dysregulation occurs. The
"Failed to send message 0x28, response 0x0" errors show that the driver is
sending wrong or at least wrongly timed commands to the GPU that eventually
cascade into complete failure.

>Again, it didn't help. I will note that this code is identical in 5.0.13 

I have also been unable to find changed code since 5.0 that could be directly
connected to display detect/init/enumeration issues on Radeon VII/Vega 20. This
is why I've come to suspect the error is triggered indirectly in a way that
will probably not be obvious and by code that was likely flawed from the
beginning of Radeon VII/Vega 20 support.

This is also why I was hopeful that 5.3-rc2 would fix this issue since it has
commits that do seem to affect display detection on AMD GPUs. Alas, it did not.
:(

>If the GPU did not crash with dpm disabled as a whole, the proper way to
proceed would be to start from there and step by step add dpm features and see
when it starts crashing. It's not a small task since dpm code paths may be
scattered all over the code.

Unfortunately, it does look like going through and slowly disabling features
and/or bisecting might be the only way to find how this issue got started. At
least if we could narrow it down, we might be in better shape. :/

I must admit I don't have much experience with graphics drivers and when I tell
other people about this issue, they immediately want to blame X or Mesa until I
explain that I can get these errors w/o starting any graphics at all. lol.

In any case, I really appreciate your testing Tom B. And any advice you might
have on debugging, Sylvain BERTRAND, is greatly appreciated. :)


[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

2019-08-11 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #74 from Sylvain BERTRAND  ---
Forcing the memory clock and voltage is not enough: the dc[en]x memory requests
should also be given the highest priority in the arbiter block. I don't recall
how it interacts with the dc[en]x watermarks, but they should be "disabled" or
"maxed out". Basically, whatever the 3D/compute/(vcn|vce/uvd) load, the dc[en]x
will always come first (due to the realtime nature of display data transmission
to monitors). Oh and of course, the smu/smc should not manage the dc[en]x. Very
probably, there are some smc/smu commands to do that.

If the GPU did not crash with dpm disabled as a whole, the proper way to
proceed would be to start from there and step by step add dpm features and see
when it starts crashing. It's not a small task since dpm code paths may be
scattered all over the code.
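
The step-by-step procedure can be sketched as a loop over a feature bitmask.
This is plain C for illustration only: first_bad_feature and example_is_stable
are hypothetical names, the bit positions do not correspond to real dpm
features, and on real hardware each call to the stability test would be a
reboot and a manual check rather than a function call.

```c
#include <stdint.h>

/* Predicate type: returns nonzero if the system is stable with this mask. */
typedef int (*stable_fn)(uint32_t feature_mask);

/* Hypothetical example: unstable only when bits 1 and 4 are both enabled. */
int example_is_stable(uint32_t feature_mask)
{
    uint32_t both = (UINT32_C(1) << 1) | (UINT32_C(1) << 4);

    return (feature_mask & both) != both;
}

/*
 * Starting from `base` (assumed stable), enable the bits in `candidates`
 * one at a time, keeping every bit that tests stable. Returns the index
 * of the first bit whose addition makes the test fail, or -1 if all
 * candidate bits can be enabled.
 */
int first_bad_feature(uint32_t base, uint32_t candidates, stable_fn is_stable)
{
    uint32_t mask = base;

    for (int bit = 0; bit < 32; bit++) {
        uint32_t flag = UINT32_C(1) << bit;

        if (!(candidates & flag))
            continue;
        if (!is_stable(mask | flag))
            return bit;
        mask |= flag;
    }
    return -1;
}
```

Because earlier bits stay enabled, the loop also catches a failure that needs
a combination of features, though it only reports the last feature added.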

