You have been subscribed to a public bug:

Apologies in advance if this isn't the right place to file this bug.
Please let me know if I should report this elsewhere or if there's any
other info I can add.

What I suspect is an amdgpu driver issue has been causing display issues
and machine crashes. Once the issue starts, the display won't come back
from being blank, and turning the machine off takes five minutes or
longer. Additionally, executing `sensors` hangs on reading data from the
GPU and can even cause 100% CPU utilization for multiple minutes. I
believe this causes significant delays to other system calls, as the
entire machine starts to behave sluggishly in spurts where every program
hangs and then many things happen all at once. I can't reliably trigger
the issue, although repeatedly reading `sensors` in a loop seems to be
one method that eventually works.
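
The loop described above can be sketched as follows. This is a minimal
sketch, not the exact procedure used for this report: the function
names, the iteration count, and the 2-second "slow" threshold are all
illustrative assumptions. Note also that if the child process is stuck
in uninterruptible sleep inside the driver, the timeout may not be able
to reap it promptly.

```python
import subprocess
import time


def time_command(cmd, timeout_s=30.0):
    """Run cmd once and return (elapsed_seconds, timed_out)."""
    start = time.monotonic()
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,
        )
        timed_out = False
    except subprocess.TimeoutExpired:
        timed_out = True
    return time.monotonic() - start, timed_out


def stress(cmd, iterations=1000, slow_threshold_s=2.0, timeout_s=30.0):
    """Run cmd repeatedly and stop at the first slow or timed-out run.

    Returns the 1-based iteration where the slowdown was seen, or None
    if every run finished within slow_threshold_s.
    """
    for i in range(1, iterations + 1):
        elapsed, timed_out = time_command(cmd, timeout_s)
        if timed_out or elapsed > slow_threshold_s:
            print(f"iteration {i}: {elapsed:.1f}s (timed_out={timed_out})")
            return i
    return None


# Example call, assuming lm-sensors is installed and `sensors` is on PATH:
# stress(["sensors"])
```

Stopping at the first slow run makes it easier to match the timestamp
against the journal messages.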

The main symptom other than the behavior described above is a cycle of
messages in the journal of the form

kernel: amdgpu:
    last message was failed ret is 0
kernel: amdgpu:
    failed to send message 145 ret is 0
kernel: amdgpu:
    last message was failed ret is 0
kernel: amdgpu:
    failed to send message 146 ret is 0

The messages sent are 145, 146, 5e, and 148.
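
To tally which message IDs show up in a saved journal excerpt, something
like the sketch below could be used. It assumes the log text has the
same wording as the lines quoted above and that the IDs are hexadecimal
(suggested by "5e"); the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the "failed to send message <id> ret is <n>" lines quoted above.
FAILED_SEND = re.compile(r"failed to send message ([0-9a-fA-F]+) ret is")


def count_failed_messages(log_text):
    """Return a Counter mapping message ID (as logged) to occurrence count."""
    return Counter(m.lower() for m in FAILED_SEND.findall(log_text))


sample = (
    "kernel: amdgpu:\n"
    "    last message was failed ret is 0\n"
    "kernel: amdgpu:\n"
    "    failed to send message 145 ret is 0\n"
    "kernel: amdgpu:\n"
    "    last message was failed ret is 0\n"
    "kernel: amdgpu:\n"
    "    failed to send message 146 ret is 0\n"
)
print(count_failed_messages(sample))
```

The "last message was failed" lines carry no ID, so they are
deliberately not counted.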

I've had this GPU for 5 years without any of these problems. However, I
only recently upgraded from an older Intel CPU to a Ryzen 5600 and an
ASRock B550M Pro4. I have no idea if the issues are related to the
upgrade or how they could be.

Some misc info.
Distro: Ubuntu 22.04.1 LTS x86_64
Kernel: 5.15.0-48-generic
Graphics card: Radeon R9 380X
Motherboard: ASRock B550M Pro4 (P2.30 BIOS)
CPU: Ryzen 5600
Desktop: i3 (with picom compositor)

I'm attaching the relevant logs from my most recent boot when I was able
to boot and use the machine for several hours. I left the machine to
blank its screen and when I returned, I was unable to unblank the
screen. The only thing I could do was press the power button and leave
the machine to shut down over the course of the next 5 or 10 minutes.

The only other thing that is interesting is that on startup I'm seeing
the following warning about undefined behavior.

UBSAN: shift-out-of-bounds in /build/linux-kQ6jNR/linux-5.15.0/drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device_queue_manager.c:997:32
shift exponent 64 is too large for 64-bit type 'long long unsigned int'
CPU: 10 PID: 483 Comm: systemd-udevd Not tainted 5.15.0-48-generic #54-Ubuntu
Hardware name: To Be Filled By O.E.M. B550M Pro4/B550M Pro4, BIOS P2.30 02/24/2022
Call Trace:
 <TASK>
 show_stack+0x52/0x5c
 dump_stack_lvl+0x4a/0x63
 dump_stack+0x10/0x16
 ubsan_epilogue+0x9/0x49
 __ubsan_handle_shift_out_of_bounds.cold+0x61/0xef
 initialize_nocpsch.cold+0x15/0x59 [amdgpu]
 device_queue_manager_init+0x20b/0x3b0 [amdgpu]
 kgd2kfd_device_init.cold+0x1af/0x483 [amdgpu]
 amdgpu_amdkfd_device_init+0x135/0x170 [amdgpu]
 amdgpu_device_ip_init+0x681/0x6a4 [amdgpu]
loop33: detected capacity change from 0 to 8
 amdgpu_device_init.cold+0x25b/0x7db [amdgpu]
 ? do_pci_enable_device+0xdb/0x110
 amdgpu_driver_load_kms+0x1e/0x270 [amdgpu]
 amdgpu_pci_probe+0x1ce/0x260 [amdgpu]
 local_pci_probe+0x4b/0x90
 pci_device_probe+0x119/0x1f0
 really_probe+0x222/0x420
 __driver_probe_device+0x119/0x190
 driver_probe_device+0x23/0xc0
 __driver_attach+0xbd/0x1e0
 ? __device_attach_driver+0x120/0x120
 bus_for_each_dev+0x7e/0xd0
 driver_attach+0x1e/0x30
 bus_add_driver+0x148/0x220
 driver_register+0x95/0x100
 __pci_register_driver+0x68/0x70
 amdgpu_init+0x7c/0x1000 [amdgpu]
 ? 0xffffffffc1a40000
 do_one_initcall+0x48/0x1e0
 ? kmem_cache_alloc_trace+0x19e/0x2e0
 do_init_module+0x52/0x260
 load_module+0xacd/0xbc0
 __do_sys_finit_module+0xbf/0x120
 __x64_sys_finit_module+0x18/0x20
 do_syscall_64+0x5c/0xc0
 ? syscall_exit_to_user_mode+0x27/0x50
 ? __x64_sys_newfstatat+0x1c/0x30
 ? do_syscall_64+0x69/0xc0
 ? __x64_sys_mmap+0x33/0x50
 ? do_syscall_64+0x69/0xc0
 ? do_syscall_64+0x69/0xc0
 entry_SYSCALL_64_after_hwframe+0x61/0xcb
RIP: 0033:0x7f06f3fb9a3d
Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c3 a3 0f 00 f7 d8 64 89 01 48
RSP: 002b:00007ffc7ce54ae8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
RAX: ffffffffffffffda RBX: 0000556c9ab3e3d0 RCX: 00007f06f3fb9a3d
RDX: 0000000000000000 RSI: 00007f06f4150441 RDI: 000000000000001a
RBP: 0000000000020000 R08: 0000000000000000 R09: 0000000000000002
R10: 000000000000001a R11: 0000000000000246 R12: 00007f06f4150441
R13: 0000556c9aa05fb0 R14: 0000556c9ab40460 R15: 0000556c9ab35150
 </TASK>

It's even visible on screen before the splash screen appears. I don't
remember seeing this before the motherboard/CPU upgrade.
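
For context on the warning text: in C, shifting a 64-bit value by 64 or
more is undefined behavior, which is what UBSAN is flagging. I have not
checked which expression sits at line 997, so the following is only an
illustration under an assumption: it models a mask computed in the style
of `~0ULL >> (64 - n)`, where a count of n = 0 yields a shift exponent
of 64. All names here are hypothetical and nothing is taken from the
kernel source.

```python
WIDTH = 64  # bit width of 'long long unsigned int' named in the warning


def mask_shift_exponent(n, width=WIDTH):
    """Exponent a C expression like ~0ULL >> (width - n) would use."""
    return width - n


def shift_is_defined(exponent, width=WIDTH):
    """In C, a shift is only defined for 0 <= exponent < width."""
    return 0 <= exponent < width


def low_bits_mask(n, width=WIDTH):
    """Mask with the n lowest bits set, valid for every 0 <= n <= width."""
    if not 0 <= n <= width:
        raise ValueError("n must be between 0 and width")
    return (1 << n) - 1


# With n = 0 the assumed expression would shift by 64, matching the
# "shift exponent 64" in the warning; the guarded form has no such case.
print(mask_shift_exponent(0), shift_is_defined(mask_shift_exponent(0)))
print(low_bits_mask(0), hex(low_bits_mask(64)))
```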

I haven't tried to trigger the issue from a bootable USB, but I can
confirm that the warning about undefined behavior is present there as
well.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: bot-comment jammy
-- 
amdgpu driver hangs periodically, causes display to permanently crash
https://bugs.launchpad.net/bugs/1990323
You received this bug notification because you are a member of Kernel Packages, 
which is subscribed to linux in Ubuntu.

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
