Bug#1053864: libdrm-amdgpu1: gpu crash on graphics start with Radeon 760M (both sway and gdm3)

2023-11-12 Thread Simon Heath
Oops, my bad.  I was wondering why I hadn't seen it go through on the bug 
report...


The issue is still present in apt package linux-image-6.5.0-3 (kernel 
6.5.8-1) and linux-image-6.5.0-4 (kernel 6.5.10-1). Same messages, as 
far as I can see, but here's the dmesg output from the 6.5.10-1 kernel 
in case there's something subtly different.
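
A minimal sketch of one way to check the "same messages" impression, assuming each dmesg line has first been rejoined onto a single line (the mail client wraps them). It strips the leading timestamps and compares the distinct messages from two boots. The names `normalize` and `message_diff` are made up for illustration and are not part of any tool mentioned in this report.

```python
import re

# Matches the leading "[   23.808871] " timestamp that dmesg prints.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")

def normalize(lines):
    """Strip timestamps and collapse whitespace so two boots can be compared."""
    out = []
    for line in lines:
        text = TIMESTAMP.sub("", line).strip()
        if text:
            out.append(" ".join(text.split()))
    return out

def message_diff(old_lines, new_lines):
    """Return (only_in_old, only_in_new) as sorted lists of distinct messages."""
    old, new = set(normalize(old_lines)), set(normalize(new_lines))
    return sorted(old - new), sorted(new - old)

if __name__ == "__main__":
    old = ["[   26.625820] amdgpu :c1:00.0: amdgpu: GPU reset begin!",
           "[   27.880086] amdgpu :c1:00.0: amdgpu: MODE2 reset"]
    new = ["[   23.809592] amdgpu :c1:00.0: amdgpu: GPU reset begin!",
           "[   25.061023] amdgpu :c1:00.0: amdgpu: MODE2 reset"]
    print(message_diff(old, new))  # ([], [])
```

Comparing sets ignores how often a message repeats, so it would only flag messages that are new or missing, not a change in the number of retries.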


Thanks,
Simon





[    7.490078] ucsi_acpi USBC000:00: ucsi_handle_connector_change: GET_CONNECTOR_STATUS failed (-5)

[    7.605873] ucsi_acpi USBC000:00: possible UCSI driver bug 1
[    7.605903] ucsi_acpi USBC000:00: ucsi_handle_connector_change: GET_CONNECTOR_STATUS failed (-22)
[   13.555707] pipewire[1065]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[   23.808871] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=23, emitted seq=25
[   23.809320] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0

[   23.809592] amdgpu :c1:00.0: amdgpu: GPU reset begin!
[   23.990678] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   23.990842] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   24.124228] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   24.124374] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   24.257754] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   24.257918] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   24.391326] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   24.391555] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   24.525068] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   24.525211] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   24.658617] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   24.658758] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   24.792155] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   24.792326] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   24.925815] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   24.925961] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   25.059344] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   25.059488] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue

[   25.061023] amdgpu :c1:00.0: amdgpu: MODE2 reset
[   25.090107] amdgpu :c1:00.0: amdgpu: GPU reset succeeded, trying to resume
[   25.090767] [drm] PCIE GART of 512M enabled (table at 0x00801FD0).

[   25.090889] amdgpu :c1:00.0: amdgpu: SMU is resuming...
[   25.092526] amdgpu :c1:00.0: amdgpu: SMU is resumed successfully!
[   25.094267] [drm] DMUB hardware initialized: version=0x08000E00
[   25.101834] [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:264
[   25.104428] [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:272
[   25.107025] [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:280
[   25.109617] [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:288
[   25.117187] [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:264
[   25.119782] [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:272
[   25.122380] [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:280
[   25.124993] [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:288

[   25.534004] [drm] kiq ring mec 3 pipe 1 q 0
[   25.536314] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[   25.536470] amdgpu :c1:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[   25.537196] amdgpu :c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[   25.537200] amdgpu :c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[   25.537202] amdgpu :c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[   25.537204] amdgpu :c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[   25.537206] amdgpu :c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[   25.537208] amdgpu :c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[   25.537210] amdgpu :c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[   25.537212] 

Bug#1053864: libdrm-amdgpu1: gpu crash on graphics start with Radeon 760M (both sway and gdm3)

2023-11-08 Thread Diederik de Haas
Control: tag -1 moreinfo

On Fri, 13 Oct 2023 00:47:57 -0400 Simon Heath wrote:
> Package: libdrm-amdgpu1
> Version: 2.4.115-1
> 
> When GDM3 starts, or when I turn it off and log into the console by hand
> and then start sway or another WM, often the graphics mode switch will
> hang for a few seconds on an unresponsive black screen, then go back to
> a text console for an instant and try again.  This seems to repeat 0-3
> times until eventually it works successfully.  Sometimes it works on the
> first try, often on the second try, etc.
> 
> Once Sway or GDM3 and Xorg have actually started, it *seems* perfectly
> stable, as far as I've seen so far.
> 
> I also see the following errors in dmesg associated with the
> apparent-crash-and-restart:
> 
> [   26.625039] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=23, emitted seq=25
> [   26.625482] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
> [   26.625820] amdgpu :c1:00.0: amdgpu: GPU reset begin!
> [   26.810595] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
> [   26.810761] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
> ...
> Kernel: Linux 6.5.0-1-amd64 (SMP w/12 CPU threads; PREEMPT)

Those messages are actually from the kernel driver.
Can you test whether the issue is still present with kernel 6.5.8-1 (Testing)
and if so, also try it with 6.5.10-1 from Unstable?
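
When reporting back, it helps to confirm which Debian kernel package version is actually running, since the package name (e.g. linux-image-6.5.0-4) and the package version (e.g. 6.5.10-1) differ. Below is a rough sketch, assuming the `uname -v` string on a Debian kernel contains the word "Debian" followed by the package version. The helper name `debian_kernel_version` is made up for illustration.

```python
import platform
import re

def debian_kernel_version(uname_v=None):
    """Extract the Debian package version (e.g. "6.5.10-1") from `uname -v` text.

    Assumes the version string contains "Debian <version>"; returns None
    when that pattern is absent (for example on a non-Debian kernel).
    """
    text = platform.version() if uname_v is None else uname_v
    match = re.search(r"\bDebian\s+(\S+)", text)
    return match.group(1) if match else None

if __name__ == "__main__":
    # On the running system this reads the same text that `uname -v` prints.
    print(debian_kernel_version())
```

Called with no argument, it inspects the running kernel. Passing a string lets the same check be applied to a version line copied out of a log.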



Bug#1053864: libdrm-amdgpu1: gpu crash on graphics start with Radeon 760M (both sway and gdm3)

2023-10-12 Thread Simon Heath
Package: libdrm-amdgpu1
Version: 2.4.115-1
Severity: normal
X-Debbugs-Cc: ice...@dreamquest.io

Dear Maintainer,

When GDM3 starts, or when I turn it off and log into the console by hand
and then start sway or another WM, often the graphics mode switch will
hang for a few seconds on an unresponsive black screen, then go back to
a text console for an instant and try again.  This seems to repeat 0-3
times until eventually it works successfully.  Sometimes it works on the
first try, often on the second try, etc.

Once Sway or GDM3 and Xorg have actually started, it *seems* perfectly
stable, as far as I've seen so far.

This is a brand new GPU chipset afaik so graphics bugs are pretty
understandable.

CPU: AMD Ryzen 5 7640U w/ Radeon 760M Graphics
Extended renderer info from `glxinfo`:
Device: AMD Radeon Graphics (gfx1103_r1, LLVM 16.0.6, DRM 3.54, 6.5.0-1-amd64) (0x15bf)
Version: 23.2.1

I also see the following errors in dmesg associated with the
apparent-crash-and-restart:

[   26.625039] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=23, emitted seq=25
[   26.625482] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[   26.625820] amdgpu :c1:00.0: amdgpu: GPU reset begin!
[   26.810595] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   26.810761] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   26.944169] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   26.944310] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   27.077693] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   27.077834] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   27.211163] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   27.211303] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   27.344634] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   27.344776] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   27.478028] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   27.478175] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   27.611499] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   27.611640] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   27.744960] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   27.745097] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   27.878425] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   27.878564] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   27.880086] amdgpu :c1:00.0: amdgpu: MODE2 reset
[   27.909811] amdgpu :c1:00.0: amdgpu: GPU reset succeeded, trying to resume
[   27.910426] [drm] PCIE GART of 512M enabled (table at 0x00801FD0).
[   27.910540] amdgpu :c1:00.0: amdgpu: SMU is resuming...
[   27.911480] amdgpu :c1:00.0: amdgpu: SMU is resumed successfully!
[   27.913327] [drm] DMUB hardware initialized: version=0x08000E00
[   27.918776] [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:264
[   27.921376] [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:272
[   27.923969] [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:280
[   27.926566] [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:288
[   27.934650] [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:264
[   27.937248] [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:272
[   27.939841] [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:280
[   27.942439] [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:288
[   28.328853] [drm] kiq ring mec 3 pipe 1 q 0
[   28.331133] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[   28.331252] amdgpu :c1:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[   28.331965] amdgpu :c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[   28.331968] amdgpu :c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[   28.331971] amdgpu :c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[   28.331973] amdgpu :c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[   28.331975] amdgpu