amdgpu vs kexec

Peter Zijlstra Mon, 16 Jun 2025 07:06:02 -0700

Hi guys,

My (Intel Sapphire Rapids) workstation has an RX 7800 XT, and when I kexec
a bunch of times, the amdgpu driver gets upset and barfs on boot.


It starts like so:

[   16.926489] amdgpu 0000:19:00.0: amdgpu: Found VCN firmware Version ENC: 1.23 DEC: 9 VEP: 0 Revision: 16
[   16.980590] amdgpu 0000:19:00.0: amdgpu: reserve 0xa700000 from 0x83e0000000 for PSP TMR
[   19.204585] amdgpu 0000:19:00.0: amdgpu: failed to load ucode SMC(0x32)
[   19.227333] amdgpu 0000:19:00.0: amdgpu: psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x0)
[   19.256420] amdgpu 0000:19:00.0: amdgpu: PSP load smu failed!
[   19.467875] [drm:psp_v13_0_ring_destroy [amdgpu]] *ERROR* Fail to stop psp ring
[   19.491771] amdgpu 0000:19:00.0: amdgpu: PSP firmware loading failed
[   19.513372] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
[   19.540397] amdgpu 0000:19:00.0: amdgpu: amdgpu_device_ip_init failed
[   19.562177] amdgpu 0000:19:00.0: amdgpu: Fatal error during GPU init
[   19.583785] amdgpu 0000:19:00.0: amdgpu: amdgpu: finishing device.
[   19.605474] ------------[ cut here ]------------
[   19.615370] WARNING: CPU: 0 PID: 704 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:631 amdgpu_irq_put+0x46/0x70 [amdgpu]
[   19.638375] Modules linked in: rndis_host hid_generic cdc_ether usbhid usbnet mii hid amdgpu(+) amdxcp gpu_sched drm_panel_backlight_quirks drm_buddy drm_ttm_helper ttm video wmi drm_exec drm_suballoc_helper drm_display_helper ast ahci cec rc_core iTCO_wdt libahci drm_shmem_helper xhci_pci drm_client_lib intel_pmc_bxt libata xhci_hcd iTCO_vendor_support igb nvme watchdog drm_kms_helper idxd atlantic intel_lpss_pci usbcore scsi_mod i2c_algo_bit drm nvme_core i2c_i801 idxd_bus crc16 intel_lpss macsec dca i2c_smbus idma64 scsi_common ucsi_acpi typec_ucsi typec roles usb_common pinctrl_alderlake button efivarfs
[   19.754852] CPU: 0 UID: 0 PID: 704 Comm: kworker/0:5 Not tainted 6.15.0-dirty #51 PREEMPT(full)
[   19.773770] Hardware name: Supermicro SYS-531A-I/X13SRA-TF, BIOS 1.1b 08/01/2023
[   19.789693] Workqueue: events work_for_cpu_fn
[   19.799066] RIP: 0010:amdgpu_irq_put+0x46/0x70 [amdgpu]
[   19.810480] Code: c0 74 33 48 8b 4e 10 48 83 39 00 74 29 89 d1 48 8d 04 88 8b 08 85 c9 74 11 f0 ff 08 74 07 31 c0 c3 cc cc cc cc e9 5a fd ff ff <0f> 0b b8 ea ff ff ff c3 cc cc cc cc b8 ea ff ff ff c3 cc cc cc cc
[   19.851066] RSP: 0018:ff55eefd81aafd48 EFLAGS: 00010246
[   19.862314] RAX: ff466ca3653aac00 RBX: ff466ca2d7f98b40 RCX: 0000000000000000
[   19.877675] RDX: 0000000000000000 RSI: ff466ca2d7fa5990 RDI: ff466ca2d7f80000
[   19.893037] RBP: ff466ca2d7f90388 R08: 0000000000000000 R09: ff55eefd81aafb10
[   19.908401] R10: ff466cc1ffcd2fa8 R11: 0000000000000003 R12: ff466ca2d7f90830
[   19.923763] R13: ff466ca2d7f80010 R14: ff466ca2d7f80000 R15: ff466ca2d7fa5990
[   19.939132] FS:  0000000000000000(0000) GS:ff466cc1db2ee000(0000) knlGS:0000000000000000
[   19.956551] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   19.968920] CR2: 00007f45f54e3de8 CR3: 000000207e624003 CR4: 0000000000f71ef0
[   19.984282] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   19.999645] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[   20.015010] PKRU: 55555554
[   20.020820] Call Trace:
[   20.026075]  <TASK>
[   20.030581]  amdgpu_fence_driver_hw_fini+0xfc/0x130 [amdgpu]
[   20.042894]  amdgpu_device_fini_hw+0xb7/0x2c6 [amdgpu]
[   20.054152]  amdgpu_driver_load_kms.cold+0x18/0x2e [amdgpu]
[   20.066323]  amdgpu_pci_probe+0x1cf/0x470 [amdgpu]
[   20.076775]  local_pci_probe+0x42/0x90
[   20.084839]  work_for_cpu_fn+0x17/0x30
[   20.092899]  process_one_work+0x188/0x340
[   20.101523]  worker_thread+0x256/0x3a0
[   20.109584]  ? __pfx_worker_thread+0x10/0x10
[   20.118767]  kthread+0xf9/0x240
[   20.125519]  ? __pfx_kthread+0x10/0x10
[   20.133578]  ret_from_fork+0x31/0x50
[   20.141268]  ? __pfx_kthread+0x10/0x10
[   20.149326]  ret_from_fork_asm+0x1a/0x30
[   20.157765]  </TASK>
[   20.162457] ---[ end trace 0000000000000000 ]---

and then continues to barf for a while longer. Full dmesg attached.

When I do a full power cycle it's okay again for a few kexecs, but it
will ultimately go unhappy again.

I'm doing a 'normal' systemctl kexec, which I figure should more or less
shut things down normally. It's not like a crash-kexec -- which is a
whole other story and can be expected to cause trouble.
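
For anyone trying to reproduce this, a minimal sketch of checking a boot
log for the failure signature shown above. The marker strings are copied
from the log excerpt; the function name and the approach of scanning
dmesg text are illustrative assumptions, not something the driver or any
tool provides.

```python
import re

# Failure markers copied from the log excerpt in this report.
# Hypothetical helper: nothing here is an amdgpu or kexec API.
FAILURE_MARKERS = (
    "PSP load smu failed!",
    "PSP firmware loading failed",
    "Fatal error during GPU init",
)

# Matches the leading "[   19.256420] " timestamp that dmesg prints.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def find_amdgpu_failures(dmesg_text):
    """Return amdgpu log lines (timestamps stripped) that contain a failure marker."""
    hits = []
    for line in dmesg_text.splitlines():
        if "amdgpu" not in line:
            continue
        if any(marker in line for marker in FAILURE_MARKERS):
            hits.append(TIMESTAMP.sub("", line))
    return hits
```

Feeding it the saved dmesg text after each kexec and checking for a
non-empty result would show on which boot the PSP load first fails.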

dmesg.gz
Description: application/gzip
