Hi guys,

My (Intel Sapphire Rapids) workstation has an RX 7800 XT, and when I kexec a bunch of times, the amdgpu driver gets upset and barfs on boot.
It starts like so:

[ 16.926489] amdgpu 0000:19:00.0: amdgpu: Found VCN firmware Version ENC: 1.23 DEC: 9 VEP: 0 Revision: 16
[ 16.980590] amdgpu 0000:19:00.0: amdgpu: reserve 0xa700000 from 0x83e0000000 for PSP TMR
[ 19.204585] amdgpu 0000:19:00.0: amdgpu: failed to load ucode SMC(0x32)
[ 19.227333] amdgpu 0000:19:00.0: amdgpu: psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x0)
[ 19.256420] amdgpu 0000:19:00.0: amdgpu: PSP load smu failed!
[ 19.467875] [drm:psp_v13_0_ring_destroy [amdgpu]] *ERROR* Fail to stop psp ring
[ 19.491771] amdgpu 0000:19:00.0: amdgpu: PSP firmware loading failed
[ 19.513372] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
[ 19.540397] amdgpu 0000:19:00.0: amdgpu: amdgpu_device_ip_init failed
[ 19.562177] amdgpu 0000:19:00.0: amdgpu: Fatal error during GPU init
[ 19.583785] amdgpu 0000:19:00.0: amdgpu: amdgpu: finishing device.
[ 19.605474] ------------[ cut here ]------------
[ 19.615370] WARNING: CPU: 0 PID: 704 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:631 amdgpu_irq_put+0x46/0x70 [amdgpu]
[ 19.638375] Modules linked in: rndis_host hid_generic cdc_ether usbhid usbnet mii hid amdgpu(+) amdxcp gpu_sched drm_panel_backlight_quirks drm_buddy drm_ttm_helper ttm video wmi drm_exec drm_suballoc_helper drm_display_helper ast ahci cec rc_core iTCO_wdt libahci drm_shmem_helper xhci_pci drm_client_lib intel_pmc_bxt libata xhci_hcd iTCO_vendor_support igb nvme watchdog drm_kms_helper idxd atlantic intel_lpss_pci usbcore scsi_mod i2c_algo_bit drm nvme_core i2c_i801 idxd_bus crc16 intel_lpss macsec dca i2c_smbus idma64 scsi_common ucsi_acpi typec_ucsi typec roles usb_common pinctrl_alderlake button efivarfs
[ 19.754852] CPU: 0 UID: 0 PID: 704 Comm: kworker/0:5 Not tainted 6.15.0-dirty #51 PREEMPT(full)
[ 19.773770] Hardware name: Supermicro SYS-531A-I/X13SRA-TF, BIOS 1.1b 08/01/2023
[ 19.789693] Workqueue: events work_for_cpu_fn
[ 19.799066] RIP: 0010:amdgpu_irq_put+0x46/0x70 [amdgpu]
[ 19.810480] Code: c0 74 33 48 8b 4e 10 48 83 39 00 74 29 89 d1 48 8d 04 88 8b 08 85 c9 74 11 f0 ff 08 74 07 31 c0 c3 cc cc cc cc e9 5a fd ff ff <0f> 0b b8 ea ff ff ff c3 cc cc cc cc b8 ea ff ff ff c3 cc cc cc cc
[ 19.851066] RSP: 0018:ff55eefd81aafd48 EFLAGS: 00010246
[ 19.862314] RAX: ff466ca3653aac00 RBX: ff466ca2d7f98b40 RCX: 0000000000000000
[ 19.877675] RDX: 0000000000000000 RSI: ff466ca2d7fa5990 RDI: ff466ca2d7f80000
[ 19.893037] RBP: ff466ca2d7f90388 R08: 0000000000000000 R09: ff55eefd81aafb10
[ 19.908401] R10: ff466cc1ffcd2fa8 R11: 0000000000000003 R12: ff466ca2d7f90830
[ 19.923763] R13: ff466ca2d7f80010 R14: ff466ca2d7f80000 R15: ff466ca2d7fa5990
[ 19.939132] FS: 0000000000000000(0000) GS:ff466cc1db2ee000(0000) knlGS:0000000000000000
[ 19.956551] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 19.968920] CR2: 00007f45f54e3de8 CR3: 000000207e624003 CR4: 0000000000f71ef0
[ 19.984282] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 19.999645] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 20.015010] PKRU: 55555554
[ 20.020820] Call Trace:
[ 20.026075] <TASK>
[ 20.030581] amdgpu_fence_driver_hw_fini+0xfc/0x130 [amdgpu]
[ 20.042894] amdgpu_device_fini_hw+0xb7/0x2c6 [amdgpu]
[ 20.054152] amdgpu_driver_load_kms.cold+0x18/0x2e [amdgpu]
[ 20.066323] amdgpu_pci_probe+0x1cf/0x470 [amdgpu]
[ 20.076775] local_pci_probe+0x42/0x90
[ 20.084839] work_for_cpu_fn+0x17/0x30
[ 20.092899] process_one_work+0x188/0x340
[ 20.101523] worker_thread+0x256/0x3a0
[ 20.109584] ? __pfx_worker_thread+0x10/0x10
[ 20.118767] kthread+0xf9/0x240
[ 20.125519] ? __pfx_kthread+0x10/0x10
[ 20.133578] ret_from_fork+0x31/0x50
[ 20.141268] ? __pfx_kthread+0x10/0x10
[ 20.149326] ret_from_fork_asm+0x1a/0x30
[ 20.157765] </TASK>
[ 20.162457] ---[ end trace 0000000000000000 ]---

and then continues to barf for a while longer. Full dmesg attached.

When I do a full power cycle it's okay again for a few kexecs, but will ultimately go unhappy again.
I'm doing a 'normal' systemctl kexec, which I figure should more or less shut things down normally. It's not like a crash-kexec -- which is a whole other story and can be expected to cause trouble.
dmesg.gz
Description: application/gzip