On this list on 2025-06-16 Peter Zijlstra reported "amdgpu vs kexec":

https://lists.freedesktop.org/archives/amd-gfx/2025-June/126086.html

and there was follow-up with possible solutions but no reported
resolution. I'm trying to resolve the same issue and would welcome some
help.

In Peter's thread there were two suggested options for
amdgpu_pci_shutdown():

1. amdgpu_asic_reset(adev); (suggested by Mario Limonciello)
2. amdgpu_dpm_set_mp1_state(adev, PP_MP1_STATE_UNLOAD); (suggested by
   Alex Deucher)

I also went back through all commits to amdgpu_pci_shutdown() and saw
that in faefba95c9e8ca3 the call to amdgpu_pci_remove() was replaced by
amdgpu_suspend(), which gives a third option:

3. amdgpu_pci_remove();

I've tried all three, individually and combined (1 followed by 2
followed by 3), and none of them fixes it.

With (3) there is a stack trace before the kexec; after the restart
amdgpu triggers many traces and then the PC does a full power reset.
With (1) and (2) the kexec happens, then amdgpu triggers many traces and
the system hangs.

I have (large) logs for all three captured over a serial port and can
provide a link to them if required.

As Peter found, I also needed to add EXPORT_SYMBOL(kexec_in_progress) so
that the loadable module would link.
My eventual combined patch showing all three options is:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 848e6b7db482d..81384eaada538 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -35,6 +35,7 @@
 #include <linux/cc_platform.h>
 #include <linux/console.h>
 #include <linux/dynamic_debug.h>
+#include <linux/kexec.h>
 #include <linux/module.h>
 #include <linux/mmu_notifier.h>
 #include <linux/pm_runtime.h>
@@ -2583,9 +2584,21 @@ amdgpu_pci_shutdown(struct pci_dev *pdev)
 	 */
 	if (!amdgpu_passthrough(adev))
 		adev->mp1_state = PP_MP1_STATE_UNLOAD;
+#ifdef CONFIG_KEXEC
+	if (kexec_in_progress)
+		adev->mp1_state = PP_MP1_STATE_UNLOAD;
+#endif
 	amdgpu_device_prepare(dev);
 	amdgpu_device_suspend(dev, true);
 	adev->mp1_state = PP_MP1_STATE_NONE;
+#ifdef CONFIG_KEXEC
+	if (kexec_in_progress)
+		amdgpu_asic_reset(adev);
+#endif
+#ifdef CONFIG_KEXEC
+	if (kexec_in_progress)
+		amdgpu_pci_remove(pdev);
+#endif
 }

 static int amdgpu_pmops_prepare(struct device *dev)
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 95c585c6ddc33..5c4d88df7466b 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -54,6 +54,7 @@ atomic_t __kexec_lock = ATOMIC_INIT(0);

 /* Flag to indicate we are going to kexec a new kernel */
 bool kexec_in_progress = false;
+EXPORT_SYMBOL(kexec_in_progress);

 bool kexec_file_dbg_print;


My workstation uses a FirePro W4100 (Southern Islands / Cape Verde /
TAHITI). Until now it was using radeon. Until maybe Spring 2025 it would
successfully kexec but I've lost track of the version where it worked.
Recently wanted to get kexec working again so thought I'd try amdgpu and
found it has the same issue.

In working on this over the last few days saw ccd3b4c7c37fbbd3
"drm/amdgpu: Use amdgpu by default on SI dedicated GPUs (v2)" so need to
resolve this issue somehow and it seems amdgpu is the place to do that.

Thanks.

Tj.
