When a device lacks an error_detected callback, AER recovery fails and
the device is left in a disconnected state. This can mask serious
hardware issues during development and testing.

Add a module parameter 'aer_unrecoverable_fatal' that panics the kernel
instead, making such failures immediately visible. The parameter
defaults to false to preserve existing behavior.

Signed-off-by: Breno Leitao <[email protected]>
---
In environments where all hardware must be fully operational, silently
leaving a device in a disconnected state after an AER recovery failure
is unacceptable. This is common in high-reliability systems, production
servers, and testing infrastructure where a degraded system should not
continue running.

This patch adds a module parameter that allows administrators to enforce
a strict policy: if a device cannot recover from an AER error, the
kernel panics instead of continuing with degraded hardware. This ensures
that hardware failures are immediately visible and can trigger
appropriate remediation (restart, failover, alerting).
---
 Documentation/admin-guide/kernel-parameters.txt | 9 +++++++++
 drivers/pci/pcie/err.c                          | 3 +++
 drivers/pci/pcie/portdrv.c                      | 7 +++++++
 drivers/pci/pcie/portdrv.h                      | 1 +
 4 files changed, 20 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 1058f2a6d6a8c..ff95c24280e3c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5240,6 +5240,15 @@ Kernel parameters
 		nomsi	Do not use MSI for native PCIe PME signaling (this makes
 			all PCIe root ports use INTx for all services).
 
+	pcieportdrv.aer_unrecoverable_fatal=
+			[PCIE] Panic on unrecoverable AER errors:
+			0  Log the error and leave the device in a disconnected
+			   state (default).
+			1  Panic the kernel when a device cannot recover from an
+			   AER error (no error_detected callback). Useful for
+			   high-reliability systems where degraded hardware is
+			   unacceptable.
+
 	pcmv=		[HW,PCMCIA] BadgePAD 4
 
 	pd_ignore_unused
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index bebe4bc111d75..788484791902e 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -73,6 +73,9 @@ static int report_error_detected(struct pci_dev *dev,
 		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
 			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
 			pci_info(dev, "can't recover (no error_detected callback)\n");
+			if (aer_unrecoverable_fatal)
+				panic("AER: %s: no error_detected callback\n",
+				      pci_name(dev));
 		} else {
 			vote = PCI_ERS_RESULT_NONE;
 		}
diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
index 38a41ccf79b9a..a411f60ff50ce 100644
--- a/drivers/pci/pcie/portdrv.c
+++ b/drivers/pci/pcie/portdrv.c
@@ -22,6 +22,13 @@
 #include "../pci.h"
 #include "portdrv.h"
 
+#ifdef CONFIG_PCIEAER
+bool aer_unrecoverable_fatal;
+module_param(aer_unrecoverable_fatal, bool, 0644);
+MODULE_PARM_DESC(aer_unrecoverable_fatal,
+		 "Panic if a device cannot recover from an AER error (default: false)");
+#endif
+
 /*
  * The PCIe Capability Interrupt Message Number (PCIe r3.1, sec 7.8.2) must
  * be one of the first 32 MSI-X entries.  Per PCI r3.0, sec 6.8.3.1, MSI
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index bd29d1cc7b8bd..6c67b18de93c9 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -29,6 +29,7 @@ extern bool pcie_ports_dpc_native;
 
 #ifdef CONFIG_PCIEAER
 int pcie_aer_init(void);
+extern bool aer_unrecoverable_fatal;
 #else
 static inline int pcie_aer_init(void) { return 0; }
 #endif

---
base-commit: 6bd9ed02871f22beb0e50690b0c3caf457104f7c
change-id: 20260206-pci-362cf172187f

Best regards,
-- 
Breno Leitao <[email protected]>
