On Sat, May 17, 2025 at 12:55:14AM +0800, Hans Zhang wrote:
> The following series introduces a new kernel command-line option aer_panic
> to enhance error handling for PCIe Advanced Error Reporting (AER) in
> mission-critical environments. This feature ensures deterministic recovery
> from fatal PCIe errors by triggering a controlled kernel panic when device
> recovery fails, avoiding indefinite system hangs.
>
> Problem Statement
> In systems where unresolved PCIe errors (e.g., bus hangs) occur,
> traditional error recovery mechanisms may leave the system unresponsive
> indefinitely. This is unacceptable for high-availability environments
> requiring prompt recovery via reboot.
>
> Solution
> The aer_panic option forces a kernel panic on unrecoverable AER errors.
> This bypasses prolonged recovery attempts and ensures immediate reboot.
>
You should not panic the kernel when a PCI error occurs (even if it is a fatal
one). You should instead try to reset the root complex. For that you need this
series that got merged recently:
https://lore.kernel.org/all/20250508-pcie-reset-slot-v4-0-7050093e2...@linaro.org

PS: You need to populate the slot_reset callback in your controller driver to
reset the controller in the event of a fatal AER error or link down.

- Mani

-- 
மணிவண்ணன் சதாசிவம்