On 2025/5/22 19:47, Manivannan Sadhasivam wrote:
On Sat, May 17, 2025 at 12:55:14AM +0800, Hans Zhang wrote:
The following series introduces a new kernel command-line option aer_panic
to enhance error handling for PCIe Advanced Error Reporting (AER) in
mission-critical environments. This feature ensures deterministic recovery
from fatal PCIe errors by triggering a controlled kernel panic when device
recovery fails, avoiding indefinite system hangs.
Problem Statement
In systems where unresolved PCIe errors (e.g., bus hangs) occur,
traditional error recovery mechanisms may leave the system unresponsive
indefinitely. This is unacceptable for high-availability environments
requiring prompt recovery via reboot.
Solution
The aer_panic option forces a kernel panic on unrecoverable AER errors.
This bypasses prolonged recovery attempts and ensures immediate reboot.
You should not panic the kernel when a PCI error occurs (even if it is a fatal
one). You should instead try to reset the root complex. For that you need this
series that got merged recently:
https://lore.kernel.org/all/20250508-pcie-reset-slot-v4-0-7050093e2...@linaro.org
PS: You need to populate the slot_reset callback in your controller driver to
reset the controller in the event of a fatal AER error or link down.
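To make the suggestion above concrete, here is a minimal user-space sketch in C of the pattern: the recovery path calls a controller-supplied reset hook instead of panicking. Only the name slot_reset comes from the mail. The struct, the enum, and both functions are hypothetical stand-ins, not the kernel's actual API, so check the linked series for the real callback signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these types are kernel APIs. */
enum recovery_result { RECOVERED, DISCONNECT };

struct host_bridge {
    const char *name;
    /* Optional controller-specific reset hook; returns 0 on success. */
    int (*slot_reset)(struct host_bridge *bridge);
    bool link_up;
    int resets; /* number of times the hook has run */
};

/* Example hook a controller driver might provide (assumed behaviour:
 * the reset always succeeds and brings the link back up). */
static int demo_slot_reset(struct host_bridge *bridge)
{
    bridge->resets++;
    bridge->link_up = true;
    return 0;
}

/* Model of the recovery path for a fatal error or link down:
 * reset the controller through the hook rather than panic. */
static enum recovery_result handle_fatal_error(struct host_bridge *bridge)
{
    assert(bridge != NULL);

    bridge->link_up = false; /* a fatal error takes the link down */

    if (bridge->slot_reset == NULL)
        return DISCONNECT; /* no hook populated: nothing can be reset */

    if (bridge->slot_reset(bridge) != 0 || !bridge->link_up)
        return DISCONNECT; /* reset failed or link did not come back */

    return RECOVERED;
}
```

In this model, a bridge that populates the hook returns RECOVERED after one reset, while a bridge that leaves it NULL returns DISCONNECT. That is why the hook has to be filled in by the controller driver for recovery to work.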
Dear Mani,
Thank you for your reply. I will take a look at the submission record
you provided.
Best regards,
Hans