[+cc Lukas]

On Mon, May 18, 2026 at 02:23:36PM +0100, Yury Murashka wrote:
> pci_aer_clear_nonfatal_status() is not called when AER recovery fails.
> If a new AER error is subsequently reported, the AER driver calls
> find_source_device() to find the source of the error. It rescans the
> whole bus and picks the first device reporting an AER error. Because the
> previous error was never cleared, the error is attributed to the wrong
> device and AER recovery is started for the wrong device.
>
> Add a kernel boot parameter pci=aer_clear_on_recovery_failure to clear
> AER error status even when recovery fails, preventing stale errors from
> causing incorrect device identification on subsequent AER events.

Why should we add a kernel parameter for this? How would a user decide
whether to use the parameter? Are there cases where we find the source
of the first error, but we *wouldn't* want to clear it if recovery
fails?
