On Mon, May 18, 2026 at 02:23:36PM +0100, Yury Murashka wrote:
> pci_aer_clear_nonfatal_status() is not called when AER recovery fails.
> If a new AER error is subsequently reported, the AER driver calls
> find_source_device() to find the source of the error. It rescans the
> whole bus and picks the first device reporting an AER error. Because the
> previous error was never cleared, the error is attributed to the wrong
> device and AER recovery is started for the wrong device.

I guess the rationale of the current behavior is that the devices
affected by the failed error recovery are basically in a broken state
once error recovery failed and so user intervention is required,
e.g. a remove/rescan via sysfs.

My question is, why is error recovery failing for the devices in the
first place? And what does the hierarchy look like?
(lspci -tv and lspci -vvv output please)

I also don't quite follow your assertion that (only) the first device
reporting an error is picked. The algorithm tries to collect *all*
error-reporting devices in the affected portion of the hierarchy.

Thanks,

Lukas
