Hello,

> This work is mostly due to Jon Pan-Doh and Karolina Stolarek. I rebased
> this to v6.15-rc1, factored out some of the trace and statistics updates,
> and added some minor cleanups.
>
> Proposal
> ========
>
> When using native AER, spammy devices can flood kernel logs with AER errors
> and slow/stall execution. Add per-device per-error-severity ratelimits for
> more robust error logging. Allow userspace to configure ratelimits via
> sysfs knobs.
>
> Motivation
> ==========
>
> Inconsistent PCIe error handling, exacerbated at datacenter scale (myriad
> of devices), affects repairability flows for fleet operators.
>
> Exposing PCIe errors/debug info in-band for a userspace daemon (e.g.
> rasdaemon) to collect/pass on to repairability services will allow for more
> predictable repair flows and decrease machine downtime.
>
> Background
> ==========
>
> AER error spam has been observed many times, both publicly (e.g. [1], [2],
> [3]) and privately. While it usually occurs with correctable errors, it can
> happen with uncorrectable errors (e.g. during new HW bringup).
>
> There have been previous attempts to add ratelimits to AER logs ([4], [5]).
> The most recent attempt[5] has many similarities with the proposed
> approach.

I have been testing this series locally with and without faults triggered
using the AER error injection facility. No issues thus far. And, as such...

Tested-by: Krzysztof Wilczyński <kwilczyn...@kernel.org>

Thank you!

	Krzysztof