Re: [PATCH v6 00/16] Rate limit AER logs

Krzysztof Wilczyński Tue, 20 May 2025 02:05:35 -0700

Hello,

> This work is mostly due to Jon Pan-Doh and Karolina Stolarek.  I rebased
> this to v6.15-rc1, factored out some of the trace and statistics updates,
> and added some minor cleanups.
> 
> Proposal
> ========
> 
> When using native AER, spammy devices can flood kernel logs with AER errors
> and slow/stall execution. Add per-device per-error-severity ratelimits for
> more robust error logging. Allow userspace to configure ratelimits via
> sysfs knobs.
> 
> Motivation
> ==========
> 
> Inconsistent PCIe error handling, exacerbated at datacenter scale (myriad
> of devices), affects repairabilitiy flows for fleet operators.
> 
> Exposing PCIe errors/debug info in-band for a userspace daemon (e.g.
> rasdaemon) to collect/pass on to repairability services will allow for more
> predictable repair flows and decrease machine downtime.
> 
> Background
> ==========
> 
> AER error spam has been observed many times, both publicly (e.g. [1], [2],
> [3]) and privately. While it usually occurs with correctable errors, it can
> happen with uncorrectable errors (e.g. during new HW bringup).
> 
> There have been previous attempts to add ratelimits to AER logs ([4], [5]).
> The most recent attempt[5] has many similarities with the proposed
> approach.


I have been testing this series locally with and without faults triggered
using the AER error injection facility.  No issues thus far.

And, as such...

Tested-by: Krzysztof Wilczyński <kwilczyn...@kernel.org>

Thank you!

        Krzysztof

Re: [PATCH v6 00/16] Rate limit AER logs

Reply via email to