From: Bjorn Helgaas <bhelg...@google.com>

This work is mostly due to Jon Pan-Doh and Karolina Stolarek. I rebased
this to v6.15-rc1, factored out some of the trace and statistics updates,
and added some minor cleanups.

Proposal
========

When using native AER, spammy devices can flood kernel logs with AER
errors and slow/stall execution. Add per-device per-error-severity
ratelimits for more robust error logging. Allow userspace to configure
ratelimits via sysfs knobs.

Motivation
==========

Inconsistent PCIe error handling, exacerbated at datacenter scale (myriad
of devices), affects repairability flows for fleet operators. Exposing
PCIe errors/debug info in-band for a userspace daemon (e.g. rasdaemon) to
collect/pass on to repairability services will allow for more predictable
repair flows and decrease machine downtime.

Background
==========

AER error spam has been observed many times, both publicly (e.g. [1], [2],
[3]) and privately. While it usually occurs with correctable errors, it
can happen with uncorrectable errors (e.g. during new HW bringup).

There have been previous attempts to add ratelimits to AER logs ([4],
[5]). The most recent attempt[5] has many similarities with the proposed
approach.
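
As a rough illustration of the per-device, per-error-severity ratelimit
idea in the Proposal, here is a small userspace C sketch. It is not the
code from this series: every name in it (struct device_limits,
should_log(), the sev enum) is hypothetical, and the simple fixed-window
policy is an assumption chosen for brevity rather than a description of
the kernel's ratelimit implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; a userspace model, not kernel code. */
enum sev { SEV_CORRECTABLE, SEV_UNCORRECTABLE, SEV_COUNT };

struct ratelimit {
	uint64_t interval_ms;     /* window length; 0 disables limiting */
	unsigned int burst;       /* messages allowed per window */
	uint64_t window_start_ms; /* start of the current window */
	unsigned int printed;     /* messages allowed in this window */
	unsigned int missed;      /* messages suppressed in this window */
};

/* One independent limiter per severity, embedded in each device. */
struct device_limits {
	struct ratelimit rl[SEV_COUNT];
};

static void device_limits_init(struct device_limits *d,
			       uint64_t interval_ms, unsigned int burst)
{
	for (int i = 0; i < SEV_COUNT; i++)
		d->rl[i] = (struct ratelimit){
			.interval_ms = interval_ms,
			.burst = burst,
		};
}

/* Return true if a message of this severity may be logged at now_ms. */
static bool should_log(struct device_limits *d, enum sev sev, uint64_t now_ms)
{
	assert(sev < SEV_COUNT);
	struct ratelimit *rl = &d->rl[sev];

	if (rl->interval_ms == 0)
		return true;

	/* Open a fresh window once the previous one has expired. */
	if (now_ms - rl->window_start_ms >= rl->interval_ms) {
		rl->window_start_ms = now_ms;
		rl->printed = 0;
		rl->missed = 0;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}

	rl->missed++;
	return false;
}
```

Because each device carries its own limiter per severity, a device that
spams correctable errors exhausts only its own correctable budget;
its uncorrectable errors, and all errors from other devices, are still
logged. In this sketch, setting interval_ms to 0 stands in for a
userspace knob that turns the limit off.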

v6:
- Rebase to v6.15-rc1
- Initialize struct aer_err_info completely before using it
- Log DPC Error Source ID only when it's valid
- Consolidate AER Error Source ID logging to one place
- Tidy Error Source ID bus/dev/fn decoding using macros
- Rename aer_print_port_info() to aer_print_source()
- Consolidate trace events and statistic updates to one non-ratelimited
  place
- Save log level in struct aer_err_info instead of passing as parameter

v5: https://lore.kernel.org/r/20250321015806.954866-1-pan...@google.com
- Handle multi-error AER by evaluating ratelimits once and storing result
- Reword/rename commit messages/functions/variable

v4: https://lore.kernel.org/r/20250320082057.622983-1-pan...@google.com
- Fix bug where trace not emitted with malformed aer_err_info
- Extend ratelimit to malformed aer_err_info
- Update commit messages with patch motivation
- Squash AER sysfs filename change (Patch 8)

v3: https://lore.kernel.org/r/20250319084050.366718-1-pan...@google.com
- Ratelimit aer_print_port_info() (drop Patch 1)
- Add ratelimit enable toggle
- Move trace outside of ratelimit
- Split log level (Patch 2) into two
- More descriptive documentation/sysfs naming

v2: https://lore.kernel.org/r/20250214023543.992372-1-pan...@google.com
- Rebased on top of pci/aer (6.14.rc-1)
- Split series into log and IRQ ratelimits (defer patch 5)
- Dropped patch 8 (Move AER sysfs)
- Added log level cleanup patch[7] from Karolina's series
- Fixed bug where dpc errors didn't increment counters
- "X callbacks suppressed" message on ratelimit release -> immediately
- Separate documentation into own patch

v1: https://lore.kernel.org/r/20250115074301.3514927-1-pan...@google.com

[1] https://bugzilla.kernel.org/show_bug.cgi?id=215027
[2] https://bugzilla.kernel.org/show_bug.cgi?id=201517
[3] https://bugzilla.kernel.org/show_bug.cgi?id=196183
[4] https://lore.kernel.org/linux-pci/20230606035442.2886343-2-grund...@chromium.org/
[5] https://lore.kernel.org/linux-pci/cover.1736341506.git.karolina.stola...@oracle.com/
[6] https://lore.kernel.org/linux-pci/8bcb8c9a7b38ce3bdaca5a64fe76f08b0b337511.1742202797.git.karolina.stola...@oracle.com/
[7] https://lore.kernel.org/linux-pci/edd77011aafad4c0654358a26b4e538d0c5a321d.1736341506.git.karolina.stola...@oracle.com/

Bjorn Helgaas (9):
  PCI/DPC: Initialize aer_err_info before using it
  PCI/DPC: Log Error Source ID only when valid
  PCI/AER: Consolidate Error Source ID logging in aer_print_port_info()
  PCI/AER: Extract bus/dev/fn in aer_print_port_info() with
    PCI_BUS_NUM(), etc
  PCI/AER: Move aer_print_source() earlier in file
  PCI/AER: Initialize aer_err_info before using it
  PCI/AER: Simplify pci_print_aer()
  PCI/AER: Update statistics early in logging
  PCI/AER: Combine trace_aer_event() with statistics updates

Jon Pan-Doh (4):
  PCI/AER: Rename aer_print_port_info() to aer_print_source()
  PCI/AER: Introduce ratelimit for error logs
  PCI/AER: Add ratelimits to PCI AER Documentation
  PCI/AER: Add sysfs attributes for log ratelimits

Karolina Stolarek (3):
  PCI/AER: Check log level once and remember it
  PCI/AER: Make all pci_print_aer() log levels depend on error type
  PCI/AER: Rename struct aer_stats to aer_report

 ...es-aer_stats => sysfs-bus-pci-devices-aer} |  34 ++
 Documentation/PCI/pcieaer-howto.rst           |  16 +-
 drivers/pci/pci-sysfs.c                       |   1 +
 drivers/pci/pci.h                             |   5 +-
 drivers/pci/pcie/aer.c                        | 346 ++++++++++++------
 drivers/pci/pcie/dpc.c                        |  49 ++-
 include/linux/pci.h                           |   2 +-
 7 files changed, 329 insertions(+), 124 deletions(-)
 rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (77%)

-- 
2.43.0