From: Bjorn Helgaas <bhelg...@google.com>

This work is mostly due to Jon Pan-Doh and Karolina Stolarek.  I rebased
this to v6.15-rc1, factored out some of the trace and statistics updates,
and added some minor cleanups.

Proposal
========

When using native AER, spammy devices can flood kernel logs with AER errors
and slow/stall execution. Add per-device per-error-severity ratelimits for
more robust error logging. Allow userspace to configure ratelimits via
sysfs knobs.

Motivation
==========

Inconsistent PCIe error handling, exacerbated at datacenter scale (myriad
of devices), affects repairabilitiy flows for fleet operators.

Exposing PCIe errors/debug info in-band for a userspace daemon (e.g.
rasdaemon) to collect/pass on to repairability services will allow for more
predictable repair flows and decrease machine downtime.

Background
==========

AER error spam has been observed many times, both publicly (e.g. [1], [2],
[3]) and privately. While it usually occurs with correctable errors, it can
happen with uncorrectable errors (e.g. during new HW bringup).

There have been previous attempts to add ratelimits to AER logs ([4], [5]).
The most recent attempt[5] has many similarities with the proposed
approach.


v6:
- Rebase to v6.15-rc1
- Initialize struct aer_err_info completely before using it
- Log DPC Error Source ID only when it's valid
- Consolidate AER Error Source ID logging to one place
- Tidy Error Source ID bus/dev/fn decoding using macros
- Rename aer_print_port_info() to aer_print_source()
- Consolidate trace events and statistic updates to one non-ratelimited place
- Save log level in struct aer_err_info instead of passing as parameter
v5: https://lore.kernel.org/r/20250321015806.954866-1-pan...@google.com
- Handle multi-error AER by evaluating ratelimits once and storing result
- Reword/rename commit messages/functions/variable
v4: https://lore.kernel.org/r/20250320082057.622983-1-pan...@google.com
- Fix bug where trace not emitted with malformed aer_err_info
- Extend ratelimit to malformed aer_err_info
- Update commit messages with patch motivation
- Squash AER sysfs filename change (Patch 8)
v3: https://lore.kernel.org/r/20250319084050.366718-1-pan...@google.com
- Ratelimit aer_print_port_info() (drop Patch 1)
- Add ratelimit enable toggle
- Move trace outside of ratelimit
- Split log level (Patch 2) into two
- More descriptive documentation/sysfs naming
v2: https://lore.kernel.org/r/20250214023543.992372-1-pan...@google.com
- Rebased on top of pci/aer (6.14.rc-1)
- Split series into log and IRQ ratelimits (defer patch 5)
- Dropped patch 8 (Move AER sysfs)
- Added log level cleanup patch[7] from Karolina's series
- Fixed bug where dpc errors didn't increment counters
- "X callbacks suppressed" message on ratelimit release -> immediately
- Separate documentation into own patch
v1: https://lore.kernel.org/r/20250115074301.3514927-1-pan...@google.com

[1] https://bugzilla.kernel.org/show_bug.cgi?id=215027
[2] https://bugzilla.kernel.org/show_bug.cgi?id=201517
[3] https://bugzilla.kernel.org/show_bug.cgi?id=196183
[4] 
https://lore.kernel.org/linux-pci/20230606035442.2886343-2-grund...@chromium.org/
[5] 
https://lore.kernel.org/linux-pci/cover.1736341506.git.karolina.stola...@oracle.com/
[6]
https://lore.kernel.org/linux-pci/8bcb8c9a7b38ce3bdaca5a64fe76f08b0b337511.1742202797.git.k
arolina.stola...@oracle.com/
[7]
https://lore.kernel.org/linux-pci/edd77011aafad4c0654358a26b4e538d0c5a321d.1736341506.git.k
arolina.stola...@oracle.com/


Bjorn Helgaas (9):
  PCI/DPC: Initialize aer_err_info before using it
  PCI/DPC: Log Error Source ID only when valid
  PCI/AER: Consolidate Error Source ID logging in aer_print_port_info()
  PCI/AER: Extract bus/dev/fn in aer_print_port_info() with
    PCI_BUS_NUM(), etc
  PCI/AER: Move aer_print_source() earlier in file
  PCI/AER: Initialize aer_err_info before using it
  PCI/AER: Simplify pci_print_aer()
  PCI/AER: Update statistics early in logging
  PCI/AER: Combine trace_aer_event() with statistics updates

Jon Pan-Doh (4):
  PCI/AER: Rename aer_print_port_info() to aer_print_source()
  PCI/AER: Introduce ratelimit for error logs
  PCI/AER: Add ratelimits to PCI AER Documentation
  PCI/AER: Add sysfs attributes for log ratelimits

Karolina Stolarek (3):
  PCI/AER: Check log level once and remember it
  PCI/AER: Make all pci_print_aer() log levels depend on error type
  PCI/AER: Rename struct aer_stats to aer_report

 ...es-aer_stats => sysfs-bus-pci-devices-aer} |  34 ++
 Documentation/PCI/pcieaer-howto.rst           |  16 +-
 drivers/pci/pci-sysfs.c                       |   1 +
 drivers/pci/pci.h                             |   5 +-
 drivers/pci/pcie/aer.c                        | 346 ++++++++++++------
 drivers/pci/pcie/dpc.c                        |  49 ++-
 include/linux/pci.h                           |   2 +-
 7 files changed, 329 insertions(+), 124 deletions(-)
 rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => 
sysfs-bus-pci-devices-aer} (77%)

-- 
2.43.0


Reply via email to