On 2025/10/10 18:36, Breno Leitao wrote:
Introduce a generic infrastructure for tracking recoverable hardware
errors (HW errors that are visible to the OS but do not cause a panic)
and record them for vmcore consumption. This aids post-mortem crash
analysis tools by preserving a count and timestamp for the last
occurrence of such errors. Correctable errors, on the other hand, which
the OS typically remains unaware of because the underlying hardware
handles them transparently, are less relevant for crash dump analysis
and are therefore NOT tracked in this infrastructure.

Add centralized logging for recoverable hardware errors, classified
by the hardware subsystem that raised the notification.

hwerror_data is write-only at kernel runtime and is meant to be read
from a vmcore using tools like crash/drgn. For example, this is what it
looks like when opening the crash dump in drgn:

        >>> prog['hwerror_data']
        (struct hwerror_info[1]){
                {
                        .count = (int)844,
                        .timestamp = (time64_t)1752852018,
                },
                ...
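
The layout above suggests a small per-class table that is only ever
written at runtime. A minimal userspace sketch of that idea, using C11
atomics in place of the kernel's atomic helpers, might look like the
following. hwerror_data, struct hwerror_info, its count and timestamp
fields, and hwerr_log_error_type() come from this patch description;
the enum and its member names, the bounds check, and the use of time()
for the timestamp are assumptions made for illustration only.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Assumed error classes, one slot per hardware subsystem. */
enum hwerr_error_type {
	HWERR_RECOV_CPU,
	HWERR_RECOV_MEMORY,
	HWERR_RECOV_PCI,
	HWERR_RECOV_OTHERS,	/* e.g. MCE errors, per the v5 changelog */
	HWERR_RECOV_MAX,
};

/* Fields mirror the drgn output: a counter and a last-seen time. */
struct hwerror_info {
	atomic_int count;
	_Atomic int64_t timestamp;
};

/* Zero-initialized; written here, read only from a crash dump. */
static struct hwerror_info hwerror_data[HWERR_RECOV_MAX];

/* Record one recoverable error of the given class. */
void hwerr_log_error_type(enum hwerr_error_type src)
{
	/* Ignore out-of-range classes instead of corrupting memory. */
	if ((unsigned int)src >= HWERR_RECOV_MAX)
		return;

	atomic_fetch_add(&hwerror_data[src].count, 1);
	atomic_store(&hwerror_data[src].timestamp, (int64_t)time(NULL));
}
```

In this sketch the counter and the timestamp are each updated
atomically, but not as a single unit, so a dump taken between the two
stores could show a count that is one ahead of the timestamp.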

This helps fleet operators quickly triage whether a crash may be
influenced by recoverable hardware errors (which execute an uncommon
code path in the kernel), especially when recoverable errors occurred
shortly before a panic, such as the bug fixed by
commit ee62ce7a1d90 ("page_pool: Track DMA-mapped pages and unmap them
when destroying the pool")
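
The "shortly before a panic" check can be sketched as a small helper
that a post-mortem tool might apply to the values read from the dump.
This is only an illustration: struct hwerror_snapshot,
hwerr_seen_shortly_before(), and the window_secs parameter are
hypothetical names, and the choice of window is left to the operator.

```c
#include <stdbool.h>
#include <stdint.h>

/* Plain copy of the two fields read out of the vmcore. */
struct hwerror_snapshot {
	int count;
	int64_t timestamp;	/* seconds, same clock as panic_time */
};

/*
 * Return true if at least one recoverable error was recorded and the
 * last one happened no more than window_secs before the panic.
 */
bool hwerr_seen_shortly_before(const struct hwerror_snapshot *snap,
			       int64_t panic_time, int64_t window_secs)
{
	if (snap->count <= 0)
		return false;		/* nothing was ever recorded */
	if (snap->timestamp > panic_time)
		return false;		/* inconsistent clocks, do not guess */

	return panic_time - snap->timestamp <= window_secs;
}
```

With the values from the drgn output above (count 844, timestamp
1752852018), a panic at 1752852048 falls inside a 60 second window but
outside a 10 second one.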

This is not intended to replace full hardware diagnostics, but it
provides a fast way to correlate hardware events with kernel panics.

Rare machine check exceptions, like those indicated by mce_flags.p5 or
mce_flags.winchip, are not accounted for in this method, as they fall
outside the intended usage scope for this feature’s user base.

Suggested-by: Tony Luck <[email protected]>
Suggested-by: Shuai Xue <[email protected]>
Signed-off-by: Breno Leitao <[email protected]>
Reviewed-by: Shuai Xue <[email protected]>
---
Changes in v5:
- Move the headers to uapi file (Dave Hansen)
- Use atomic operations in the tracking struct (Dave Hansen)
- Drop the MCE enum type, and track MCE errors as "others"
- Document this feature better
- Link to v4: https://lore.kernel.org/r/[email protected]

Changes in v4:
- Split the error by hardware subsystem instead of kernel
   subsystem/driver (Shuai)
- Do not count the corrected errors, only focusing on recoverable errors (Shuai)
- Link to v3: https://lore.kernel.org/r/[email protected]

Changes in v3:
- Add more information about this feature in the commit message
   (Borislav Petkov)
- Rename the function to hwerr_log_error_type() and use hwerr as
   prefix (Borislav Petkov)
- Make the empty function static inline (kernel test robot)
- Link to v2: https://lore.kernel.org/r/[email protected]

Changes in v2:
- Split the counter by recoverable error (Tony Luck)
- Link to v1: https://lore.kernel.org/r/[email protected]
---
  Documentation/driver-api/hw-recoverable-errors.rst | 60 ++++++++++++++++++++++
  arch/x86/kernel/cpu/mce/core.c                     |  4 ++
  drivers/acpi/apei/ghes.c                           | 36 +++++++++++++

For the APEI part,

Reviewed-by: Hanjun Guo <[email protected]>

Thanks
Hanjun
