On Mon, Jul 21, 2025 at 03:13:40AM -0700, Breno Leitao wrote:
> Introduce a generic infrastructure for tracking recoverable hardware
> errors (HW errors that did not cause a panic) and record them for vmcore
> consumption. This aids post-mortem crash analysis tools by preserving
> a count and timestamp for the last occurrence of such errors.
>
> This patch adds centralized logging for three common sources of

"Add centralized... "

> recoverable hardware errors:
>
>   - PCIe AER Correctable errors
>   - x86 Machine Check Exceptions (MCE)
>   - APEI/CPER GHES corrected or recoverable errors
>
> hwerror_tracking is write-only at kernel runtime, and it is meant to be
> read from vmcore using tools like crash/drgn. For example, this is how
> it looks like when opening the crashdump from drgn.
>
> 	>>> prog['hwerror_tracking']
> 	(struct hwerror_tracking_info [3]){
> 		{
> 			.count = (int)844,
> 			.timestamp = (time64_t)1752852018,
> 		},
> 		...
>

I'm still missing the justification why rasdaemon can't be used here.
You did explain it already in past emails.

> +enum hwerror_tracking_source {
> +	HWE_RECOV_AER,
> +	HWE_RECOV_MCE,
> +	HWE_RECOV_GHES,
> +	HWE_RECOV_MAX,
> +};

Are we confident this separation will serve all cloud dudes?

> +
> +#ifdef CONFIG_VMCORE_INFO
> +void hwerror_tracking_log(enum hwerror_tracking_source src);
> +#else
> +void hwerror_tracking_log(enum hwerror_tracking_source src) {};
> +#endif
> +
>  #endif /* LINUX_VMCORE_INFO_H */
> diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
> index e066d31d08f89..23d7ddcd55cdd 100644
> --- a/kernel/vmcore_info.c
> +++ b/kernel/vmcore_info.c
> @@ -31,6 +31,13 @@ u32 *vmcoreinfo_note;
>  /* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */
>  static unsigned char *vmcoreinfo_data_safecopy;
>
> +struct hwerror_tracking_info {
> +	int __data_racy count;
> +	time64_t __data_racy timestamp;
> +};
> +
> +static struct hwerror_tracking_info hwerror_tracking[HWE_RECOV_MAX];
> +
>  Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
>  			  void *data, size_t data_len)
>  {
> @@ -118,6 +125,17 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
>  }
>  EXPORT_SYMBOL(paddr_vmcoreinfo_note);
>
> +void hwerror_tracking_log(enum hwerror_tracking_source src)

A function should have a verb in its name explaining what it does:

hwerr_log_error_type()

or so.

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette