On Tuesday, April 29, 2025 8:20:55 PM Central European Summer Time Yazen Ghannam wrote: > On Tue, Apr 29, 2025 at 07:21:09PM +0200, Fabio M. De Francesco wrote: > > When Firmware First is enabled, BIOS handles errors first and then it > > makes them available to the kernel via the Common Platform Error Record > > (CPER) sections (UEFI 2.10 Appendix N). Linux parses the CPER sections > > via one of two similar paths, either ELOG or GHES. > > > > Currently, ELOG and GHES show some inconsistencies in how they report to > > userspace via trace events. > > > > Therfore make the two mentioned paths act similarly by tracing the CPER > > CXL Protocol Error Section (UEFI v2.10, Appendix N.2.13) signaled by the > > I/O Machine Check Architecture and reported by BIOS in FW-First. > > > > Cc: Dan Williams <dan.j.willi...@intel.com> > > Signed-off-by: Fabio M. De Francesco <fabio.m.de.france...@linux.intel.com> > > --- > > drivers/acpi/acpi_extlog.c | 60 ++++++++++++++++++++++++++++++++++++++ > > drivers/cxl/core/ras.c | 6 ++++ > > include/cxl/event.h | 2 ++ > > 3 files changed, 68 insertions(+) > > > > diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c > > index 7d7a813169f1..8f2ff3505d47 100644 > > --- a/drivers/acpi/acpi_extlog.c > > +++ b/drivers/acpi/acpi_extlog.c > > @@ -12,6 +12,7 @@ > > #include <linux/ratelimit.h> > > #include <linux/edac.h> > > #include <linux/ras.h> > > +#include <cxl/event.h> > > #include <acpi/ghes.h> > > #include <asm/cpu.h> > > #include <asm/mce.h> > > @@ -157,6 +158,60 @@ static void extlog_print_pcie(struct cper_sec_pcie > > *pcie_err, > > } > > } > > > > +static void > > +extlog_cxl_cper_handle_prot_err(struct cxl_cper_sec_prot_err *prot_err, > > + int severity) > > +{ > > +#ifdef CONFIG_ACPI_APEI_PCIEAER > > Why not apply this check on the function prototype? > This function is static. > > Reference: Documentation/process/coding-style.rst > Section 21) Conditional Compilation > > > + struct cxl_cper_prot_err_work_data wd; > > + u8 *dvsec_start, *cap_start; > > + > > + if (!(prot_err->valid_bits & PROT_ERR_VALID_AGENT_ADDRESS)) { > > + pr_err_ratelimited("CXL CPER invalid agent type\n"); > > + return; > > + } > > + > > + if (!(prot_err->valid_bits & PROT_ERR_VALID_ERROR_LOG)) { > > + pr_err_ratelimited("CXL CPER invalid protocol error log\n"); > > + return; > > + } > > + > > + if (prot_err->err_len != sizeof(struct cxl_ras_capability_regs)) { > > + pr_err_ratelimited("CXL CPER invalid RAS Cap size (%u)\n", > > + prot_err->err_len); > > + return; > > + } > > + > > + if (!(prot_err->valid_bits & PROT_ERR_VALID_SERIAL_NUMBER)) > > + pr_warn(FW_WARN "CXL CPER no device serial number\n"); > > Is this a requirement (in the spec) that we should warn users about? > > The UEFI spec says that serial number is only used if "CXL agent" is a > "CXL device". > > "CXL ports" won't have serial numbers. So this will be a false warning > for port errors. > I'll add a test and print that warning only if agent is a device (RCD, DEVICE, LD, FMLD). > > > + > > + switch (prot_err->agent_type) { > > + case RCD: > > + case DEVICE: > > + case LD: > > + case FMLD: > > + case RP: > > + case DSP: > > + case USP: > > + memcpy(&wd.prot_err, prot_err, sizeof(wd.prot_err)); > > + > > + dvsec_start = (u8 *)(prot_err + 1); > > + cap_start = dvsec_start + prot_err->dvsec_len; > > + > > + memcpy(&wd.ras_cap, cap_start, sizeof(wd.ras_cap)); > > + wd.severity = cper_severity_to_aer(severity); > > + break; > > + default: > > + pr_err_ratelimited("CXL CPER invalid agent type: %d\n", > > "invalid" is too harsh given that the specs may be updated. Maybe say > "reserved" or "unknown" or "unrecognized" instead. > > Hopefully things will settle down to where a user will be able to have a > system with newer CXL "agents" without *requiring* a kernel update. :) > I'll replace "invalid" with "unknown". > > Thanks, > Yazen > Thanks,
Fabio