apei: Add NVIDIA GHES vendor CPER record handler

Jonathan Cameron Wed, 25 Mar 2026 10:23:00 -0700

On Wed, 25 Mar 2026 10:36:28 -0500
Bjorn Helgaas <[email protected]> wrote:


> On Wed, Mar 25, 2026 at 07:34:50PM +0800, Kai-Heng Feng wrote:
> > On Wed Mar 25, 2026 at 12:15 AM CST, Bjorn Helgaas wrote:  
> > > On Tue, Mar 24, 2026 at 05:33:06PM +0800, Kai-Heng Feng wrote:  
> > >> On 2026-03-20 09:52, Bjorn Helgaas wrote:  
> > >> > On Thu, Mar 19, 2026 at 07:13:09PM +0800, Kai-Heng Feng wrote:  
> > >> > > Add support for decoding NVIDIA-specific CPER sections delivered via
> > >> > > the APEI GHES vendor record notifier chain. NVIDIA hardware generates
> > >> > > vendor-specific CPER sections containing error signatures and
> > >> > > diagnostic register dumps. This implementation registers a
> > >> > > notifier_block with the GHES vendor record notifier and decodes these
> > >> > > sections, printing error details via dev_info().
> > >> > >
> > >> > > The driver binds to ACPI device NVDA2012, present on NVIDIA server
> > >> > > platforms. The NVIDIA CPER section contains a fixed header with error
> > >> > > metadata (signature, error type, severity, socket) followed by
> > >> > > variable-length register address-value pairs for hardware diagnostics.
> > >> > >
> > >> > > This work is based on libcper [0].
> > >> > >
> > >> > > Example output:
> > >> > > nvidia-ghes NVDA2012:00: NVIDIA CPER section, error_data_length: 544
> > >> > > nvidia-ghes NVDA2012:00: signature: CMET-INFO
> > >> > > nvidia-ghes NVDA2012:00: error_type: 0
> > >> > > nvidia-ghes NVDA2012:00: error_instance: 0
> > >> > > nvidia-ghes NVDA2012:00: severity: 3
> > >> > > nvidia-ghes NVDA2012:00: socket: 0
> > >> > > nvidia-ghes NVDA2012:00: number_regs: 32
> > >> > > nvidia-ghes NVDA2012:00: instance_base: 0x0000000000000000
> > >> > > nvidia-ghes NVDA2012:00: register[0]: address=0x8000000100000000 value=0x0000000100000000
> > >> >
> > >> > Is there a convenient way to connect NVDA2012:00 with the actual
> > >> > device?  I assume this is typically a PCIe device?  How would we
> > >> > relate this with PCIe errors?  
> > >>
> > >> The CPER report comes from ARM RAS firmware and is not necessarily
> > >> related to a PCIe device.
> > >
> > > Right, I know CPER is more general than just PCI/PCIe.
> > >
> > > But in this case, I think NVDA2012 probably *is* a PCIe device.  How
> > > would we figure out which one?  If we have to manually do an acpidump,
> > > figure out which NVDA2012 is :00, and look for an _ADR or something,
> > > that doesn't really seem convenient for multi-NVDA2012 situations.  
> > 
> > It's actually just an ACPI device:
> > Device (CPER)
> > {
> >   Name (_HID, "NVDA2012")  // _HID: Hardware ID
> >   Name (_UID, 0x00)  // _UID: Unique ID
> >   Method (_DSM, 4, Serialized) // _DSM: Device-Specific Method
> > }
> > 
> > And that's it.  
> 
> Weird.  There's nothing for a driver to operate the device with except
> _DSM?  The device doesn't need any MMIO resources?  I would expect some
> resources described by a _CRS method or some native enumeration protocol
> like PCI BARs.
> 
> The _UID 0x00 matches the "00" in "NVDA2012:00", but I think that's a
> coincidence; I think the "00" in the device name came from the ida_alloc()
> in acpi_device_set_name(), not from _UID.
> 
> So I still don't know how you would identify the correct part in a system
> with multiple NVDA2012 devices.  I do see the "socket" and "instance_base"
> in the output.  Maybe that would help, but those seem to be
> device-specific, and it seems like we should have a generic mechanism.

It's not unusual in ACPI terms.  There are a few cases, even in the ACPI spec,
of IDs that exist just to say some feature is there.

ACPI0017 is an example. Simply says, there be CXL here, go look for the
tables.

Here this device is used to indicate that a platform should be ready to handle
a particular type of error record.  If it happened to expose any other
interfaces, then I agree it would need resources or a _DSM etc.

Basically it's a workaround for the lack of discoverability in APEI /
ACPI error reporting. Could use an _OSC bit for the same job but then
we'd run out of those fast.  Device IDs are near free.

Jonathan
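
For reference, the figures in the quoted example output are self-consistent
under one assumed layout. The sketch below is a userspace illustration, not
code from the patch: the names nv_cper_header, nv_cper_reg,
nv_cper_expected_len() and nv_cper_parse() are hypothetical, and the field
order and widths are an assumption modelled on libcper's NVIDIA section. With
a 32-byte fixed header and 16 bytes per address-value pair, number_regs: 32
gives 32 + 32 * 16 = 544 bytes, which matches error_data_length: 544.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * ASSUMED layout of the fixed header (modelled on libcper, not taken from
 * the patch under review). All names here are hypothetical.
 */
struct nv_cper_header {
	char     signature[16];   /* e.g. "CMET-INFO", NUL padded */
	uint16_t error_type;
	uint16_t error_instance;
	uint8_t  severity;
	uint8_t  socket;
	uint8_t  number_regs;     /* count of address-value pairs that follow */
	uint8_t  reserved;
	uint64_t instance_base;
};

/* One variable-length entry following the header. */
struct nv_cper_reg {
	uint64_t address;
	uint64_t value;
};

/* The arithmetic below relies on there being no compiler padding. */
static_assert(sizeof(struct nv_cper_header) == 32, "unexpected header padding");
static_assert(sizeof(struct nv_cper_reg) == 16, "unexpected register padding");

/* Bytes a section carrying number_regs register pairs should occupy. */
size_t nv_cper_expected_len(uint8_t number_regs)
{
	return sizeof(struct nv_cper_header) +
	       (size_t)number_regs * sizeof(struct nv_cper_reg);
}

/*
 * Copy the header out of buf and check that the declared register pairs
 * fit within len. Returns 0 on success, -1 if the buffer is too short.
 * The memcpy assumes a little-endian host, since CPER data is little-endian.
 */
int nv_cper_parse(const void *buf, size_t len, struct nv_cper_header *hdr)
{
	if (len < sizeof(*hdr))
		return -1;
	memcpy(hdr, buf, sizeof(*hdr));
	if (len < nv_cper_expected_len(hdr->number_regs))
		return -1;
	return 0;
}
```

A caller would pass the section payload and its length, and only walk the
register pairs (starting at offset sizeof(struct nv_cper_header)) after
nv_cper_parse() returns 0. Checking the declared count against the reported
length before touching the pairs is the point of the sketch.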



Re: [PATCH v2 3/3] acpi/apei: Add NVIDIA GHES vendor CPER record handler
