On Thu, Feb 29, 2024 at 10:55:14AM -0000, Michael van Elst wrote:
> The OS could be smart, lock out bad memory regions, recover some
> errors by e.g. paging in text data again or even use mirrored RAM
> (with motherboard support).

IIRC Intel Ice Lake introduced mechanisms that let kernels recover from
poisoned-data situations, but I don't know how far this has been
implemented.  Ideally an app could be given some sort of notification
about poisoned data instead of the kernel blindly panicking.
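
To make the idea concrete, here is a rough sketch of the kind of policy
I have in mind.  It is a toy model, not any kernel's actual code: the
page categories, the names PageKind and handle_poison, and the action
strings are all made up for illustration.

```python
from enum import Enum, auto


class PageKind(Enum):
    """Hypothetical classification of the page that contained poison."""
    CLEAN_FILE = auto()  # e.g. program text; an intact copy exists on disk
    DIRTY_USER = auto()  # user data with no other copy
    KERNEL = auto()      # kernel-owned memory


def handle_poison(kind: PageKind) -> str:
    """Pick a recovery action for a poisoned page (assumed policy)."""
    if kind is PageKind.CLEAN_FILE:
        # Data can be recovered by reading it from the backing store again.
        return "retire frame, page data in again from backing store"
    if kind is PageKind.DIRTY_USER:
        # Data is lost, but only the owning process needs to hear about it.
        return "retire frame, notify owning process"
    # No safe recovery path for kernel memory in this sketch.
    return "panic"
```

The point is only that a panic becomes the last resort rather than the
default response.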

> >A lot of fragile chipset specific code to get that.
> 
> Indeed.

There's an expectation that the platform-specific bits would be
abstracted for now through ACPI, and eventually codified into a hardware
RAS controller with a standardized driver attached either as a PCIe
function or as ACPI-discovered MMIO space.  Part of EDAC is not only
getting notifications of the errors, but also being able to map physical
addresses back to physical components (DIMMs or CXL devices) so you know
what to replace or block.
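
As a rough illustration of that address-to-component mapping, here is a
minimal sketch.  It assumes a flat, non-interleaved layout; real memory
controllers interleave across channels and ranks, so the actual decode
is considerably more involved.  The table contents and the names RANGES
and locate are hypothetical.

```python
from bisect import bisect_right

# Hypothetical layout: (start, end exclusive, component label).
# Assumes contiguous, non-interleaved ranges sorted by start address.
RANGES = [
    (0x0_0000_0000, 0x4_0000_0000, "DIMM A1"),
    (0x4_0000_0000, 0x8_0000_0000, "DIMM B1"),
    (0x8_0000_0000, 0xC_0000_0000, "CXL device 0"),
]
STARTS = [start for start, _end, _label in RANGES]


def locate(paddr):
    """Return the component backing a physical address, or None."""
    # Find the last range whose start is <= paddr.
    i = bisect_right(STARTS, paddr) - 1
    if i >= 0:
        _start, end, label = RANGES[i]
        if paddr < end:
            return label
    return None  # address not backed by any known component
```

With a table like this, an error reported at 0x4_0000_1000 would resolve
to "DIMM B1", which tells you which part to pull or which range to block.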

-- 
  Aaron J. Grier | "Not your ordinary poofy goof." | agr...@poofygoof.com
  "The price of reliability is the pursuit of the utmost simplicity.  It
   is a price which the very rich find most hard to pay."  -- Tony Hoare
