kevin.bowl...@kev009.com (Kevin Bowling) writes:

>Servers tend to have BMCs, so you can execute 'ipmitool sensor' and
>'ipmitool sel elist' to get the information out.

ECC information is usually not exposed through the sensors. ECC errors
may be listed in the SEL, but usually only once some undocumented
threshold is reached. Often the messages do not even identify the
memory module that produced the error.


>Linux has the 'EDAC' subsystem but I don't think it gains you so much
>if you have a BMC.

It gives you the data from the ECC circuits immediately. The data is
no longer hidden by the BMC, you get precise information, and you can
apply your own policies, e.g. for replacing memory modules or migrating
services to other hardware.
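
As a rough illustration, assuming a Linux kernel with an EDAC driver
loaded for the chipset, the per-controller counters show up as
ce_count and ue_count files under /sys/devices/system/edac/mc and can
be polled directly. The helper names and the limit of 100 corrected
errors are made up for this sketch and are not a recommendation:

```python
from pathlib import Path

# Standard EDAC sysfs location on Linux; only populated if an EDAC
# driver for the memory controller is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_counter(path):
    """Return the integer stored in a sysfs counter file, or None."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def edac_counts(root=EDAC_ROOT):
    """Map each memory controller (mc0, mc1, ...) to its error counters."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        counts[mc.name] = {
            "ce": read_counter(mc / "ce_count"),  # corrected errors
            "ue": read_counter(mc / "ue_count"),  # uncorrected errors
        }
    return counts


def needs_attention(counts, ce_limit=100):
    """Hypothetical policy: flag any UE, or CEs at or above ce_limit."""
    flagged = []
    for name, c in counts.items():
        if (c["ue"] or 0) > 0 or (c["ce"] or 0) >= ce_limit:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    counts = edac_counts()
    for name, c in counts.items():
        print(name, "corrected:", c["ce"], "uncorrected:", c["ue"])
    print("flagged:", needs_attention(counts))
```

Run from cron or a monitoring agent, something like this lets you pick
the threshold yourself instead of waiting for the BMC to log an event.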

The OS could also be smart: lock out bad memory regions, recover from
some errors, e.g. by paging in text data again, or even use mirrored
RAM (with motherboard support).


>A lot of fragile chipset specific code to get that.

Indeed.


Greetings,
