kevin.bowl...@kev009.com (Kevin Bowling) writes:
>Servers tend to have BMCs, so you can execute 'ipmitool sensors' and
>'ipmi sel elist' to get the information out.

ECC information is usually not provided by sensors. ECC errors may be
listed in the SEL, but even this usually occurs only when some
undocumented limit is reached. Often the messages also do not indicate
the memory module that produced the error.

>Linux has the 'EDAC' subsystem but I don't think it gains you so much
>if you have a BMC.

It gives you the data from the ECC circuits, immediately. The data is no
longer hidden by the BMC: you get precise information, and you can apply
your own policies for e.g. replacing memory modules or migrating
services to other hardware.

The OS could also be smart: lock out bad memory regions, recover some
errors by e.g. paging in text data again, or even use mirrored RAM (with
motherboard support).

>A lot of fragile chipset specific code to get that.

Indeed.

Greetings,
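
P.S. To illustrate the "apply your own policies" point, here is a minimal
Python sketch that reads the per-controller counters EDAC exports under
/sys/devices/system/edac/mc. It assumes a Linux system with an EDAC
driver loaded for the chipset. The needs_attention() policy and its
ce_limit value are purely hypothetical; pick whatever thresholds suit
your hardware.

```python
from pathlib import Path

# Standard EDAC sysfs location on Linux (only present if an EDAC driver is loaded).
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_int(path):
    """Return the integer stored in a sysfs file, or None if missing or unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def read_edac_counts(root=EDAC_ROOT):
    """Collect corrected (ce) and uncorrected (ue) error counts per memory controller."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        counts[mc.name] = {
            "ce": read_int(mc / "ce_count"),
            "ue": read_int(mc / "ue_count"),
        }
    return counts


def needs_attention(counts, ce_limit=100):
    """Hypothetical policy: flag a controller on any uncorrected error,
    or when corrected errors exceed ce_limit (an arbitrary example value)."""
    flagged = []
    for name, c in counts.items():
        if (c["ue"] or 0) > 0 or (c["ce"] or 0) > ce_limit:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    counts = read_edac_counts()
    for name, c in counts.items():
        print(f"{name}: ce={c['ce']} ue={c['ue']}")
    print("needs attention:", needs_attention(counts))
```

On a machine without EDAC support the directory does not exist, so
read_edac_counts() simply returns an empty dict.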