Dennis Kubes wrote:
Do we know if this is a hardware issue? If it is possibly a software issue, I can dedicate some resources to tracking down bugs. I would just need a little guidance on where to start looking.

We don't know. The checksum mechanism is designed to catch hardware problems, so hardware must certainly be considered a likely cause. If it is instead a software bug, then it should be reproducible. Are you seeing any consistent patterns? If not, then I'd lean towards hardware.
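
To illustrate the kind of check being discussed, here is a minimal sketch in Python of how per-chunk checksums can flag a single flipped bit. It is not the project's actual code: the 512-byte chunk size, the use of CRC32 via zlib, and every function name below are assumptions made for illustration.

```python
import zlib
from typing import List

CHUNK_SIZE = 512  # assumed chunk size, for illustration only


def chunk_checksums(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[int]:
    """Return one CRC32 value per fixed-size chunk of data."""
    return [
        zlib.crc32(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def find_corrupt_chunks(data: bytes, expected: List[int],
                        chunk_size: int = CHUNK_SIZE) -> List[int]:
    """Return the indices of chunks whose CRC32 differs from the stored value."""
    actual = chunk_checksums(data, chunk_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Simulate a hardware fault by flipping one bit in a copy of data."""
    corrupted = bytearray(data)
    corrupted[byte_index] ^= 1 << bit
    return bytes(corrupted)


# Hypothetical 2048-byte block, i.e. four 512-byte chunks.
original = bytes(range(256)) * 8
stored = chunk_checksums(original)       # checksums recorded at write time

damaged = flip_bit(original, 700, 3)     # byte 700 lies in chunk index 1
print(find_corrupt_chunks(original, stored))  # clean data: no mismatches
print(find_corrupt_chunks(damaged, stored))   # the flipped bit is caught
```

Re-running a check like this against the same stored bytes is one way to look for the patterns mentioned above: a mismatch that appears in the same place every time is reproducible, while mismatches that move around between runs are more in line with flaky hardware.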

Michael Stack has some experience tracking down problems with flaky memory. Michael, did you use a test program to validate the memory on a node?

Again, do your nodes have ECC memory?

Doug
