At 09:43 AM 11/16/2007, Peter St. John wrote:
David,
I just asked the local NT goon, "do you use ECC for the servers?" and
he answered, "you have to". What he considers a server-class mobo
requires ECC and he added that the tendency is now to FB-DIMM (fully
buffered, http://en.wikipedia.org/wiki/FBDIMM). This suggests to me
that next year(s) commodity mobos will be ECC.

Of course the additional expense keeps your question interesting for
now. I would imagine that if something is done to cover **software**
errors, which are aeternal :-), such as periodic checkpointing, then
adding memcheck stuff as Tony suggests seems reasonable.

The cluster environment adds some interesting aspects to the problem. If you have only one computer, then ECC (or, even more robust, something like triple modular redundancy, TMR) is an easy way to go, especially because there's no software development overhead. FWIW, we use the term EDAC (Error Detection and Correction) to refer to the whole thing. An ECC (Error Correcting Code) is just the particular coding used in an overall EDAC approach. Common ECCs are the Hamming codes, with 4 check bits for 8 data bits (5 in the usual SECDED form, which also detects double errors), etc.
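As a rough illustration of what such a code does, here is a minimal sketch of a single-error-correcting Hamming code in Python. The function names (hamming_encode, hamming_correct) and the bit-list representation are my own assumptions for illustration; real memory controllers do this in hardware and normally add an overall parity bit for double-error detection, which this sketch leaves out.

```python
def hamming_encode(data_bits):
    """Encode a list of 0/1 data bits into a Hamming codeword.

    Positions are numbered from 1; positions that are powers of two
    hold check bits, all other positions hold data bits.
    """
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:      # smallest r with 2**r >= m + r + 1
        r += 1
    n = m + r
    code = [0] * (n + 1)             # index 0 is unused
    it = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):          # not a power of two: data position
            code[pos] = next(it)
    for i in range(r):
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity             # makes the parity over group p even
    return code[1:]


def hamming_correct(codeword):
    """Return (data_bits, syndrome) after fixing at most one flipped bit.

    A syndrome of 0 means no error was seen; otherwise it is the
    1-based position of the bit that was flipped back.
    """
    code = [0] + list(codeword)
    n = len(codeword)
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos
    if 0 < syndrome <= n:
        code[syndrome] ^= 1
    data = [code[pos] for pos in range(1, n + 1) if pos & (pos - 1)]
    return data, syndrome


word = [1, 0, 1, 1, 0, 0, 1, 0]      # 8 data bits
sent = hamming_encode(word)          # 12-bit codeword (8 data + 4 check)
hit = list(sent)
hit[6] ^= 1                          # simulate a single-bit upset
recovered, syndrome = hamming_correct(hit)
```

With two flipped bits this plain version miscorrects silently, which is why the extra overall parity bit is usually added in practice.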

However, say you had a billion work packets to do, and you're processing them on 1000 machines. If the work packet has some mechanism for self check, maybe a strategy is just to redo the work when the check fails. If you have a rate constraint, then you can add extra processors to keep the work rate up. This assumes that you have a trade between cheap, error-prone computers and expensive, error-free ones.
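A minimal sketch of the redo strategy, in Python. Everything here is hypothetical: the names process_with_redo and machines_needed, the compute/check callables, and the 1% failure rate in the example. It also assumes failures are independent and that the self check catches every bad result, in which case each packet needs on average 1/(1-p) attempts.

```python
import math


def process_with_redo(packets, compute, check, max_attempts=5):
    """Run compute() on each packet, redoing it until check() passes.

    packets: dict mapping packet id -> packet payload.
    Returns (results, total_attempts).
    """
    results = {}
    attempts = 0
    for pid, packet in packets.items():
        for _ in range(max_attempts):
            attempts += 1
            out = compute(packet)
            if check(packet, out):
                results[pid] = out
                break
        else:
            # The check never passed: give up rather than keep a bad result.
            raise RuntimeError(f"packet {pid} failed {max_attempts} times")
    return results, attempts


def machines_needed(base_machines, failure_prob):
    """Machines needed to hold the work rate when a fraction
    failure_prob of attempts has to be redone."""
    return math.ceil(base_machines / (1.0 - failure_prob))


# Hypothetical numbers: 1000 machines, 1% of packets fail their check.
extra = machines_needed(1000, 0.01) - 1000
```

The point of the arithmetic is that at low failure rates the extra hardware is a small fraction of the cluster, which is what makes the cheap-and-error-prone side of the trade attractive.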

In some applications where there's a hard real-time constraint, the option of 'do-over' doesn't exist, so you're forced to a fine-grained redundancy (EDAC or TMR).

Likewise, if your work quanta are not amenable to a do-over (say, all 1000 processors have to participate in lockstep in the next time step, so having one die means all the others wait until it's redone), you're forced to the same fine-grained redundancy.

Then, as you get into ECC, there are a whole lot of other issues... for instance, you can do "software ECC" (this is popular on spacecraft): store critical values 3 times in different locations, and then, before using them, do the compare and vote. This works if your upset rate is low (i.e. you're not worried about a hit in the CPU, but in something that is resident in memory for a long time) AND if the access rate to that critical information is low.
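The store-three-times-and-vote idea can be sketched like this in Python. The class name VotedValue is made up for illustration, and it assumes the critical value is a non-negative integer. Python gives no control over where the copies live, so this only shows the voting logic; on real hardware you would place the three copies in physically separate memory regions.

```python
class VotedValue:
    """Keep three copies of an integer and majority-vote on every read."""

    def __init__(self, value):
        self._copies = [value, value, value]

    def write(self, value):
        self._copies = [value, value, value]

    def read(self):
        a, b, c = self._copies
        # Bitwise majority: each result bit takes the value held by
        # at least two of the three copies.
        voted = (a & b) | (a & c) | (b & c)
        # Scrub: rewrite all copies so single hits do not accumulate.
        self._copies = [voted, voted, voted]
        return voted


gain = VotedValue(0x1234)
gain._copies[1] ^= 0x0010            # simulate an upset in one copy
value = gain.read()                  # the vote restores 0x1234
```

Voting bit by bit rather than on whole words means hits on different bits in different copies are still corrected, as long as no single bit position is wrong in two copies. The cost is three reads and a rewrite per access, which is why the approach suits values that are read rarely.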

What these strategies attempt to do is spend more resources on bits with more value. For instance, if you're transmitting digitized voice or music, a hit on the MSB is more audible than one on the LSB, so you might be able to choose an ECC that protects those bits better, giving up some protection on the others. Maybe you have 16 data bits, and you use the code only on the top 8, so you've got 21 total bits to transmit 16 (8 protected data bits + 5 check bits + 8 unprotected bits), rather than 22 for 16 in a conventional SECDED Hamming code. With compressed video this gets very interesting: errors in the coarse resolution blocks are MUCH more visible than errors in the little blocks.
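The bit budget for protecting only the high-order bits can be worked out from the standard Hamming bound: r check bits cover m data bits when 2^r >= m + r + 1, plus one more bit for double-error detection (SECDED). A small Python sketch, with function names that are my own:

```python
def hamming_check_bits(m, secded=False):
    """Check bits needed to protect m data bits with a Hamming code."""
    r = 0
    while (1 << r) < m + r + 1:      # smallest r with 2**r >= m + r + 1
        r += 1
    return r + (1 if secded else 0)


def total_bits(data_bits, protected_bits, secded=False):
    """Bits on the wire when only protected_bits of data_bits are coded."""
    return data_bits + hamming_check_bits(protected_bits, secded)


top_half_only = total_bits(16, 8, secded=True)     # 16 + 5 = 21
everything = total_bits(16, 16, secded=True)       # 16 + 6 = 22
```

For a 16-bit sample the saving is only one bit per word, because check bits grow logarithmically with the protected length. The trade gets more worthwhile when a stronger code is used on the valuable bits or when the protected fraction is small.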

OTOH, it rapidly gets to where it's easier just to EDAC everything. If you're building ASICs or trying to squeeze every last bit per second out of your channel, it's worth it to tailor things. If you're writing software, and the labor for that is the big cost contributor, spending money on blanket EDAC is a no-brainer.


Jim Lux



_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
