Re: [Beowulf] Not quite Walmart, or, living without ECC?

Jim Lux Fri, 16 Nov 2007 14:37:30 -0800

At 09:43 AM 11/16/2007, Peter St. John wrote:

David,
I just asked the local NT goon, "do you use ECC for the servers?" and
he answered, "you have to". What he considers a server-class mobo
requires ECC and he added that the tendency is now to FB-DIMM (fully
buffered, http://en.wikipedia.org/wiki/FBDIMM). This suggests to me
that next year(s) commodity mobos will be ECC.


Of course the additional expense keeps your question interesting for
now. I would imagine that if something is done to cover **software**
errors, which are aeternal :-), such as periodic checkpointing, then
adding memcheck stuff as Tony suggests seems reasonable.

The cluster environment adds some interesting aspects to the problem.
If you have only one computer, then ECC (or, even more robust,
something like triple modular redundancy, TMR) is an easy way to go,
especially because there's no software development overhead. FWIW, we
use the term EDAC (Error Detection and Correction) to refer to the
whole thing. An ECC (Error Correcting Code) is just the particular
coding used in an overall EDAC approach. Common ECCs are the Hamming
codes, with 4 check bits for 8 data bits (5 if you also want
double-error detection), etc.

However, say you had a billion work packets to do, and you're
processing them on 1000 machines. If the work packet has some
mechanism for self check, maybe a strategy is just to redo the work
when the check fails. If you have a rate constraint, then you can
add extra processors to keep the work rate up. I'm assuming here that
you have a trade between cheap, error-prone computers and expensive,
error-free ones.
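A rough sketch of that redo-on-failed-check idea, in Python. Everything
here is an assumption for illustration: the "work" is just summing the
payload bytes, the self check is a CRC32 carried with the packet and
re-verified against the node's in-memory copy after the work is done,
and the names (run_once, process_with_redo, nodes_needed) are made up.
The last function is the back-of-envelope sizing: if a fraction p of
attempts fail the check, each packet takes 1/(1-p) attempts on average.

```python
import math
import random
import zlib
from typing import Optional, Tuple


def run_once(payload: bytes, crc: int, upset_prob: float,
             rng: random.Random) -> Optional[int]:
    """One attempt on a cheap node; returns None if the self check fails."""
    mem = bytearray(payload)  # the node's working copy in non-ECC RAM
    if mem and rng.random() < upset_prob:
        # Simulated single-bit memory upset somewhere in the working copy.
        mem[rng.randrange(len(mem))] ^= 1 << rng.randrange(8)
    result = sum(mem)  # stand-in for the real computation
    # Self check: the working copy must still match the packet's CRC.
    if zlib.crc32(bytes(mem)) != crc:
        return None
    return result


def process_with_redo(payload: bytes, crc: int, upset_prob: float,
                      rng: random.Random,
                      max_attempts: int = 10) -> Tuple[int, int]:
    """Redo the packet until the check passes; returns (result, attempts)."""
    for attempt in range(1, max_attempts + 1):
        result = run_once(payload, crc, upset_prob, rng)
        if result is not None:
            return result, attempt
    raise RuntimeError("packet failed its self check on every attempt")


def nodes_needed(base_nodes: int, failure_prob: float) -> int:
    """Nodes required to hold the work rate when a fraction of runs is redone."""
    return math.ceil(base_nodes / (1.0 - failure_prob))


if __name__ == "__main__":
    packet = b"one work packet"
    print(process_with_redo(packet, zlib.crc32(packet), 0.3, random.Random(42)))
    print(nodes_needed(1000, 0.01))
```

This only catches upsets that the check can see. A hit that corrupts
the result after the check has passed would go through, which is part
of why the do-over approach depends so much on how good the self check
is.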

In some applications where there's a hard real-time constraint, the
option of a 'do-over' doesn't exist, so you're forced to fine-grained
redundancy (EDAC or TMR).

Likewise, if your work quanta are not amenable to a do-over (say,
all 1000 processors have to participate in lockstep in the next time
step, so having one die means all the rest wait until its do-over is
done), you're stuck with fine-grained redundancy.

Then, as you get into ECC, there's a whole lot of other issues... for
instance, you can do "software ECC" (this is popular on
spacecraft)... store critical values 3 times in different locations,
and then, before using them, do the compare and vote. This works if
your upset rate is low (i.e. you're not worried about a hit in the
CPU, but in something that is resident in memory for a long time) AND
if the access rate to that critical information is low.
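A minimal sketch of the store-three-times-and-vote idea, in Python. The
class name VotedValue is made up, and a Python list does not really put
the copies in "different locations" the way flight software would, so
treat this purely as an illustration of the vote. It uses a bitwise
majority, so it also recovers when two copies take hits in different
bit positions, and it rewrites all three copies on each read (a
"scrub") so upsets don't accumulate.

```python
class VotedValue:
    """Hold three copies of an integer and majority-vote on every read."""

    def __init__(self, value: int) -> None:
        self._copies = [value, value, value]

    def write(self, value: int) -> None:
        self._copies = [value, value, value]

    def read(self) -> int:
        a, b, c = self._copies
        # Bitwise majority: each result bit is whatever at least two
        # of the three copies agree on.
        voted = (a & b) | (a & c) | (b & c)
        # Scrub: write the voted value back over all three copies.
        self._copies = [voted, voted, voted]
        return voted


if __name__ == "__main__":
    gain = VotedValue(0xCAFE)
    gain._copies[1] ^= 0x0010  # simulate an upset in one stored copy
    print(hex(gain.read()))
```

Every read costs three loads, a vote, and three stores, which is why
this only makes sense for values that are read rarely.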

What these strategies attempt to do is spend more resources on bits
with more value. For instance, if you're transmitting digitized
voice or music, a hit on the MSB is more audible than one on the LSB,
so you might be able to choose an ECC that protects those bits better,
giving up some protection on the others. Maybe you have 16 data bits,
and you use the code only on the top 8, so you've got 21 total bits to
transmit 16 (13 for the protected byte with a SECDED Hamming code,
plus 8 unprotected), rather than 22 for 16 in a conventional Hamming
code. With compressed video this gets very interesting: errors in
the coarse resolution blocks are MUCH more visible than errors in the
little blocks.
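For the bit-count side of that trade, here is a small Python sketch. It
assumes a Hamming code extended with one overall parity bit (SECDED),
where a single-error-correcting code over m data bits needs the
smallest r with 2^r >= m + r + 1 check bits. The function names are
mine, not from any library.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r such that 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def secded_total_bits(data_bits: int) -> int:
    """Data bits + Hamming check bits + one overall parity bit."""
    return data_bits + hamming_check_bits(data_bits) + 1


def partial_protection_total_bits(data_bits: int, protected_bits: int) -> int:
    """Total bits sent when only the top `protected_bits` are coded."""
    unprotected = data_bits - protected_bits
    return secded_total_bits(protected_bits) + unprotected


if __name__ == "__main__":
    print("all 16 bits coded:", secded_total_bits(16))
    print("top 8 of 16 coded:", partial_protection_total_bits(16, 8))
```

Check bits grow only logarithmically with word size, so coding a
shorter slice of the word saves fewer bits than you might expect. The
bigger wins from unequal protection tend to come from using a stronger
code on the important bits rather than a shorter one.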

OTOH.. it rapidly gets to where it's easier just to EDAC
everything. If you're building ASICs or trying to squeeze every last
bit per second out of your channel, it's worth it to tailor
things. If you're writing software, and the labor for that is the
big cost contributor, spending money on blanket EDAC is a no-brainer.



Jim Lux



