Re: [Beowulf] Servers Too Hot? Intel Recommends a Luxurious Oil Bath

Lux, Jim (337C) Wed, 05 Sep 2012 11:29:20 -0700

Yes.. this is something that has been researched and tested in the laboratory.  
I don't know that anyone has actually tried reconfiguring around a damaged 
piece of an FPGA, if for no other reason than permanent damage in a 
reconfigurable FPGA is extremely unusual (and probably hasn't ever occurred).  
There are soft upsets in the configuration memory, and the Virtex and Virtex II 
have a potential failure mode where an upset in just the wrong place could 
cause damage (having two logic element outputs fighting each other), but it's 
very unlikely.


There's a fair amount of test data on radiation behavior (klabs.org or MAPLD 
are places to look).  I'm not sure there's a failure mechanism (with high 
enough probability) that causes a hard failure of just some gates. (These parts 
are typically latchup-immune, for instance).  I suppose some sufficiently high 
energy particle could damage a few gates permanently.  You'd need very high 
Linear Energy Transfer, though.   There's a paper by Fuller, et al, out there 
where they zapped a Virtex with 2068 MeV Au ions, looking to see if latchup 
could be observed at any LET below 125 MeV-cm^2/mg  (this is the upper bound 
for galactic cosmic rays).  No latchup detected.  They did see an increase in 
current, but it's because of the configuration upsets causing internal logic 
contention, and went away when the device was reconfigured.  (fluence was 
1E7-1E8 ions/cm^2, which is HUGE compared to what you see in real life.  There 
were some changes in current that stuck around for a few hours, but gradually 
annealed away)

As far as upsets go, typical predicted upset rates are on the order of 2 
upsets/device day in LEO up to 5.9 upsets/device day in GEO.  With flare 
enhancement, it's like 21 upsets/device day for LEO and 81.5 for GEO.  (of 
course, life is better than this.. in most designs, the vast majority of 
configuration bits are "don't care", so you wouldn't see the upset..  a typical 
multiplier is 4:1. That is, half an upset/device day for LEO)  (all these are 
for the XQVR300)
(another source reports a cross section for proton SEU of 5E-13 cm^2/bit.. the 
device has, say, 6E6 bits, so you can figure out what kind of proton flux you 
need to get a given upset rate)
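
For anyone who wants to turn the crank on those numbers, here is a rough sketch of the arithmetic. The rates, the 4:1 derating, the per-bit cross section and the 6E6 bit count are the ones quoted above; the function names are just made up for illustration, and the flux calculation assumes the cross section is flat with proton energy and that protons are the only contributor, which is a simplification. With these inputs the device cross section works out to about 3E-6 cm^2, so one upset per day needs roughly 3.9 protons/cm^2/s.

```python
# Back-of-envelope upset arithmetic using the figures quoted in the post.
RAW_UPSETS_PER_DEVICE_DAY = {
    "LEO": 2.0,
    "GEO": 5.9,
    "LEO, flare": 21.0,
    "GEO, flare": 81.5,
}
DERATING = 4.0             # "4:1": roughly 1 in 4 config upsets matters
SIGMA_PER_BIT_CM2 = 5e-13  # proton SEU cross section, cm^2/bit
CONFIG_BITS = 6e6          # "say, 6E6 bits"
SECONDS_PER_DAY = 86400.0


def observable_rate(raw_per_day, derating=DERATING):
    """Upsets/device-day that land on bits the design actually uses."""
    return raw_per_day / derating


def device_cross_section(sigma_bit=SIGMA_PER_BIT_CM2, bits=CONFIG_BITS):
    """Whole-device proton SEU cross section in cm^2."""
    return sigma_bit * bits


def proton_flux_for_rate(upsets_per_day, sigma_device=None):
    """Proton flux (protons/cm^2/s) that would give the requested rate.

    Assumes an energy-independent cross section (a simplification).
    """
    if sigma_device is None:
        sigma_device = device_cross_section()
    fluence_per_day = upsets_per_day / sigma_device  # protons/cm^2/day
    return fluence_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    for env, raw in RAW_UPSETS_PER_DEVICE_DAY.items():
        print(f"{env:11s} raw {raw:5.1f}  observable "
              f"{observable_rate(raw):6.2f} upsets/device-day")
    print(f"device cross section: {device_cross_section():.1e} cm^2")
    print(f"flux for 1 upset/day: {proton_flux_for_rate(1.0):.2f} "
          "protons/cm^2/s")
```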

And, of course, now there's a rad hardened Virtex 5 available (you too can own 
one for about $80k/copy).. 1Mrad(Si) total dose, config mem upset rate in GEO 
3.8E-10 errors/bit/day. Single Event Functional Interrupt (SEFI) of 
configuration control logic (this would prevent you from reconfiguring on the 
fly) in GEO is once every 10,000 years.
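
To put those two numbers in mission terms, here is a small sketch. The per-bit upset rate and the 10,000 year SEFI interval are the figures above; treating SEFIs as a Poisson process, the 15 year mission length, and the 1E7 bit count in the example are my assumptions for illustration only (the bit count is a placeholder, not the real size of the part's configuration memory).

```python
import math

CONFIG_UPSET_RATE = 3.8e-10    # errors/bit/day in GEO (quoted above)
SEFI_INTERVAL_YEARS = 10_000   # mean years between SEFIs in GEO (quoted above)


def expected_config_upsets_per_day(config_bits):
    """Expected configuration upsets per device-day for a given bit count."""
    return CONFIG_UPSET_RATE * config_bits


def prob_at_least_one_sefi(mission_years):
    """P(at least one SEFI), assuming a constant-rate Poisson process."""
    # -expm1(-x) equals 1 - exp(-x) but keeps precision for small x.
    return -math.expm1(-mission_years / SEFI_INTERVAL_YEARS)


if __name__ == "__main__":
    bits = 1e7   # hypothetical placeholder bit count
    years = 15   # hypothetical mission length
    print(f"{expected_config_upsets_per_day(bits):.1e} config upsets/day")
    print(f"P(>=1 SEFI in {years} years) = {prob_at_least_one_sefi(years):.2%}")
```

Under those assumptions the chance of even one SEFI over a 15 year mission is on the order of 0.15%.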

So it's not really clear that you NEED to be able to reconfigure around damage..

We've only been flying Xilinx Virtex parts for long durations since 2005 (Mars 
Reconnaissance Orbiter) (there might be some other earlier experiments.. CANDOS 
used a couple of Virtex II parts on Shuttle, but that was 2 weeks in 2003 and 
they only operated for 10s of hours).  We do periodic scrubbing/reloading of the 
configuration memory, and I'm not sure we even know if there was a transient 
upset (that is, we don't read it back, we just rewrite, blindly).  There's some 
DoD comm payloads that use Virtex parts, and their mitigation strategy for 
configuration upsets is to have two devices and ping pong between them.. while 
chip 1 is being configured, use chip 2, when done, flip, reconfigure chip 2 and 
use chip 1.
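
The ping pong scheme is easy to picture as a two-state alternation. This is only a toy sketch of the role swapping described above, not how any actual payload implements it; the chip labels and the `ping_pong` name are made up for illustration.

```python
from itertools import islice


def ping_pong(chips=("chip 1", "chip 2")):
    """Yield (in_use, being_reconfigured) pairs, swapping roles each cycle.

    The first cycle reconfigures chips[0] while chips[1] carries the load,
    matching "while chip 1 is being configured, use chip 2".
    """
    in_use, reconfiguring = chips[1], chips[0]
    while True:
        yield in_use, reconfiguring
        # Reconfiguration finished: flip roles.
        in_use, reconfiguring = reconfiguring, in_use


if __name__ == "__main__":
    for cycle, (in_use, reconfiguring) in enumerate(islice(ping_pong(), 4)):
        print(f"cycle {cycle}: using {in_use}, reconfiguring {reconfiguring}")
```

The point of the alternation is that a freshly loaded device is always the one being switched in, so any configuration upset only lives for at most one cycle without anyone having to detect it.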

When all is said and done, reconfiguring to get around a human coding error is 
actually much more likely.

Jim Lux

From: [email protected] [mailto:[email protected]] On 
Behalf Of Nathan Moore
Sent: Wednesday, September 05, 2012 8:24 AM
To: [email protected]
Subject: Re: [Beowulf] Servers Too Hot? Intel Recommends a Luxurious Oil Bath


> On Tue, 4 Sep 2012, Ellis H. Wilson III wrote:
Which is why I was suggesting that, "Maybe the whole thing is just
built, sealed for good, primed with [hydrogen/oil/he/whatever], started,
allowed to slowly degrade over time and finally tossed when the still
working equipment is sufficiently dated."

I remember an "ancient" IBM technical article about the BlueGene, here: 
http://researcher.watson.ibm.com/researcher/files/us-ajayr/SysJ_BlueGene.pdf

In the work (or maybe it was a closely related paper), the authors make the 
point that as core count increases and feature size decreases, cpu units will 
have to be fault tolerant, e.g. if cosmic rays have toasted 10% of your chip's 
cores, it should still be able to function.  Related, this is one of the great 
beauties of FPGAs.  Jim Lux can probably tell us if this would be real, but it 
would seem to make sense to program a space probe (i.e. Voyager type) with an 
FPGA emulated CPU for the sake of damage survivability.  In the worst case that 
the probe encounters something unpleasant and part of the FPGA is damaged, 
perhaps the rest of the LUTs in the FPGA could be reprogrammed to produce a 
less powerful, yet still functional, controller.  This would take the "field 
programmable" aspect of the device to a new height...

Nathan

_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
