It's a "hard problem"… The natural tendency is to solve the easy problems 
first, and only when backed into a corner do you take on the hard ones.  
Or… someone comes out of the background with a really novel approach.  I'm 
sure folks thought about error-correcting codes in an empirical way (e.g., 
parity bits), but Hamming put it all together in a nice, consistent theoretical 
framework.  Or Shannon, for that matter.
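(As an aside, the jump from ad hoc parity bits to Hamming's framework is easy to see in miniature. A sketch of Hamming(7,4) — four data bits protected by three parity bits, so any single flipped bit can be located and corrected; the bit layout here is the standard textbook one, not anything from this thread:

```python
def hamming74_encode(d):
    # d: four data bits [d1, d2, d3, d4]
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4      # parity over codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4      # parity over codeword positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4      # parity over codeword positions 4,5,6,7
    # codeword positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    # c: 7-bit codeword, possibly with one flipped bit
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the bad bit; 0 = clean
    if syndrome:
        c[syndrome - 1] ^= 1          # flip it back
    return [c[2], c[4], c[5], c[6]]   # recovered data bits
```

The "consistent framework" part is the syndrome: the three parity checks, read as a binary number, spell out exactly which bit is wrong — rather than parity merely saying "something broke, start over.")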


From: Deepak Singh <[email protected]>
Date: Friday, November 23, 2012 11:45 AM
To: Jim Lux <[email protected]>
Cc: Luc Vereecken <[email protected]>, 
"[email protected]" <[email protected]>, 
"[email protected]" <[email protected]>
Subject: Re: [Beowulf] Supercomputers face growing resilience problems

And this is the bit that concerns me the most.  At scale you should only be 
making two assumptions: (1) everything breaks all the time, and (2) you will 
have network partitions.  Checkpoint/restart is a lazy option that has no 
place in modern software. Yet there doesn't seem to be any priority on going 
beyond checkpoint/restart and rethinking software architecture. I would argue 
that's as important as figuring out manycore, or more so.

On Fri, Nov 23, 2012 at 6:44 AM, Lux, Jim (337C) 
<[email protected]> wrote:
a lot of HPC software design
assumes perfect hardware, or, that the hardware failure rate is
sufficiently low that a checkpoint/restart (or "do it all over from the
beginning") is an acceptable strategy.
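(For context, the pattern being critiqued here is simple enough to sketch: periodically persist solver state so a crash costs only the work since the last checkpoint rather than the whole run. The filenames, checkpoint interval, and toy workload below are illustrative assumptions, not anything from the thread:

```python
import os
import pickle

CKPT = "state.ckpt"  # hypothetical checkpoint file name

def load_state():
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "acc": 0}

def save_state(state):
    # Write to a temp file, then atomically rename, so a crash mid-write
    # never leaves a torn checkpoint behind.
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

state = load_state()
while state["step"] < 1000:
    state["acc"] += state["step"]   # stand-in for one unit of real work
    state["step"] += 1
    if state["step"] % 100 == 0:    # interval trades checkpoint I/O vs. rework
        save_state(state)
```

The objection upthread is precisely that this works only while failures are rare relative to the checkpoint interval: at scale, with something always broken, the lost-work and checkpoint-I/O costs stop being acceptable.)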

_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
