It's not that there aren't solutions available for specific problems. The challenge is that some of those solutions don't scale well, or aren't generalized enough to handle the gamut of non-EP (non-embarrassingly-parallel) problems.
I don't think there will be a silver bullet that fixes everything, but I think we'll evolve toward classes of solutions for certain classes of problems. After all, we don't use the same error-correction codes on memory and on hard disks. But the basic underlying comment is right: a lot of HPC software design assumes perfect hardware, or that the hardware failure rate is sufficiently low that checkpoint/restart (or "do it all over from the beginning") is an acceptable strategy. That's understandable: it's hard enough to figure out how to parallelize/clusterize the solution (having taken some decades to do it). I'm confident that over the next few decades we'll figure out how to deal with unreliable hardware and software (because, after all, software bugs are a problem too).

On 11/23/12 2:29 AM, "Luc Vereecken" <[email protected]> wrote:

>At the same time, there are APIs (e.g. HTCondor) that do not assume
>successful communications or computation; they are used in large
>distributed computing projects (SETI@HOME, FOLDING@HOME, distributed.net
>(though I don't think they have a toolbox available)). For
>embarrassingly parallel workloads, they can be a good match; for tightly
>coupled workloads, not always.
>
>Luc
>
>On 11/23/2012 5:19 AM, Justin YUAN SHI wrote:
>> The fundamental problem rests in our programming API. If you look at
>> MPI and OpenMP carefully, you will find that these and all others have
>> one common assumption: the application-level communication is always
>> successful.
>>
>> We knew full well that this cannot be true.

_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
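For what the checkpoint/restart strategy mentioned above looks like in the simplest case, here is an illustrative sketch (not from the thread; the state-file name and the "work" performed are hypothetical stand-ins): the program persists its state after every step, so a rerun after a crash resumes from the last checkpoint instead of doing it all over from the beginning.

```python
import json
import os

# Hypothetical state-file name, for illustration only.
CHECKPOINT = "state.json"

def load_checkpoint():
    """Resume from the last saved state, or start fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"step": 0, "total": 0}

def save_checkpoint(state):
    """Write to a temp file, then rename atomically, so a crash
    mid-write cannot leave a corrupt checkpoint behind."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT)

def run(n_steps):
    state = load_checkpoint()
    while state["step"] < n_steps:
        state["total"] += state["step"]   # stand-in for real work
        state["step"] += 1
        save_checkpoint(state)            # persist progress each step
    return state["total"]

print(run(10))  # sums 0..9, and would resume mid-run after a crash
```

Note this only masks fail-stop crashes between steps; it does not address the silent communication failures Justin raises, which is why it breaks down as failure rates climb.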
