Here's an interesting thought exercise and a real problem I have to tackle.
I have researchers who want to run magma codes for three weeks or so at a time. The process is unfortunately sequential in nature, and magma doesn't support checkpointing (as far as I know; I don't know much about magma).

So the question is: what kind of system could one design or buy, using any combination of hardware and software, that would guarantee this program runs for three weeks or so without failing? By "fail" I mean a system-level error, i.e. a memory fault, a CPU fault, or network I/O slipping (NFS timeout), as opposed to "there's a bug in magma", which has already bitten us a few times.

There's probably some commercial or "unreleased" commercial product on the market that might fill this need, but I'm also looking for something "creative".

Three weeks isn't a big stretch compared to some of the other codes I've heard about around the DOE that run for months, but it's still pretty painful to have a run fail 2.5 weeks in and have to restart.

Most modern hardware would probably support this without issue, but I'm looking for more of a guarantee than a prayer.

Double bonus points for anything that runs at high clock speeds (>3 GHz).

Any thoughts?

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf