On Jun 27, 2011, at 6:57 AM, Ken Lloyd wrote:

> One point I've been trying to put forward in my domain is, currently, high
> performance computing != high reliability computing. Not by a long shot.
> Seems that they are orthogonally coupled.

I think that has been true in the past - an emerging community is trying to bring the two back together, but the tradeoffs do pose challenges. In some ways, the RTE part of the equation is more manageable than the MPI side, IMO.

> There are many pieces to this problem-puzzle. Some of these pieces are
> inter-related. Some of my work has dealt with adaptive approaches -
> especially re: cascade, and what Ralph refers to as "rewiring", or routing
> issues.

Most of my development is taking place in the embedded world re ORCM (an OMPI-related project based on ORTE). I try to port most of it back to the OMPI trunk, but have fallen woefully behind over the last six months or so.

ORCM has recently started getting contributions from a couple of universities, one focused on prediction/migration and another on wireup, that should translate directly to OMPI. There is some code already in the trunk re mapping to avoid failure cascades. In my "spare" time, I continue to work on it.

Always open to exchanging ideas :-)

> If and when I have anything I believe meaningful to contribute, I will.
>
> On Mon, 2011-06-27 at 08:32 -0400, Josh Hursey wrote:
>> It has been on my to-do list for a while to start a FAQ listing of the
>> various resilience/FT related activities in and around Open MPI. This would
>> provide a starting location for users and new developers could go to for an
>> overview of each of the features, and how to activate/use the feature.
>>
>> I'll try to bump that up the priority list and post a message once it is
>> ready. Probably a month or so off since I need to collect some information
>> from various developers.
>>
>> -- Josh
>>
>> On Sun, Jun 26, 2011 at 6:01 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> I think we're some ways away from declaring a "resilient ORTE".
>> Josh and I have been committing pieces of it over the last two years, and
>> Wes just committed another piece the other day that might have been titled
>> "fault tolerant OOB" as it primarily addressed maintaining comm routing
>> during node failures.
>>
>> Setting aside the obvious MPI issues, there are several
>> branches/organizations working different aspects of the ORTE problem,
>> including:
>>
>> * fault prediction and proactive migration
>> * mapping algorithms to minimize failure cascades
>> * simultaneous failure handling
>> * alternative wiring methods that eliminate the OOB routing issues
>>
>> etc. We expect most of those developments to arrive over the next 6-12
>> months. Once that has occurred, we'll probably be close to what we would
>> call a "resilient" system.
>>
>> Until then, we are improving, but still far from "resilient".
>>
>> On Jun 24, 2011, at 10:24 AM, Ken Lloyd wrote:
>>
>>> Josh and Wesley,
>>>
>>> Will you be presenting Resilient ORTE at Resilience 2011 in Bordeaux?
>>>
>>> http://xcr.cenit.latech.edu/resilience2011/
>>>
>>> =====================
>>> Kenneth A. Lloyd
>>> CEO - Director of Systems Science
>>> Watt Systems Technologies Inc.
>>> www.wattsys.com
>>> kenneth.ll...@wattsys.com
>>>
>>> This e-mail is covered by the Electronic Communications Privacy Act, 18
>>> U.S.C. 2510-2521 and is intended only for the addressee named above. It may
>>> contain privileged or confidential information. If you are not the
>>> addressee you must not copy, distribute, disclose or use any of the
>>> information in it. If you have received it in error please delete it and
>>> immediately notify the sender.
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>> --
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey