It has been on my to-do list for a while to start a FAQ listing of the various resilience/FT related activities in and around Open MPI. This would give users and new developers a starting point for an overview of each feature and how to activate and use it.
I'll try to bump that up the priority list and post a message once it is ready. Probably a month or so off since I need to collect some information from various developers.

-- Josh

On Sun, Jun 26, 2011 at 6:01 PM, Ralph Castain <r...@open-mpi.org> wrote:
> I think we're some ways away from declaring a "resilient ORTE". Josh and I
> have been committing pieces of it over the last two years, and Wes just
> committed another piece the other day that might have been titled "fault
> tolerant OOB" as it primarily addressed maintaining comm routing during node
> failures.
>
> Setting aside the obvious MPI issues, there are several
> branches/organizations working different aspects of the ORTE problem,
> including:
>
> * fault prediction and proactive migration
>
> * mapping algorithms to minimize failure cascades
>
> * simultaneous failure handling
>
> * alternative wiring methods that eliminate the OOB routing issues
>
> etc. We expect most of those developments to arrive over the next 6-12
> months. Once that has occurred, we'll probably be close to what we would
> call a "resilient" system.
>
> Until then, we are improving, but still far from "resilient".
>
>
> On Jun 24, 2011, at 10:24 AM, Ken Lloyd wrote:
>
> Josh and Wesley,
>
> Will you be presenting Resilient ORTE at Resilience 2011 in Bordeaux?
>
> http://xcr.cenit.latech.edu/resilience2011/
>
> =====================
> *Kenneth A. Lloyd*
> CEO - Director of Systems Science
> Watt Systems Technologies Inc.
> www.wattsys.com
> kenneth.ll...@wattsys.com
>
> This e-mail is covered by the Electronic Communications Privacy Act, 18
> U.S.C. 2510-2521 and is intended only for the addressee named above. It may
> contain privileged or confidential information. If you are not the addressee
> you must not copy, distribute, disclose or use any of the information in it.
> If you have received it in error please delete it and immediately notify the
> sender.
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey