Ah - ok, thanks for clarifying! I'm happy to leave it around, but wasn't sure if/where it fit into anyone's future plans.

Thanks
Ralph

On 3/6/08 9:13 AM, "Josh Hursey" <jjhur...@open-mpi.org> wrote:

> The checkpoint/restart work that I have integrated does not respond to
> failure at the moment. If a failure happens I want ORTE to terminate
> the entire job. I will then restart the entire job from a checkpoint
> file. This follows the 'all fall down' approach that users typically
> expect when using a global C/R technique.
>
> Eventually I want to integrate something better where I can respond to
> a failure with a recovery from inside ORTE. I'm not there yet, but
> hopefully in the near future.
>
> I'll let the UTK group talk about what they are doing with ORTE, but I
> suspect they will be taking advantage of the errmgr to help respond to
> failure and restart a single process.
>
>
> It is important to consider in this context that we do *not* always
> want ORTE to abort whenever it detects a process failure. This is the
> default mode for MPI applications (MPI_ERRORS_ARE_FATAL), and should
> be supported. But there is another mode in which we would like ORTE to
> keep running to conform with (MPI_ERRORS_RETURN):
> http://www.mpi-forum.org/docs/mpi-11-html/node148.html
>
> It is known that certain standards-conformant MPI "fault tolerant"
> programs do not work in Open MPI for various reasons, some in the
> runtime and some external. Here we are mostly talking about
> disconnected fates of intra-communicator groups. I have a test in the
> ompi-tests repository that illustrates this problem, but I do not have
> time to fix it at the moment.
>
>
> So, in short, keep the errmgr around for now. I suspect we will be
> using it, and possibly tweaking it, in the nearish future.
>
> Thanks for the observation.
>
> Cheers,
> Josh
>
> On Mar 6, 2008, at 10:44 AM, Ralph Castain wrote:
>
>> Hello
>>
>> I've been doing some work on fault response within the system, and
>> finally realized something I should probably have seen a while back.
>> Perhaps I am misunderstanding somewhere, so forgive the ignorance if
>> so.
>>
>> When we designed ORTE some time in the deep, dark past, we had
>> envisioned that people might want multiple ways of responding to
>> process faults and/or abnormal terminations. You might want to just
>> abort the job, attempt to restart just that proc, attempt to restart
>> the job, etc. To support these multiple options, and to provide a
>> means for people to simply try new ones, we created the errmgr
>> framework.
>>
>> Our thought was that a process and/or daemon would call the errmgr
>> when we detected something abnormal happening, and that the selected
>> errmgr component could then do whatever fault response was desired.
>>
>> However, I now see that the fault tolerance mechanisms inside of OMPI
>> do not seem to be using that methodology. Instead, we have hard-coded
>> a particular response into the system.
>>
>> If we configure without FT, we just abort the entire job since that
>> is the only errmgr component that exists.
>>
>> If we configure with FT, then we execute the hard-coded C/R
>> methodology. This is built directly into the code, so there is no
>> option as to what happens.
>>
>> Is there a reason why the errmgr framework was not used? Did the FT
>> team decide that this was not a useful tool to support multiple FT
>> strategies? Can we modify it to better serve those needs, or is it
>> simply not feasible?
>>
>> If it isn't going to be used for that purpose, then I might as well
>> remove it. As things stand, there really is no purpose served by the
>> errmgr framework - might as well replace it with just a function
>> call.
>>
>> Appreciate any insights
>> Ralph
>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
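
For readers following the design question in this thread, the idea of a selectable fault response (abort the job, restart just the failed proc, restart the whole job) can be sketched roughly as below. This is only an illustration in Python, not ORTE code: every name here (`ErrMgr`, `register`, `select`, `proc_aborted`, `abort_job`, `restart_proc`, `restart_job`) is hypothetical, and the real errmgr framework's interface may look quite different.

```python
# Hypothetical sketch of a pluggable fault-response framework.
# None of these names are taken from ORTE; they exist only to
# illustrate "select one response component, call it on failure".


def abort_job(job, failed_rank):
    """'All fall down': terminate every surviving process, restart nothing."""
    return {"terminate": [r for r in job if r != failed_rank], "restart": []}


def restart_proc(job, failed_rank):
    """Restart only the failed process and leave the others running."""
    return {"terminate": [], "restart": [failed_rank]}


def restart_job(job, failed_rank):
    """Terminate the survivors, then restart the whole job (e.g. from a checkpoint)."""
    return {"terminate": [r for r in job if r != failed_rank], "restart": list(job)}


class ErrMgr:
    """Holds the available response components and the one currently selected."""

    def __init__(self):
        self._components = {}
        self._selected = None

    def register(self, name, response):
        # Make a response component available under a name.
        self._components[name] = response

    def select(self, name):
        # Choose which registered component handles failures.
        if name not in self._components:
            raise KeyError("unknown errmgr component: " + name)
        self._selected = name

    def proc_aborted(self, job, failed_rank):
        # Called by whoever detects the failure; dispatches to the selection.
        if self._selected is None:
            raise RuntimeError("no errmgr component selected")
        return self._components[self._selected](job, failed_rank)


if __name__ == "__main__":
    mgr = ErrMgr()
    mgr.register("abort", abort_job)
    mgr.register("restart_proc", restart_proc)
    mgr.register("restart_job", restart_job)

    mgr.select("restart_proc")
    print(mgr.proc_aborted([0, 1, 2, 3], failed_rank=2))
```

The point of the sketch is only that the code detecting the failure calls one entry point and never needs to know which policy is active, which is the property the hard-coded abort and C/R paths described above give up.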