Re: [OMPI devel] Fault tolerance

Josh Hursey Thu, 6 Mar 2008 11:13:42 -0500

The checkpoint/restart work that I have integrated does not respond to failure at the moment. If a failure happens, I want ORTE to terminate the entire job. I will then restart the entire job from a checkpoint file. This follows the 'all fall down' approach that users typically expect when using a global C/R technique.

Eventually I want to integrate something better where I can respond to a failure with a recovery from inside ORTE. I'm not there yet, but hopefully in the near future.

I'll let the UTK group talk about what they are doing with ORTE, but I suspect they will be taking advantage of the errmgr to help respond to failure and restart a single process.

It is important to consider in this context that we do *not* always want ORTE to abort whenever it detects a process failure. Aborting is the default mode for MPI applications (MPI_ERRORS_ARE_FATAL), and should be supported. But there is another mode in which we would like ORTE to keep running, to conform with MPI_ERRORS_RETURN:

 http://www.mpi-forum.org/docs/mpi-11-html/node148.html

It is known that certain standards-conformant MPI "fault tolerant" programs do not work in Open MPI for various reasons, some in the runtime and some external. Here we are mostly talking about disconnected fates of intra-communicator groups. I have a test in the ompi-tests repository that illustrates this problem, but I do not have time to fix it at the moment.

So, in short, keep the errmgr around for now. I suspect we will be using it, and possibly tweaking it, in the nearish future.


Thanks for the observation.

Cheers,
Josh

On Mar 6, 2008, at 10:44 AM, Ralph Castain wrote:

Hello
I've been doing some work on fault response within the system, and finally realized something I should probably have seen a while back. Perhaps I am misunderstanding somewhere, so forgive the ignorance if so.
When we designed ORTE some time in the deep, dark past, we had envisioned that people might want multiple ways of responding to process faults and/or abnormal terminations. You might want to just abort the job, attempt to restart just that proc, attempt to restart the job, etc. To support these multiple options, and to provide a means for people to simply try new ones, we created the errmgr framework.
Our thought was that a process and/or daemon would call the errmgr when we detected something abnormal happening, and that the selected errmgr component could then do whatever fault response was desired.
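
As a rough illustration of the design described above, here is a minimal sketch in C of a pluggable fault-response table. It is only a sketch under stated assumptions: every name in it (fault_action_t, errmgr_component_t, errmgr_select, handle_fault, and the three component names) is hypothetical and is not the actual ORTE errmgr interface. It only shows the idea of selecting one component by name and routing every detected fault through it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names throughout: NOT the real ORTE errmgr API. */

/* The responses mentioned in the text: abort the job, restart one
 * process, or restart the whole job. */
typedef enum {
    FAULT_ABORT_JOB,
    FAULT_RESTART_PROC,
    FAULT_RESTART_JOB
} fault_action_t;

/* A component is a name plus the callback invoked on a process fault. */
typedef struct {
    const char *name;
    fault_action_t (*proc_failed)(int job, int rank);
} errmgr_component_t;

static fault_action_t abort_job(int job, int rank)
{
    printf("job %d: rank %d failed, aborting the job\n", job, rank);
    return FAULT_ABORT_JOB;
}

static fault_action_t restart_proc(int job, int rank)
{
    printf("job %d: rank %d failed, restarting that process\n", job, rank);
    return FAULT_RESTART_PROC;
}

static fault_action_t restart_job(int job, int rank)
{
    printf("job %d: rank %d failed, restarting the job\n", job, rank);
    return FAULT_RESTART_JOB;
}

/* Entry 0 doubles as the default when the requested name is unknown. */
static const errmgr_component_t components[] = {
    { "abort",        abort_job    },
    { "restart_proc", restart_proc },
    { "restart_job",  restart_job  },
};

/* Pick a component by name; fall back to "abort" if there is no match. */
static const errmgr_component_t *errmgr_select(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof(components) / sizeof(components[0]); i++) {
        if (strcmp(components[i].name, name) == 0) {
            return &components[i];
        }
    }
    return &components[0];
}

/* What a process or daemon would call when it detects a fault. */
static fault_action_t handle_fault(const char *component, int job, int rank)
{
    return errmgr_select(component)->proc_failed(job, rank);
}
```

In this sketch, calling handle_fault("restart_proc", 1, 3) dispatches to the restart_proc callback, while an unrecognized component name falls back to aborting the job. Trying a new fault response means adding one entry to the table rather than touching the detection code.
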
However, I now see that the fault tolerance mechanisms inside of OMPI do not seem to be using that methodology. Instead, we have hard-coded a particular response into the system.
If we configure without FT, we just abort the entire job since that is the only errmgr component that exists.
If we configure with FT, then we execute the hard-coded C/R methodology. This is built directly into the code, so there is no option as to what happens.
Is there a reason why the errmgr framework was not used? Did the FT team decide that this was not a useful tool to support multiple FT strategies? Can we modify it to better serve those needs, or is it simply not feasible?
If it isn't going to be used for that purpose, then I might as well remove it. As things stand, there really is no purpose served by the errmgr framework - might as well replace it with just a function call.

Appreciate any insights
Ralph


