Re: [OMPI devel] Fault tolerance

Ralph Castain Thu, 6 Mar 2008 11:17:53 -0500

Ah - ok, thanks for clarifying! I'm happy to leave it around, but wasn't
sure if/where it fit into anyone's future plans.


Thanks
Ralph



On 3/6/08 9:13 AM, "Josh Hursey" <jjhur...@open-mpi.org> wrote:

> The checkpoint/restart work that I have integrated does not respond to
> failure at the moment. If a failure happens, I want ORTE to terminate
> the entire job. I will then restart the entire job from a checkpoint
> file. This follows the 'all fall down' approach that users typically
> expect when using a global C/R technique.
> 
> Eventually I want to integrate something better where I can respond to
> a failure with a recovery from inside ORTE. I'm not there yet, but
> hopefully in the near future.
> 
> I'll let the UTK group talk about what they are doing with ORTE, but I
> suspect they will be taking advantage of the errmgr to help respond to
> failure and restart a single process.
> 
> 
> It is important to consider in this context that we do *not* always
> want ORTE to abort whenever it detects a process failure. This is the
> default mode for MPI applications (MPI_ERRORS_ARE_FATAL), and should
> be supported. But there is another mode in which we would like ORTE to
> keep running, to conform with MPI_ERRORS_RETURN:
>   http://www.mpi-forum.org/docs/mpi-11-html/node148.html
> 
> It is known that certain standards-conformant MPI "fault tolerant"
> programs do not work in Open MPI for various reasons, some in the
> runtime and some external. Here we are mostly talking about
> disconnected fates of intra-communicator groups. I have a test in the
> ompi-tests repository that illustrates this problem, but I do not have
> time to fix it at the moment.
> 
> 
> So, in short, keep the errmgr around for now. I suspect we will be using
> it, and possibly tweaking it in the nearish future.
> 
> Thanks for the observation.
> 
> Cheers,
> Josh
> 
> On Mar 6, 2008, at 10:44 AM, Ralph Castain wrote:
> 
>> Hello
>> 
>> I've been doing some work on fault response within the system, and
>> finally realized something I should probably have seen awhile back.
>> Perhaps I am misunderstanding somewhere, so forgive the ignorance if
>> so.
>> 
>> When we designed ORTE some time in the deep, dark past, we had
>> envisioned that people might want multiple ways of responding to
>> process faults and/or abnormal terminations. You might want to just
>> abort the job, attempt to restart just that proc, attempt to restart
>> the job, etc. To support these multiple options, and to provide a
>> means for people to simply try new ones, we created the errmgr
>> framework.
>> 
>> Our thought was that a process and/or daemon would call the errmgr
>> when we detected something abnormal happening, and that the selected
>> errmgr component could then do whatever fault response was desired.
>> 
>> However, I now see that the fault tolerance mechanisms inside of OMPI
>> do not seem to be using that methodology. Instead, we have hard-coded
>> a particular response into the system.
>> 
>> If we configure without FT, we just abort the entire job since that
>> is the only errmgr component that exists.
>> 
>> If we configure with FT, then we execute the hard-coded C/R
>> methodology. This is built directly into the code, so there is no
>> option as to what happens.
>> 
>> Is there a reason why the errmgr framework was not used? Did the FT
>> team decide that this was not a useful tool to support multiple FT
>> strategies? Can we modify it to better serve those needs, or is it
>> simply not feasible?
>> 
>> If it isn't going to be used for that purpose, then I might as well
>> remove it. As things stand, there really is no purpose served by the
>> errmgr framework - might as well replace it with just a function
>> call.
>> 
>> Appreciate any insights
>> Ralph
>> 
>> 
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
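
For readers following the thread, here is a minimal sketch of the idea
Ralph describes: a failure detector calls into a framework, and the
selected component decides the response (abort the job, restart just
that proc, or restart the job). Every name below (errmgr_component_t,
errmgr_select, errmgr_report_failure, the RESP_* values, and the
component names) is hypothetical and chosen for illustration only. This
is not the actual ORTE errmgr interface, and the sketch only returns a
decision rather than acting on it.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical illustration only; NOT the real ORTE errmgr API. */

/* The fault responses mentioned in the thread. */
typedef enum {
    RESP_ABORT_JOB,     /* "all fall down": terminate the entire job */
    RESP_RESTART_PROC,  /* try to restart only the failed process */
    RESP_RESTART_JOB    /* restart the whole job, e.g. from a checkpoint */
} response_t;

/* A component is a name plus a callback that picks the response. */
typedef struct {
    const char *name;
    response_t (*proc_failed)(int job_id, int rank);
} errmgr_component_t;

static response_t abort_policy(int job_id, int rank)
{
    (void)job_id; (void)rank;
    return RESP_ABORT_JOB;
}

static response_t restart_proc_policy(int job_id, int rank)
{
    (void)job_id; (void)rank;
    return RESP_RESTART_PROC;
}

static response_t restart_job_policy(int job_id, int rank)
{
    (void)job_id; (void)rank;
    return RESP_RESTART_JOB;
}

/* Registry of available components; the first entry is the default. */
static const errmgr_component_t components[] = {
    { "abort",        abort_policy },
    { "restart_proc", restart_proc_policy },
    { "restart_job",  restart_job_policy },
};

/* Select a component by name, falling back to the default (abort). */
static const errmgr_component_t *errmgr_select(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof components / sizeof components[0]; i++) {
        if (strcmp(components[i].name, name) == 0) {
            return &components[i];
        }
    }
    return &components[0];
}

/* What a process or daemon would call when it detects a failure. */
static response_t errmgr_report_failure(const char *component_name,
                                        int job_id, int rank)
{
    const errmgr_component_t *c = errmgr_select(component_name);
    response_t r = c->proc_failed(job_id, rank);
    printf("job %d rank %d failed: component '%s' chose response %d\n",
           job_id, rank, c->name, (int)r);
    return r;
}
```

In this sketch a caller would report a failure with something like
errmgr_report_failure("restart_proc", 1, 3) and then act on the returned
value. Keeping the policy behind a function pointer is what would let a
new response strategy be tried without touching the detection code.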
