Re: [OMPI devel] [devel-core] [RFC] Exit without finalize

Jeff Squyres Tue, 11 Sep 2007 13:03:32 -0400

On Sep 8, 2007, at 2:33 PM, Aurelien Bouteiller wrote:

I agree (b) is not a good idea. However, I am not very pleased by (a)
either. It totally prevents any process fault-tolerance mechanism if
we go that way. If we plan to add some failure detection mechanism to
the RTE and failure management (to keep Finalize from hanging), we
should add the ability to plug in FT-specific error handlers. The
default error handler should do exactly what is proposed by Ralph, but
nowhere else (than in this handler) should the RTE code assume that
the application is aborting when a failure occurs. If it is an FT
application, it might just not abort and recover.


(b) sounds fine to me.

If you genericize the concept, I think it's compatible with FT:

1. during MPI_INIT, one of the MPI processes can request a "notify"
exit pattern for the job: a process must notify the RTE before it
actually exits (i.e., some ORTE notification during MPI_FINALIZE).
If a process exits before notifying the RTE, it's an error.


1a. The default action upon error can be to kill the entire job.

1b. If you desire plug-in-able error actions (e.g., not kill the
entire job), I'm *assuming* that our plugin frameworks can handle
that...?

2. for an FT MPI job, I assume that the MPI processes would either
not perform step 1 (i.e., the default action upon process exit is
nothing -- just like if you had run "mpirun -np 4 hostname"), or you
would select a specific action upon error/plugin for what to do when
a process exits without first notifying the RTE.
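
To make the pattern above concrete, here is a rough sketch in Python (chosen only for brevity). Everything in it is hypothetical: the Job class, the ExitPattern values, and the kill_job / keep_running actions are made-up names for illustration, not ORTE or Open MPI API. It only models the bookkeeping: a job may opt in to the "notify" pattern, an exit without prior notification is treated as an error, and the error action is pluggable with "kill the job" as the default.

```python
from enum import Enum
from typing import Callable


class ExitPattern(Enum):
    NONE = "none"      # exits are unremarkable, as with "mpirun -np 4 hostname"
    NOTIFY = "notify"  # a process must notify the RTE before it exits


def kill_job(job: "Job", rank: int) -> None:
    """Default error action (1a): take down every remaining process."""
    job.alive.clear()
    job.log.append(f"rank {rank} exited without notifying; job killed")


def keep_running(job: "Job", rank: int) -> None:
    """Hypothetical FT error action (1b): record the failure, keep survivors."""
    job.log.append(f"rank {rank} failed; survivors continue")


class Job:
    """Toy model of the RTE's per-job state (illustrative assumption only)."""

    def __init__(self, nprocs: int,
                 on_error: Callable[["Job", int], None] = kill_job) -> None:
        self.pattern = ExitPattern.NONE
        self.on_error = on_error
        self.alive = set(range(nprocs))
        self.notified = set()
        self.log = []

    def request_notify_pattern(self) -> None:
        """Step 1: some process asks for the "notify" pattern during init."""
        self.pattern = ExitPattern.NOTIFY

    def notify_exit(self, rank: int) -> None:
        """What a finalize call would trigger: announce the upcoming exit."""
        self.notified.add(rank)

    def process_exited(self, rank: int) -> None:
        """Called when the RTE observes that a process is gone."""
        if rank not in self.alive:
            return
        self.alive.discard(rank)
        if self.pattern is ExitPattern.NOTIFY and rank not in self.notified:
            # Exit without prior notification: run the selected error action.
            self.on_error(self, rank)
```

With this model, a job that never calls request_notify_pattern() behaves as in step 2 (process exits are simply ignored), while an FT job could either skip step 1 or pass keep_running (or some other action) as on_error.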


Howzat?

--
Jeff Squyres
Cisco Systems
