On Sep 8, 2007, at 2:33 PM, Aurelien Bouteiller wrote:

I agree (b) is not a good idea. However, I am not very pleased by (a)
either. It totally prevents any process fault-tolerance (FT) mechanism if
we go that way. If we plan to add some failure detection mechanism to the
RTE and failure management (to avoid Finalize hanging), we should add
the ability to plug in FT-specific error handlers. The default error
handler should do exactly what is proposed by Ralph, but nowhere else
(other than in this handler) should the RTE code assume that the
application is aborting when a failure occurs. If it is an FT
application, it might not abort at all and instead recover.

(b) sounds fine to me.

If you genericize the concept, I think it's compatible with FT:

1. during MPI_INIT, one of the MPI processes can request a "notify" exit pattern for the job: a process must notify the RTE before it actually exits (i.e., some ORTE notification during MPI_FINALIZE). If a process exits before notifying the RTE, it's an error.

1a. The default action upon error can be to kill the entire job.
1b. If you desire plug-in-able error actions (e.g., not kill the entire job), I'm *assuming* that our plugin frameworks can handle that...?

2. for an FT MPI job, I assume that the MPI processes would either not perform step 1 (i.e., the default action upon process exit is nothing -- just like if you had run "mpirun -np 4 hostname"), or you would select a specific action upon error/plugin for what to do when a process exits without first notifying the RTE.
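
To make steps 1, 1a, 1b, and 2 concrete, here is a rough sketch of the bookkeeping in Python. It is a toy model only, not ORTE code: every name in it (`Job`, `request_notify`, `notify_exit`, `process_exited`, `kill_job`, `ft_continue`) is hypothetical and made up for illustration, and it assumes the RTE can tell when a process has exited.

```python
# Hypothetical sketch: none of these names come from ORTE / Open MPI.
from enum import Enum


class ExitPattern(Enum):
    NONE = "none"      # process exit is not an error (like "mpirun -np 4 hostname")
    NOTIFY = "notify"  # a process must notify the RTE before it exits


def kill_job(job, rank):
    """Default error action (1a): abort the entire job."""
    job.alive.clear()
    job.aborted = True


def ft_continue(job, rank):
    """Example plug-in action (1b): keep survivors running, record the failure."""
    job.failed.add(rank)


class Job:
    def __init__(self, nprocs):
        self.alive = set(range(nprocs))
        self.notified = set()
        self.failed = set()
        self.aborted = False
        self.pattern = ExitPattern.NONE  # step 2 default: do nothing on exit
        self.on_error = kill_job

    def request_notify(self, on_error=kill_job):
        """Step 1: requested during (the equivalent of) MPI_INIT."""
        self.pattern = ExitPattern.NOTIFY
        self.on_error = on_error

    def notify_exit(self, rank):
        """Sent during (the equivalent of) MPI_FINALIZE."""
        self.notified.add(rank)

    def process_exited(self, rank):
        """The RTE observes that a process has gone away."""
        self.alive.discard(rank)
        # Exiting without prior notification is an error under "notify".
        if self.pattern is ExitPattern.NOTIFY and rank not in self.notified:
            self.on_error(self, rank)
```

With the default handler, an un-notified exit takes down the whole job; with `request_notify(on_error=ft_continue)` the surviving processes stay alive and the failed rank is only recorded; and a job that never calls `request_notify` treats any exit as a non-event.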

Howzat?

--
Jeff Squyres
Cisco Systems
