On Apr 19, 2006, at 4:15 PM, Greg Watson wrote:
> We've just run across a rather tricky issue. We're calling opal_event_loop() to dispatch orte events to an orted that has been launched separately. However, if the orted dies for some reason (gets a signal or whatever), then opal_event_loop() calls exit(). Needless to say, this is not good behavior for us. Any suggestions on how to get around this problem?
Is the orted you are connecting to the "seed" daemon? I think the only time we should be exiting like that is if the orted was the seed daemon. I'm not sure what we want to do if that's the case -- it looks like we're calling errmgr.abort() when badness happens. I wonder if your application can provide its own errmgr component whose abort doesn't actually abort? Just some off-the-cuff ideas -- Ralph could probably give a better idea of exactly what is happening...
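To make the "abort that doesn't actually abort" idea concrete, here is a rough, self-contained sketch of the general pattern: an error manager exposed as a function pointer that the application can swap out. This is NOT the real ORTE errmgr interface. Every name below (sketch_errmgr_t, sketch_set_abort, sketch_event_loop, soft_abort) is made up for illustration, and the real component would have to follow whatever the errmgr framework actually requires.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical error-manager interface: a single replaceable abort hook.
 * This only illustrates the pattern; it is not ORTE's errmgr API. */
typedef struct {
    void (*abort_fn)(int status);
} sketch_errmgr_t;

/* Default behavior: terminate the whole process, as described in the thread. */
static void default_abort(int status)
{
    exit(status);
}

/* State recorded by the non-aborting replacement. */
static int abort_requested = 0;
static int last_abort_status = 0;

/* Application-supplied replacement: note the failure and return. */
static void soft_abort(int status)
{
    abort_requested = 1;
    last_abort_status = status;
    fprintf(stderr, "errmgr: abort requested (status %d), not exiting\n", status);
}

static sketch_errmgr_t sketch_errmgr = { default_abort };

/* Install a different abort hook (hypothetical registration call). */
static void sketch_set_abort(void (*fn)(int status))
{
    assert(fn != NULL);
    sketch_errmgr.abort_fn = fn;
}

/* Stand-in for an event loop that notices the daemon has died and reports
 * it through the error manager instead of calling exit() directly. */
static int sketch_event_loop(int daemon_alive)
{
    if (!daemon_alive) {
        sketch_errmgr.abort_fn(1);
        return -1;  /* only reached when the hook returns */
    }
    return 0;
}
```

An application would call sketch_set_abort(soft_abort) once at startup; after that, a dead daemon makes sketch_event_loop() return -1 with abort_requested set, and the caller decides what to do next instead of being killed.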
Brian

-- 
Brian Barrett
Open MPI developer
http://www.open-mpi.org/