On Aug 25, 2007, at 4:32 PM, Jeff Squyres wrote:
> I unfortunately do not remember whether I put that recursive
> protection in to fix a real problem or whether I was trying to be
> (incorrectly) proactive...
The more I think about this, the more I think I put that protection
in because of a real problem.
I don't remember the specifics, but I have a distinct recollection of
ompi_mpi_abort() being called, and then either
orte_errmgr.abort_procs_request() or orte_errmgr.error_detected()
eventually calling progress(), which then triggered some other error,
so ompi_mpi_abort() ended up getting called again.
In this scenario, both sleeping forever in sleep() *and* calling exit()
are bad.
What to do? Even calling progress() in a loop may not do the Right
Thing in the recursive case because some processing may not occur
until control returns all the way up to the top of the progress()
stack.
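
To make the shape of the problem concrete, here is a minimal,
self-contained sketch of the re-entry path described above, with a
guard that simply returns from the nested call so that control can
unwind to the outermost one. This is not the Open MPI code: all of the
names (sketch_abort, sketch_progress, and the counters) are
hypothetical stand-ins, and whether a plain return is actually safe in
the real code path is exactly the open question.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins; none of these are Open MPI symbols. */
static bool abort_in_progress = false; /* the recursion guard        */
static int  abort_entries = 0;         /* calls that did real work   */
static int  nested_calls_rejected = 0; /* calls stopped by the guard */
static int  first_errcode = 0;         /* errcode of the outer call  */

static void sketch_abort(int errcode);

/* Stand-in for progress(): pretend that progressing detects a second
 * error, which re-enters the abort path. */
static void sketch_progress(void)
{
    sketch_abort(2);
}

/* Stand-in for the abort routine. A nested call neither sleeps nor
 * exits; it records itself and returns so the stack can unwind. */
static void sketch_abort(int errcode)
{
    if (abort_in_progress) {
        ++nested_calls_rejected;
        return;
    }
    abort_in_progress = true;
    ++abort_entries;
    first_errcode = errcode;

    /* Stand-in for the error manager call that ends up in progress(). */
    sketch_progress();

    /* Only the outermost call gets here, after the nested call has
     * returned all the way back up. */
    fprintf(stderr, "abort handled once, errcode %d\n", first_errcode);
}
```

Calling sketch_abort(1) once does the real work a single time; the
nested call made from sketch_progress() is counted and dropped rather
than blocking or exiting underneath the outer call.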
Note that as I stated in my first mail, since the proxy errmgr
component is always selected in MPI processes,
orte_errmgr.error_detected() will not return -- it eventually calls
exit().
--
Jeff Squyres
Cisco Systems