On Aug 25, 2007, at 4:32 PM, Jeff Squyres wrote:

I unfortunately do not remember whether I put that recursive
protection in to fix a real problem or whether I was trying to be
(incorrectly) proactive...

The more I think about this, the more I think I put that protection in because of a real problem.

I don't remember the specifics, but I distinctly recall ompi_mpi_abort() being called, after which either orte_errmgr.abort_procs_request() or orte_errmgr.error_detected() eventually called progress(). That triggered some other error, and ompi_mpi_abort() ended up getting called again.

In this scenario, both an endless sleep() *and* calling exit() are bad.

What to do? Even calling progress() in a loop may not do the Right Thing in the recursive case, because some processing may not occur until control returns all the way up to the top of the progress() stack.
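
To make the recursive case concrete, here is a minimal, self-contained sketch of the call pattern I'm describing. Every name in it (sketch_abort, sketch_errmgr_request, sketch_progress, in_abort, recursive_entries, inject_error) is a hypothetical stand-in, not an actual OMPI/ORTE symbol. Returning from the nested call is only an assumption made so that control can unwind to the outer frame; I'm not claiming it is the right fix.

```c
#include <stdbool.h>

/* Hypothetical stand-ins only; none of these are real OMPI/ORTE symbols. */

static bool in_abort = false;       /* true while an abort frame is active   */
static int  recursive_entries = 0;  /* how often abort was re-entered        */
static bool inject_error = true;    /* make progress hit one extra error     */

static void sketch_abort(void);

/* Stand-in for progress(): handling an event trips over a second error. */
static void sketch_progress(void)
{
    if (inject_error) {
        inject_error = false;       /* fire once so the sketch terminates    */
        sketch_abort();             /* second error: abort is called again   */
    }
}

/* Stand-in for the errmgr request that eventually drives progress. */
static void sketch_errmgr_request(void)
{
    sketch_progress();
}

/* Stand-in for the abort function with a recursion guard. */
static void sketch_abort(void)
{
    if (in_abort) {
        /* Nested call: neither sleep forever nor exit here (assumption);
         * just record it and let control unwind to the outer frame. */
        recursive_entries++;
        return;
    }
    in_abort = true;
    sketch_errmgr_request();        /* may re-enter sketch_abort()           */
    in_abort = false;               /* outer frame finishes its own work     */
}
```

Calling sketch_abort() once walks through abort, errmgr, progress, and abort again; the nested call is counted in recursive_entries and returns, and only then does the outer frame get control back.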

Note that as I stated in my first mail, since the proxy errmgr component is always selected in MPI processes, orte_errmgr.error_detected() will not return -- it eventually calls exit().

--
Jeff Squyres
Cisco Systems
