On 4/13/2012 6:40 PM, Ralph Castain wrote:
Did you have the param set? I found some missing code in the orted
errmgr that contributed to it, but unless you had set the param in
your test, there is no way it would abort no matter how many procs
exit with non-zero status.
Is it a bug that mpirun sticks around after all procs have gone? If not, then
what is the use of leaving mpirun hanging around?
I'm guessing you have that param set in your test because we earlier
defined the default as "no abort". I'm content to leave it there, but
wanted to ensure your tests ran clean.
I don't believe we are setting the env-var, which is why I think we have
a regression. It also seems very suspicious to me that both Oracle and
IU are seeing the same condition in MTT. I'll look into this more on
Monday.
--td
On Apr 13, 2012, at 4:32 PM, TERRY DONTJE wrote:
I could see that if fewer than N processes exit with a non-zero exit code,
ORTE may choose not to abort the job. However, if all N processes have
exited or aborted, I expect everything to clean up and mpirun to exit.
It does not do that at the moment, and I think that is what is causing
most of the hangs in the MTT trunk runs. Those hangs did not occur prior
to this week.
--td
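The expectation described above can be written down as a small decision rule. This is only a sketch of the policy as stated in this thread, not ORTE's actual errmgr code; the names JobDecision, decide, and abort_on_non_zero are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class JobDecision:
    abort_job: bool       # kill the procs that are still running
    launcher_exits: bool  # mpirun itself should clean up and return


def decide(exit_codes: Sequence[Optional[int]],
           abort_on_non_zero: bool) -> JobDecision:
    """Hypothetical model of the policy discussed in this thread.

    exit_codes[i] is None while proc i is still running, otherwise it
    holds that proc's exit status.
    """
    all_done = all(code is not None for code in exit_codes)
    any_failed = any(code not in (None, 0) for code in exit_codes)

    # Aborting only makes sense while some procs are still alive.
    abort = abort_on_non_zero and any_failed and not all_done

    # Once every proc has exited (cleanly or not), nothing is left to
    # wait for, so the launcher should exit regardless of the param.
    return JobDecision(abort_job=abort, launcher_exits=all_done or abort)
```

With the param off, decide([0, 1, None], False) keeps the job running, while decide([1, 1, 1], False) still reports that the launcher should exit, which is the case that currently hangs.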
On 4/13/2012 5:18 PM, Ralph Castain wrote:
This has come up again because some of the MTT tests depend on a specific
behavior when a process exits with a non-zero status - in this case, they
expect ORTE to abort the job. At some point, the default had been switched to
NOT abort the job if a process exited with a non-zero status.
So I'll throw this out to the community: if any process exits with a non-zero
status, should ORTE abort the job?
I don't personally care, but we ought to decide on something. In the meantime,
I will set the default so we DO abort, thus allowing the MTT runs to complete
correctly.
FWIW: the MCA param orte_abort_non_zero_exit can always be set to control this
behavior.
Ralph
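For test harnesses that want to pin this behavior rather than rely on the default, a minimal sketch of the two usual ways to pass an MCA param follows: the --mca option on the mpirun command line, or an OMPI_MCA_-prefixed environment variable. The param name is copied from this thread and should be checked against ompi_info for the build in use; the helper names mpirun_command and mca_environment are hypothetical.

```python
import os
from typing import Dict, List, Mapping, Optional

# Name as written in this thread; verify it with ompi_info before relying on it.
PARAM = "orte_abort_non_zero_exit"


def mpirun_command(app: List[str], nprocs: int, abort: bool) -> List[str]:
    """Build an mpirun argv that sets the param via --mca <name> <value>."""
    value = "1" if abort else "0"
    return ["mpirun", "--mca", PARAM, value, "-np", str(nprocs), *app]


def mca_environment(abort: bool,
                    base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with OMPI_MCA_<name> set."""
    env = dict(os.environ if base is None else base)
    env["OMPI_MCA_" + PARAM] = "1" if abort else "0"
    return env
```

Either result can be handed to subprocess.run, for example subprocess.run(mpirun_command(["./a.out"], 4, True)), so that a test states explicitly which behavior it expects.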
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com