On 4/13/2012 6:40 PM, Ralph Castain wrote:
Did you have the param set? I found some missing code in the orted errmgr that contributed to it, but unless you had set the param in your test, there is no way it would abort no matter how many procs exit with non-zero status.

Is it a bug that mpirun sticks around after all procs have gone? If not, then what is the use of leaving mpirun hanging around?
I'm guessing you have that param set in your test because we earlier defined the default as "no abort". I'm content to leave it there, but wanted to ensure your tests ran clean.

I don't believe we are setting the env-var, which is why I think we have a regression. It also seems very suspicious to me that both Oracle and IU are seeing the same condition in MTT. I'll look into this more on Monday.

--td

On Apr 13, 2012, at 4:32 PM, TERRY DONTJE wrote:

I could see that if fewer than N processes exit with a non-zero exit code, ORTE may choose not to abort the job. However, if all N processes have exited or aborted, I expect everything to clean up and mpirun to exit. It does not do that at the moment, which I think is what is causing most of the hangs in the MTT trunk runs; those hangs did not occur prior to this week.

--td

On 4/13/2012 5:18 PM, Ralph Castain wrote:
This has come up again because some of the MTT tests depend on a specific 
behavior when a process exits with a non-zero status - in this case, they 
expect ORTE to abort the job. At some point, the default had been switched to 
NOT abort the job if a process exited with a non-zero status.

So I'll throw this out to the community: if any process exits with a non-zero 
status, should ORTE abort the job?

I don't personally care, but we ought to decide on something. In the meantime, 
I will set the default so we DO abort, thus allowing the MTT runs to complete 
correctly.

FWIW: the MCA param orte_abort_non_zero_exit can always be set to control this 
behavior.

Ralph


_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com <mailto:terry.don...@oracle.com>


