I ran into something this week that I think may require consideration by the 
MPI Forum. Specifically, Rolf found a problem in their MTT runs where the tests 
expect mpirun to return a non-zero exit status because one or more application 
processes did so, even though all application procs terminate normally.

I jury-rigged a simple algo that has mpirun return the exit status of the 
lowest rank that returned non-zero in the case where the job terminated 
normally. We still return the exit code of the first process to abnormally 
terminate (i.e., the process that is first reported to the HNP, which may not 
be the first process that aborted).
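
For concreteness, the selection rule described above could be sketched roughly 
as follows. This is Python purely for illustration, not the actual mpirun code; 
ProcResult, its fields, and select_exit_status are hypothetical names, and I am 
assuming the HNP records the order in which it learns of each termination.

```python
from typing import NamedTuple, Sequence


class ProcResult(NamedTuple):
    """Hypothetical per-process record; not an actual OMPI structure."""
    rank: int
    exit_code: int
    aborted: bool       # True if the process terminated abnormally
    report_order: int   # order in which the HNP learned of the termination


def select_exit_status(results: Sequence[ProcResult]) -> int:
    """Pick a single exit status for the whole job."""
    # Abnormal termination wins: use the first process reported to the HNP,
    # which is not necessarily the first process that actually aborted.
    aborted = [r for r in results if r.aborted]
    if aborted:
        return min(aborted, key=lambda r: r.report_order).exit_code

    # All processes terminated normally: use the lowest rank that
    # returned a non-zero exit code.
    nonzero = [r for r in results if r.exit_code != 0]
    if nonzero:
        return min(nonzero, key=lambda r: r.rank).exit_code

    # Every process returned zero.
    return 0
```

Swapping min for max in the second branch would give the "highest rank" 
variant instead.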

However, it raises the question - what is the actual behavior supposed to be in 
the case where all procs terminate normally, but some may return (possibly 
different) non-zero codes?

I asked a few MPI users, and got a different answer from every one of them. 
The only consistent response I got was that the MPI standard doesn't say what 
should happen (can someone confirm that?).

Here is a sampling of the responses:

1. return the exit status of the lowest rank that returned non-zero (which I 
implemented for now to silence Rolf's MTT problem)

2. return the exit status of the highest rank that returned non-zero

3. print out a histogram of exit statuses
   - ranks 0-9: 0
   - ranks 10-21,110: 1
   - ranks 22-35,40-51: 2
   ...

4. print out ALL the exit statuses

5. ignore it - mpirun's exit code should only reflect OMPI internals. It is the 
app developer's responsibility to properly deal with non-zero exit conditions 
(e.g., by calling MPI_Abort).
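
To give a feel for what the histogram in option 3 would involve, here is a 
minimal sketch, again in illustrative Python rather than anything in the code 
base. The function names and the input format (a dict mapping rank to exit 
code) are assumptions of mine; the output lines follow the format shown in 
option 3.

```python
from collections import defaultdict
from typing import Dict, List


def compress_ranks(ranks: List[int]) -> str:
    """Turn a non-empty list of ranks into a string such as '10-21,110'."""
    ranks = sorted(ranks)
    parts = []
    start = prev = ranks[0]
    for r in ranks[1:]:
        if r == prev + 1:
            # Still inside a run of consecutive ranks.
            prev = r
            continue
        # The run ended; emit it as a single rank or a range.
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = r
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def exit_status_histogram(codes_by_rank: Dict[int, int]) -> List[str]:
    """Group ranks by exit code, one output line per distinct code."""
    by_code = defaultdict(list)
    for rank, code in codes_by_rank.items():
        by_code[code].append(rank)
    return [f"ranks {compress_ranks(by_code[c])}: {c}" for c in sorted(by_code)]
```

Printing ALL the exit statuses (option 4) is the same data without the 
grouping step.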

When I circled back around with these alternatives, I got the expected answer: 
everyone felt that all of them were good, and wanted a cmd line option to 
select the behavior for their job. They also noted that --xml should cause any 
of them to output in a defined xml format.

As I told Rolf, I honestly don't care what we do in this case. All I ask for is 
a clearly defined behavior so I don't get yanked in multiple directions, 
constantly circling around from one solution to the next.

So if the MPI standard doesn't specify this behavior, could someone involved in 
the Forum -please- get it to address this??

In the interim, what do -we- think it should do?

Thanks
Ralph

