On Apr 14 2011, Ralph Castain wrote:
> I've run across an interesting issue for which I don't have a ready answer.
> If an MPI process aborts, we automatically abort the entire job.
> If an MPI process returns a non-zero exit status, indicating that there
> was something abnormal about its termination, we ignore it and let the
> job continue. We do print an error message out upon completion of the
> job, but we don't terminate the job upon receiving the non-zero status.
> Note that non-zero status is considered a "standard" method of indicating
> abnormal termination, though no meaning has been agreed upon for the
> specific value.
Not really. See below.
> Should we be allowing the job to continue in that circumstance? In the
> case I'm reviewing, the user's code indicates there is an error in the
> result. Since he has already called MPI_Finalize, he can't call
> MPI_Abort, and his system won't allow him to drop cores by calling
> "abort". So the exit status is his only way of indicating "abnormal
> termination".
> Obviously, in this case, he would prefer the job terminate as nothing
> useful is going to be accomplished - so no point in tying up the machine.
> Thoughts?
Blame Unix. Seriously. Many or most mainframes had the following
categories:
    Complete success - or, rather, a failure to detect an error :-)
    Partial success, with warnings of potential problems
    Failure that was diagnosed and partially cleaned-up
    Heap horrible failure - all bets are off
Unix has no such categorisation. The distinction between a zero return
and other values can occur at any point, and some programs even use them
as flags. It's hopeless, and whatever you do will be wrong for many
people. I have no idea what Microsoft do, but assume that it has copied
Unix, as that is its SOP. I recommend NOT rocking this boat.
He might do better by calling abort after MPI_Finalize, but that's
pretty iffy - just like all other approaches. Improving on this would
need a new function, or a new argument to MPI_Finalize.
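
For what it's worth, here is a minimal sketch in plain POSIX C (no MPI
calls at all) of what a launcher can actually observe when a child
terminates. It distinguishes only three things: a zero exit, a non-zero
exit, and death by signal such as the SIGABRT raised by abort(). The
names run_and_classify, finish_ok, finish_nonzero, finish_abort and
enum outcome are made up for illustration; this is an assumption about
how one might test the behaviour, not how any MPI runtime is written.

```c
/* Sketch: how a parent on a POSIX system sees a child's termination. */
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical categories, not taken from any MPI implementation. */
enum outcome { OUTCOME_OK, OUTCOME_NONZERO, OUTCOME_SIGNALLED, OUTCOME_UNKNOWN };

/* Stand-ins for a process that has already done its MPI cleanup. */
static void finish_ok(void)      { _exit(0); }
static void finish_nonzero(void) { _exit(3); }  /* the value 3 has no agreed meaning */
static void finish_abort(void)   { abort(); }   /* raises SIGABRT */

/* Run `body` in a child process and classify how the child ended. */
static enum outcome run_and_classify(void (*body)(void))
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return OUTCOME_UNKNOWN;          /* fork itself failed */
    if (pid == 0) {
        body();                          /* child: not expected to return */
        _exit(0);
    }
    if (waitpid(pid, &status, 0) != pid)
        return OUTCOME_UNKNOWN;

    if (WIFSIGNALED(status))             /* killed by a signal, e.g. abort() */
        return OUTCOME_SIGNALLED;
    if (WIFEXITED(status))               /* normal exit: only zero/non-zero is clear */
        return WEXITSTATUS(status) == 0 ? OUTCOME_OK : OUTCOME_NONZERO;
    return OUTCOME_UNKNOWN;
}
```

Calling run_and_classify(finish_nonzero) and run_and_classify(finish_abort)
from a small main shows the point: the parent can tell the two apart, but
nothing in the status says what a non-zero value was meant to signify.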
Regards,
Nick Maclaren.