Re: [OMPI devel] [OMPI svn] svn:open-mpi r25323

George Bosilca Wed, 19 Oct 2011 16:50:31 -0400

I don't know how you think the error codes work in Open MPI, so I'll take 
the liberty of describing it here so we all agree we're talking about the same 
thing.

The opal_strerror is a nice feature: it allows registering a range of error 
codes with a particular error converter. Every time you look up the meaning of 
a particular error code, the first converter whose range encloses that value 
translates it into an error string.
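
To make the lookup rule above concrete, here is a minimal sketch of a 
range-based converter registry. All names in it (register_converter, 
lookup_strerror, the two example converters and their ranges) are hypothetical 
and chosen for illustration; this is not the actual opal_strerror code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; not the real opal_strerror implementation. */
typedef const char *(*err_converter_fn)(int errnum);

struct err_range {
    int min;              /* most negative code handled (inclusive) */
    int max;              /* least negative code handled (inclusive) */
    err_converter_fn fn;  /* converter for codes in [min, max] */
};

#define MAX_CONVERTERS 8
static struct err_range converters[MAX_CONVERTERS];
static size_t num_converters = 0;

/* Register a converter for [min, max]; returns 0 on success, -1 if full. */
int register_converter(int min, int max, err_converter_fn fn)
{
    assert(min <= max && fn != NULL);
    if (num_converters == MAX_CONVERTERS) {
        return -1;
    }
    converters[num_converters].min = min;
    converters[num_converters].max = max;
    converters[num_converters].fn = fn;
    num_converters++;
    return 0;
}

/* The first registered converter whose range encloses errnum wins. */
const char *lookup_strerror(int errnum)
{
    for (size_t i = 0; i < num_converters; i++) {
        if (errnum >= converters[i].min && errnum <= converters[i].max) {
            return converters[i].fn(errnum);
        }
    }
    return "unrecognized error";
}

/* Two example converters, one per (hypothetical) layer. */
const char *low_layer_converter(int errnum)
{
    return (errnum == -1) ? "low layer: generic error"
                          : "low layer: other error";
}

const char *high_layer_converter(int errnum)
{
    return (errnum == -101) ? "high layer: bad request"
                            : "high layer: other error";
}
```

A caller would register each layer's range once at initialization, then call 
lookup_strerror with any code. A code that falls in no registered range comes 
back as "unrecognized error".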

Currently, only OPAL and ORTE use this directly. It worked at the OMPI 
level only because we mapped __all__ OMPI errors to OPAL or ORTE ones. This 
behavior didn't change after my patch: you can still use opal_strerror to get 
the error string for all OPAL/ORTE/OMPI errors.

There is a small "variation" for OMPI_ERR_REQUEST, the only truly 
OMPI-specific error code today. The OMPI error codes are actually inserted 
between the OPAL and the ORTE ones (there is a gap of 100 elements), so there 
is __no__ possible overlap right now. If at some point we add more than 100 
OMPI-level error codes, we should certainly revisit this.
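
As an illustration of the layout described above, the following sketch gives 
three layers consecutive, non-overlapping blocks of 100 negative codes. The 
EX_ names and the numeric values are assumptions made for this example; they 
are not the actual constants from the Open MPI headers.

```c
/* Hypothetical layout: each layer owns a block of 100 negative codes. */
#define EX_RANGE_SIZE 100

#define EX_LOW_BASE   0                              /* lowest layer */
#define EX_LOW_MAX    (EX_LOW_BASE - EX_RANGE_SIZE)  /* -100, exclusive */

#define EX_MID_BASE   EX_LOW_MAX                     /* middle layer */
#define EX_MID_FIRST  (EX_MID_BASE - 1)              /* -101, first own code */
#define EX_MID_MAX    (EX_MID_BASE - EX_RANGE_SIZE)  /* -200, exclusive */

#define EX_TOP_BASE   EX_MID_MAX                     /* last layer */
#define EX_TOP_MAX    (EX_TOP_BASE - EX_RANGE_SIZE)  /* -300, exclusive */

enum ex_layer { EX_LAYER_LOW, EX_LAYER_MID, EX_LAYER_TOP, EX_LAYER_NONE };

/* Each layer owns the codes in (MAX, BASE], so a code maps to one layer. */
enum ex_layer ex_layer_of(int code)
{
    if (code > EX_LOW_BASE) return EX_LAYER_NONE;  /* positive: not ours */
    if (code > EX_LOW_MAX)  return EX_LAYER_LOW;   /*    0 .. -99  */
    if (code > EX_MID_MAX)  return EX_LAYER_MID;   /* -100 .. -199 */
    if (code > EX_TOP_MAX)  return EX_LAYER_TOP;   /* -200 .. -299 */
    return EX_LAYER_NONE;
}
```

Because the codes are negative, new codes are obtained by subtracting from 
the base, and a layer that needs more than EX_RANGE_SIZE codes would spill 
into its neighbour's block.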

Now, as a result of my patch, there is a difference. One should not simply 
forward an ORTE code up the OMPI stack and hope it just works. Errors should 
be dealt with where they happen, and if that is not possible they should be 
translated into the error code of the current layer. Error propagation should 
be compartmentalized: an error has to be translated into a code that has a 
meaning at the OMPI level. The current patch should not prevent code that 
mixes error codes from working, as opal_strerror retains the same behavior as 
before. However, this coding practice should be avoided. I tried to clean the 
current code of such instances a few days ago in r25230.

Moreover, this is similar to how we deal with error codes between the OMPI 
and MPI layers, and it seems like a sane way to compose libraries. You deal 
with a specific layer's error code when you get it (usually right after the 
call to a function from that layer), not later on when you don't even know 
exactly what the execution path was.
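
A minimal sketch of that practice, translating at the layer boundary, could 
look like the following. The LOW_/UP_ codes and the functions are hypothetical 
stand-ins for a lower layer (such as ORTE) and an upper layer (such as OMPI); 
none of them are real Open MPI symbols or values.

```c
/* Hypothetical lower-layer codes (stand-ins for ORTE-level values). */
enum {
    LOW_SUCCESS             = 0,
    LOW_ERR_TIMEOUT         = -201,
    LOW_ERR_OUT_OF_RESOURCE = -202,
    LOW_ERR_UNREACH         = -203
};

/* Hypothetical upper-layer codes (stand-ins for OMPI-level values). */
enum {
    UP_SUCCESS             = 0,
    UP_ERROR               = -101,
    UP_ERR_TIMEOUT         = -102,
    UP_ERR_OUT_OF_RESOURCE = -103
};

/* Map a lower-layer code to a code that has meaning in the upper layer. */
int low_to_up(int low_rc)
{
    switch (low_rc) {
    case LOW_SUCCESS:             return UP_SUCCESS;
    case LOW_ERR_TIMEOUT:         return UP_ERR_TIMEOUT;
    case LOW_ERR_OUT_OF_RESOURCE: return UP_ERR_OUT_OF_RESOURCE;
    default:                      return UP_ERROR;  /* no finer equivalent */
    }
}

/* Stand-in for a lower-layer call; returns the code it is told to. */
int low_level_op(int simulated_rc)
{
    return simulated_rc;
}

/* Upper-layer function: checks the lower-layer result right after the
 * call and never lets a lower-layer code escape to its own callers. */
int up_function_b(int simulated_rc)
{
    int rc = low_level_op(simulated_rc);
    if (LOW_SUCCESS != rc) {
        return low_to_up(rc);
    }
    return UP_SUCCESS;
}
```

Only the codes with an explicit case keep their specific meaning for the 
caller. Everything else collapses to the generic UP_ERROR, so the mapping 
table is where one decides how much detail crosses the boundary.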

  george.

PS: I'll fix the +/- issue.

On Oct 19, 2011, at 14:09 , Jeff Squyres wrote:

> Oy, yes, that is bad -- we cannot have overlapping ORTE and OMPI error codes. 
> That seems like a very bad idea (in addition to the mixing of + and -).
> 
> For one thing, that breaks opal_strerror().  That, in itself, seems like a 
> dealbreaker.
> 
> 
> On Oct 19, 2011, at 1:51 PM, Barrett, Brian W wrote:
> 
>> I actually think it's worse than that.  An ORTE error code can now have
>> the same error code as an OMPI error.  OMPI_ERR_REQUEST and
>> ORTE_ERR_RECV_LESS_THAN_POSTED now share the same integer return code.
>> Or, they should, if George hadn't made a mistake (see below).  The sharing
>> of return codes seems... bad.
>> 
>> Also, there's a bug in George's patch.  Error codes are all negative, so
>> OMPI_ERR_REQUEST should be OMPI_ERR_BASE -1 and OMPI_ERR_MAX should be
>> OMPI_ERR_BASE - 1, not plus 2.
>> 
>> Brian
>> 
>> On 10/19/11 1:32 PM, "Ralph Castain" <[email protected]> wrote:
>> 
>>> I've been wrestling with something from this commit, and I'm unsure of
>>> the right answer. So please consider this a general design question for
>>> the community.
>>> 
>>> This commit removes all the OMPI <-> ORTE equivalent constants - i.e., we
>>> used to declare OMPI-prefixed equivalents to every ORTE-prefixed
>>> constant. I understand the thinking (or at least, what I suspect was the
>>> thought), but it creates an issue.
>>> 
>>> Suppose I have an ompi-level function (A) that calls another ompi-level
>>> function (B). Invisible to A is that B calls an orte-level function. B
>>> dutifully checks the error return from the orte-level function against an
>>> ORTE-prefixed constant.
>>> 
>>> However, if that return isn't "success", what does B return up to A? It
>>> cannot return the OMPI equivalent to the orte error constant because it
>>> no longer exists. It could return the orte error code, but A has no way
>>> of knowing it is going to get a non-OMPI constant, and therefore won't be
>>> able to understand it - it will be an "unrecognized error".
>>> 
>>> I guess one option is to require that B "translate" the return code and
>>> pass some OMPI error up the chain, but this prevents anything upwards
>>> from understanding the nature of the problem and potentially taking
>>> corrective and/or alternative action. Seems awfully limiting, as most of
>>> the time the only option will be the vanilla "OMPI_ERROR".
>>> 
>>> Thoughts?
>> -- 
>> Brian W. Barrett
>> Dept. 1423: Scalable System Software
>> Sandia National Laboratories
>> 
>> 
>> 
>> 
>> 
>> 
>> _______________________________________________
>> devel mailing list
>> [email protected]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> -- 
> Jeff Squyres
> [email protected]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> _______________________________________________
> devel mailing list
> [email protected]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
