Re: [OMPI devel] [OMPI svn] svn:open-mpi r25323

George Bosilca Wed, 19 Oct 2011 18:41:55 -0400

A careful reading of the committed patch would have shown that none of the
concerns raised so far were true: the "old-way" behavior of the OMPI code
was preserved. Moreover, none of the removed error codes had been used in
ages.


What Brian pointed out as evil, evil being a subjective notion in itself,
did not prevent the code from behaving correctly, nor did it affect its
correctness in any way. Anyway, to address his concern I pushed a patch
(25333) putting the OMPI error codes back where they were originally.

In other words, we spent a very unproductive day arguing over unfounded
concerns and "thought-to-be" behaviors.

  george.


On Oct 19, 2011, at 17:50 , Barrett, Brian W wrote:

> George -
> 
> I wrote the error code gorp; I'm pretty sure I know exactly how it was
> supposed to work.
> 
> There are 58 codes unused between OPAL_NETWORK_NOT_PARSEABLE and
> OPAL_ERR_MAX.  I now see what you did with ERR_REQUEST, and it's evil.
> That's not the intent of the error code logic at all.  If you want to
> change that, I'm not necessarily opposed to it, but that's something that
> should be discussed in an RFC.  What the current code does is not
> consistent with the original intent.
> 
> I don't agree that you shouldn't propagate error codes through OMPI; in
> fact, the original intent of the design was to allow such propagation.
> Again, such a change should be discussed as part of an RFC.
> 
> Brian
> 
> On 10/19/11 4:50 PM, "George Bosilca" <bosi...@eecs.utk.edu> wrote:
> 
>> I don't know how you think the error codes work in Open MPI, so I'll
>> take the liberty of describing it here so we all agree we're talking
>> about the same thing.
>> 
>> The opal_strerror is a nice feature: it allows registering a range of
>> error codes with a particular error converter. Every time you look up
>> the meaning of a particular error code, the first converter whose range
>> encloses that value will translate it into an error string.
>> 
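The range-based lookup described above can be sketched roughly as follows.
This is only an illustration under stated assumptions: every demo_* name,
the fixed-size table, and the numeric ranges are hypothetical and are not
the Open MPI API or its actual code values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical converter type: turns a code from its range into a string. */
typedef const char *(*demo_err2str_fn)(int code);

typedef struct {
    int lo;                   /* most negative code in the range, inclusive */
    int hi;                   /* least negative code in the range, inclusive */
    demo_err2str_fn convert;  /* converter responsible for [lo, hi] */
} demo_range;

#define DEMO_MAX_RANGES 8
static demo_range demo_ranges[DEMO_MAX_RANGES];
static int demo_num_ranges = 0;

/* Register a converter for [lo, hi]. Returns 0 on success, -1 if the table is full. */
static int demo_register(int lo, int hi, demo_err2str_fn convert)
{
    assert(lo <= hi && convert != NULL);
    if (demo_num_ranges == DEMO_MAX_RANGES) {
        return -1;
    }
    demo_ranges[demo_num_ranges].lo = lo;
    demo_ranges[demo_num_ranges].hi = hi;
    demo_ranges[demo_num_ranges].convert = convert;
    demo_num_ranges++;
    return 0;
}

/* The first registered range that encloses the code translates it. */
static const char *demo_strerror(int code)
{
    for (int i = 0; i < demo_num_ranges; i++) {
        if (code >= demo_ranges[i].lo && code <= demo_ranges[i].hi) {
            return demo_ranges[i].convert(code);
        }
    }
    return "unknown error";
}

/* Two hypothetical layers, each with its own converter. */
static const char *demo_low_layer_str(int code)
{
    return (code == -1) ? "low layer: generic error" : "low layer: other error";
}

static const char *demo_high_layer_str(int code)
{
    return (code == -101) ? "high layer: bad request" : "high layer: other error";
}
```

With demo_register(-100, 0, demo_low_layer_str) and
demo_register(-200, -101, demo_high_layer_str) in place, a lookup of -101
is answered by the high-layer converter. If two registered ranges
overlapped, the earlier registration would silently win, which is why
disjoint ranges matter for this scheme.
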
>> This is only currently used by OPAL and ORTE directly. It worked at the
>> OMPI level only because we mapped __all__ OMPI errors to OPAL or ORTE
>> ones. This behavior didn't change after my patch, you can still use
>> opal_strerror to get the error string for all OPAL/ORTE/OMPI errors.
>> 
>> There is a small "variation" for OMPI_ERR_REQUEST, the only truly
>> OMPI-specific error code today. The OMPI error codes are actually
>> inserted between the OPAL and the ORTE ones (there is a gap of 100
>> elements), so there is __no__ possible overlap right now. If at some
>> point we add more than 100 OMPI-level codes, we should certainly revisit
>> this.
>> 
>> Now, as a result of my patch, there is a difference. One should not
>> simply forward an ORTE code up the OMPI stack and hope it just works.
>> Errors should be dealt with where they happen, and if that is not
>> possible they should be translated into the current layer's error code.
>> Error propagation should be compartmentalized: an error has to be
>> translated into a code that has a meaning at the OMPI level. The current
>> patch should not prevent mixed error-code code from working, as
>> opal_strerror retains the same behavior as before. However, this coding
>> practice should be avoided. I tried to clean the current code of such
>> instances a few days ago in r25230.
>> 
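A minimal sketch of translating at the layer boundary, as argued above.
All names and values here (the DEMO_LOW_* and DEMO_UP_* codes and the
demo_* functions) are hypothetical stand-ins for a lower and an upper
layer; they are not actual ORTE or OMPI constants or functions.

```c
#include <assert.h>

/* Hypothetical lower-layer codes. */
enum {
    DEMO_LOW_SUCCESS             = 0,
    DEMO_LOW_ERR_OUT_OF_RESOURCE = -2,
    DEMO_LOW_ERR_TIMEOUT         = -3
};

/* Hypothetical upper-layer codes, kept in a separate numeric range. */
enum {
    DEMO_UP_SUCCESS             = 0,
    DEMO_UP_ERROR               = -101,  /* generic fallback */
    DEMO_UP_ERR_TIMEOUT         = -102,
    DEMO_UP_ERR_OUT_OF_RESOURCE = -103
};

/* Translate a lower-layer code into one that has meaning one layer up. */
static int demo_up_from_low(int low_rc)
{
    switch (low_rc) {
    case DEMO_LOW_SUCCESS:             return DEMO_UP_SUCCESS;
    case DEMO_LOW_ERR_TIMEOUT:         return DEMO_UP_ERR_TIMEOUT;
    case DEMO_LOW_ERR_OUT_OF_RESOURCE: return DEMO_UP_ERR_OUT_OF_RESOURCE;
    default:                           return DEMO_UP_ERROR;
    }
}

/* Stand-in for a lower-layer call; it simply returns the requested code. */
static int demo_low_send(int simulated_rc)
{
    return simulated_rc;
}

/* Upper-layer function B: checks the lower-layer code right after the call. */
static int demo_up_function_b(int simulated_rc)
{
    int rc = demo_low_send(simulated_rc);
    if (DEMO_LOW_SUCCESS != rc) {
        return demo_up_from_low(rc);
    }
    return DEMO_UP_SUCCESS;
}

/* Upper-layer function A: only ever sees upper-layer codes from B. */
static int demo_up_function_a(int simulated_rc)
{
    int rc = demo_up_function_b(simulated_rc);
    assert(DEMO_UP_SUCCESS == rc ||
           (rc <= DEMO_UP_ERROR && rc >= DEMO_UP_ERR_OUT_OF_RESOURCE));
    if (DEMO_UP_ERR_TIMEOUT == rc) {
        /* Corrective action is still possible: retry once. */
        return demo_up_function_b(DEMO_LOW_SUCCESS);
    }
    return rc;
}
```

How much detail survives the boundary depends on how many specific
upper-layer codes the translation table defines; anything without a
counterpart collapses into the generic fallback.
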
>> Moreover, this is similar to how we deal with the error codes between
>> OMPI and MPI layers, and seems like a sane way to compose libraries. You
>> deal with a specific layer error code when you get it (usually after the
>> call to a function from that specific layer), not later on when you don't
>> even know exactly what the execution path was.
>> 
>> george.
>> 
>> PS: I'll fix the +/- issue.
>> 
>> On Oct 19, 2011, at 14:09 , Jeff Squyres wrote:
>> 
>>> Oy, yes, that is bad -- we cannot have overlapping ORTE and OMPI error
>>> codes. That seems like a very bad idea (in addition to the mixing of +
>>> and -).
>>> 
>>> For one thing, that breaks opal_strerror().  That, in itself, seems
>>> like a dealbreaker.
>>> 
>>> 
>>> On Oct 19, 2011, at 1:51 PM, Barrett, Brian W wrote:
>>> 
>>>> I actually think it's worse than that.  An ORTE error code can now have
>>>> the same value as an OMPI error code.  OMPI_ERR_REQUEST and
>>>> ORTE_ERR_RECV_LESS_THAN_POSTED now share the same integer return code.
>>>> Or, they should, if George hadn't made a mistake (see below).  The
>>>> sharing of return codes seems... bad.
>>>> 
>>>> Also, there's a bug in George's patch.  Error codes are all negative,
>>>> so OMPI_ERR_REQUEST should be OMPI_ERR_BASE - 1 and OMPI_ERR_MAX should
>>>> be OMPI_ERR_BASE - 1, not plus 2.
>>>> 
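The sign issue above can be illustrated with a small sketch. It assumes a
hypothetical layout in which each layer owns 100 codes and the ranges grow
downward from zero; the DEMO_* names and values are illustrative only and
are not the real OPAL, ORTE, or OMPI definitions.

```c
#include <assert.h>

/* Hypothetical layout: ranges grow downward, 100 codes per layer. */
#define DEMO_LOW_BASE 0
#define DEMO_LOW_MAX  (DEMO_LOW_BASE - 100)
#define DEMO_UP_BASE  DEMO_LOW_MAX
#define DEMO_UP_MAX   (DEMO_UP_BASE - 100)

enum {
    DEMO_UP_ERR_REQUEST = DEMO_UP_BASE - 1,  /* -101: inside the upper range */
    DEMO_UP_ERR_WRONG   = DEMO_UP_BASE + 2   /* -98: falls into the lower range */
};

/* Lower layer owns codes in (DEMO_LOW_MAX, DEMO_LOW_BASE]. */
static int demo_is_lower_code(int code)
{
    return code <= DEMO_LOW_BASE && code > DEMO_LOW_MAX;
}

/* Upper layer owns codes strictly between DEMO_UP_MAX and DEMO_UP_BASE. */
static int demo_is_upper_code(int code)
{
    return code < DEMO_UP_BASE && code > DEMO_UP_MAX;
}

/* Runtime sanity check that a new upper-layer code stays in its own range. */
static void demo_check_upper_code(int code)
{
    assert(demo_is_upper_code(code) && !demo_is_lower_code(code));
}
```

Because the codes are negative, new codes are obtained by subtracting from
the base. Adding to the base moves the code back toward zero and into the
neighboring layer's range, where it can collide with an existing code.
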
>>>> Brian
>>>> 
>>>> On 10/19/11 1:32 PM, "Ralph Castain" <r...@open-mpi.org> wrote:
>>>> 
>>>>> I've been wrestling with something from this commit, and I'm unsure of
>>>>> the right answer. So please consider this a general design question
>>>>> for the community.
>>>>> 
>>>>> This commit removes all the OMPI <-> ORTE equivalent constants - i.e.,
>>>>> we used to declare OMPI-prefixed equivalents to every ORTE-prefixed
>>>>> constant. I understand the thinking (or at least, what I suspect was
>>>>> the thought), but it creates an issue.
>>>>> 
>>>>> Suppose I have an ompi-level function (A) that calls another
>>>>> ompi-level function (B). Invisible to A is that B calls an orte-level
>>>>> function. B dutifully checks the error return from the orte-level
>>>>> function against an ORTE-prefixed constant.
>>>>> 
>>>>> However, if that return isn't "success", what does B return up to A?
>>>>> It cannot return the OMPI equivalent to the orte error constant
>>>>> because it no longer exists. It could return the orte error code, but
>>>>> A has no way of knowing it is going to get a non-OMPI constant, and
>>>>> therefore won't be able to understand it - it will be an
>>>>> "unrecognized error".
>>>>> 
>>>>> I guess one option is to require that B "translate" the return code
>>>>> and pass some OMPI error up the chain, but this prevents anything
>>>>> upwards from understanding the nature of the problem and potentially
>>>>> taking corrective and/or alternative action. Seems awfully limiting,
>>>>> as most of the time the only option will be the vanilla "OMPI_ERROR".
>>>>> 
>>>>> Thoughts?
>>>> -- 
>>>> Brian W. Barrett
>>>> Dept. 1423: Scalable System Software
>>>> Sandia National Laboratories
>>>> 
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>>> 
>>> -- 
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
>> 
> 
> 
> -- 
>  Brian W. Barrett
>  Dept. 1423: Scalable System Software
>  Sandia National Laboratories