A careful reading of the committed patch would have shown that none of the concerns raised so far were valid: the "old-way" behavior of the OMPI code was preserved. Moreover, none of the removed error codes had been used in ages.
What Brian pointed out as evil (evil being a subjective notion in itself) neither prevented the correct behavior of the code nor affected its correctness in any way. Anyway, to address his concern I pushed a patch (25333) putting the OMPI error codes back where they were originally.

In other words, we spent a very unproductive day arguing over unfounded claims and "thought-to-be" behaviors.

  george.

On Oct 19, 2011, at 17:50 , Barrett, Brian W wrote:

> George -
>
> I wrote the error code gorp; I'm pretty sure I know exactly how it was
> supposed to work.
>
> There are 58 codes unused between OPAL_NETWORK_NOT_PARSEABLE and
> OPAL_ERR_MAX. I now see what you did with ERR_REQUEST, and it's evil.
> That's not the intent of the error code logic at all. If you want to
> change that, I'm not necessarily opposed to it, but that's something that
> should be discussed in an RFC. What the current code does is not
> consistent with the original intent.
>
> I don't agree that you shouldn't propagate error codes through OMPI; in
> fact, the original intent of the design was to allow such propagation.
> Again, such a change should be discussed as part of an RFC.
>
> Brian
>
> On 10/19/11 4:50 PM, "George Bosilca" <bosi...@eecs.utk.edu> wrote:
>
>> I don't know how you think that the error codes work in Open MPI, so I'll
>> take the liberty to depict it here so we all agree we're talking about
>> the same thing.
>>
>> The opal_strerror is a nice feature: it allows registering a range of
>> error codes with a particular error converter. Every time you look for
>> the meaning of a particular error code, the first converter with a range
>> enveloping the looked-up value will translate it into an error string.
>>
>> This is only currently used by OPAL and ORTE directly. It worked at the
>> OMPI level only because we mapped __all__ OMPI errors to OPAL or ORTE
>> ones.
>> This behavior didn't change after my patch, you can still use
>> opal_strerror to get the error string for all OPAL/ORTE/OMPI errors.
>>
>> There is a small "variation" for OMPI_ERR_REQUEST, the only really OMPI
>> specific error code today. The OMPI error codes are actually inserted
>> between the OPAL and the ORTE ones (there is a gap of 100 elements), so
>> there is __no__ possible overlap right now. If at one point we add more
>> than 100 OMPI-level codes, we should certainly revisit this.
>>
>> Now, resulting from my patch, there is a difference. One should not
>> simply forward an ORTE code into the stack of OMPI, and hope it just
>> works. Errors should be dealt with where they happen, and if that is not
>> possible they should be translated into the actual layer error code. The
>> error propagation should be compartmentalized, and has to be translated
>> into an error code that has a meaning at the OMPI level. The current
>> patch should not prevent the mixed error-code code from working, as
>> opal_strerror retains the same behavior as before. However, this coding
>> practice should be avoided. I tried to clean the current code of such
>> instances a few days ago in r25230.
>>
>> Moreover, this is similar to how we deal with the error codes between
>> OMPI and MPI layers, and seems like a sane way to compose libraries. You
>> deal with a specific layer error code when you get it (usually after the
>> call to a function from that specific layer), not later on when you don't
>> even know exactly what the execution path was.
>>
>>   george.
>>
>> PS: I'll fix the +/- issue.
>>
>> On Oct 19, 2011, at 14:09 , Jeff Squyres wrote:
>>
>>> Oy, yes, that is bad -- we cannot have overlapping ORTE and OMPI error
>>> codes. That seems like a very bad idea (in addition to the mixing of +
>>> and -).
>>>
>>> For one thing, that breaks opal_strerror(). That, in itself, seems
>>> like a dealbreaker.
>>>
>>> On Oct 19, 2011, at 1:51 PM, Barrett, Brian W wrote:
>>>
>>>> I actually think it's worse than that. An ORTE error code can now have
>>>> the same error code as an OMPI error. OMPI_ERR_REQUEST and
>>>> ORTE_ERR_RECV_LESS_THAN_POSTED now share the same integer return code.
>>>> Or, they should, if George hadn't made a mistake (see below). The
>>>> sharing of return codes seems... bad.
>>>>
>>>> Also, there's a bug in George's patch. Error codes are all negative, so
>>>> OMPI_ERR_REQUEST should be OMPI_ERR_BASE - 1 and OMPI_ERR_MAX should be
>>>> OMPI_ERR_BASE - 1, not plus 2.
>>>>
>>>> Brian
>>>>
>>>> On 10/19/11 1:32 PM, "Ralph Castain" <r...@open-mpi.org> wrote:
>>>>
>>>>> I've been wrestling with something from this commit, and I'm unsure of
>>>>> the right answer. So please consider this a general design question for
>>>>> the community.
>>>>>
>>>>> This commit removes all the OMPI <-> ORTE equivalent constants - i.e., we
>>>>> used to declare OMPI-prefixed equivalents to every ORTE-prefixed
>>>>> constant. I understand the thinking (or at least, what I suspect was the
>>>>> thought), but it creates an issue.
>>>>>
>>>>> Suppose I have an ompi-level function (A) that calls another ompi-level
>>>>> function (B). Invisible to A is that B calls an orte-level function. B
>>>>> dutifully checks the error return from the orte-level function against an
>>>>> ORTE-prefixed constant.
>>>>>
>>>>> However, if that return isn't "success", what does B return up to A? It
>>>>> cannot return the OMPI equivalent to the orte error constant because it
>>>>> no longer exists. It could return the orte error code, but A has no way
>>>>> of knowing it is going to get a non-OMPI constant, and therefore won't be
>>>>> able to understand it - it will be an "unrecognized error".
>>>>>
>>>>> I guess one option is to require that B "translate" the return code and
>>>>> pass some OMPI error up the chain, but this prevents anything upwards
>>>>> from understanding the nature of the problem and potentially taking
>>>>> corrective and/or alternative action. Seems awfully limiting, as most of
>>>>> the time the only option will be the vanilla "OMPI_ERROR".
>>>>>
>>>>> Thoughts?
>>>>
>>>> --
>>>>   Brian W. Barrett
>>>>   Dept. 1423: Scalable System Software
>>>>   Sandia National Laboratories
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
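
For readers following the thread, here is a rough sketch of the two ideas under discussion: a string lookup that walks a table of registered, non-overlapping code ranges and uses the first converter whose range envelops the value, and a boundary function that translates a lower-layer code into the upper layer's own code instead of forwarding it. Every name and numeric value in it (demo_register, demo_strerror, up_from_low, the LOW_*/UP_* codes and the range bounds) is hypothetical and chosen only for illustration. It is not the Open MPI implementation, and rejecting overlapping ranges at registration time is an assumption of the sketch, not something stated in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* All names and numeric values below are hypothetical, chosen for
 * illustration; they are not the Open MPI API or its real error codes. */

typedef const char *(*demo_converter_fn)(int errnum);

typedef struct {
    int max;               /* code closest to zero (inclusive)    */
    int min;               /* code farthest from zero (inclusive) */
    demo_converter_fn fn;  /* translates codes inside [min, max]  */
} demo_range;

#define DEMO_MAX_RANGES 8
static demo_range demo_ranges[DEMO_MAX_RANGES];
static size_t demo_num_ranges = 0;

static const char DEMO_UNRECOGNIZED[] = "unrecognized error";

/* Register a converter for [min, max]. Returns 0 on success, -1 if the
 * table is full or the range overlaps one already registered (rejecting
 * overlap is an assumption of this sketch). */
static int demo_register(int max, int min, demo_converter_fn fn)
{
    assert(fn != NULL && min <= max);
    if (demo_num_ranges == DEMO_MAX_RANGES) {
        return -1;
    }
    for (size_t i = 0; i < demo_num_ranges; ++i) {
        if (min <= demo_ranges[i].max && demo_ranges[i].min <= max) {
            return -1;  /* the two ranges share at least one code */
        }
    }
    demo_ranges[demo_num_ranges++] = (demo_range){ max, min, fn };
    return 0;
}

/* The first registered range that envelops errnum supplies the string. */
static const char *demo_strerror(int errnum)
{
    for (size_t i = 0; i < demo_num_ranges; ++i) {
        if (errnum <= demo_ranges[i].max && errnum >= demo_ranges[i].min) {
            const char *msg = demo_ranges[i].fn(errnum);
            return msg != NULL ? msg : DEMO_UNRECOGNIZED;
        }
    }
    return DEMO_UNRECOGNIZED;
}

/* Two made-up layers; all error codes are negative, 0 means success. */
enum { LOW_SUCCESS = 0, LOW_ERR_TIMEOUT = -3, LOW_ERR_TRUNCATED = -7 };
enum { UP_SUCCESS = 0, UP_ERROR = -101, UP_ERR_TIMEOUT = -102 };

static const char *low_strerror(int errnum)
{
    switch (errnum) {
    case LOW_ERR_TIMEOUT:   return "low: timeout";
    case LOW_ERR_TRUNCATED: return "low: message truncated";
    default:                return NULL;
    }
}

static const char *up_strerror(int errnum)
{
    switch (errnum) {
    case UP_ERROR:       return "up: generic error";
    case UP_ERR_TIMEOUT: return "up: timeout";
    default:             return NULL;
    }
}

/* Translate at the layer boundary: codes with an upper-layer equivalent
 * keep their meaning, everything else collapses to the generic error. */
static int up_from_low(int low_rc)
{
    switch (low_rc) {
    case LOW_SUCCESS:     return UP_SUCCESS;
    case LOW_ERR_TIMEOUT: return UP_ERR_TIMEOUT;
    default:              return UP_ERROR;
    }
}
```

A caller would register each layer once, for example demo_register(-1, -99, low_strerror) followed by demo_register(-100, -199, up_strerror), and then wrap every lower-layer call as rc = up_from_low(low_call()). The default branch of up_from_low is where the information loss Ralph describes shows up: any code without an explicit mapping reaches the caller only as the generic error.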