There are several OPAL-level error codes that are not used in the current code:

OPAL_ERR_TOPO_SLOT_LIST_NOT_SUPPORTED
OPAL_ERR_TOPO_SOCKET_NOT_SUPPORTED
OPAL_ERR_TOPO_CORE_NOT_SUPPORTED
OPAL_ERR_NOT_ENOUGH_SOCKETS
OPAL_ERR_NOT_ENOUGH_CORES
OPAL_ERR_INVALID_PHYS_CPU
OPAL_ERR_MULTIPLE_AFFINITIES

If somebody feels like filing an RFC to remove them, please feel free to go
ahead.

  george.

On Oct 19, 2011, at 18:41, George Bosilca wrote:

> A careful reading of the committed patch would have shown that none of
> the concerns raised so far were true: the "old-way" behavior of the OMPI
> code was preserved. Moreover, none of the error codes removed had been
> used in ages.
>
> What Brian pointed out as evil, evil being a subjective notion by itself,
> didn't prevent the correct behavior of the code, nor did it affect its
> correctness in any way. Anyway, to address his concern I pushed a patch
> (25333) putting the OMPI error codes back where they were originally.
>
> In other words, we spent a very unproductive day arguing over unfounded
> arguments and "thought-to-be" behaviors.
>
>   george.
>
>
> On Oct 19, 2011, at 17:50, Barrett, Brian W wrote:
>
>> George -
>>
>> I wrote the error code gorp; I'm pretty sure I know exactly how it was
>> supposed to work.
>>
>> There are 58 codes unused between OPAL_NETWORK_NOT_PARSEABLE and
>> OPAL_ERR_MAX. I now see what you did with ERR_REQUEST, and it's evil.
>> That's not the intent of the error code logic at all. If you want to
>> change that, I'm not necessarily opposed to it, but that's something that
>> should be discussed in an RFC. What the current code does is not
>> consistent with the original intent.
>>
>> I don't agree that you shouldn't propagate error codes through OMPI; in
>> fact, the original intent of the design was to allow such propagation.
>> Again, such a change should be discussed as part of an RFC.
>>
>> Brian
>>
>> On 10/19/11 4:50 PM, "George Bosilca" <bosi...@eecs.utk.edu> wrote:
>>
>>> I don't know how you think that the error codes work in Open MPI, so I'll
>>> take the liberty to depict it here so we all agree we're talking about
>>> the same thing.
>>>
>>> The opal_strerror is a nice feature: it allows registering a range of
>>> error codes with a particular error converter.
>>> Every time you look up the meaning of a particular error code, the
>>> first converter whose range envelops the value translates it into an
>>> error string.
>>>
>>> This is currently used directly only by OPAL and ORTE. It worked at the
>>> OMPI level only because we mapped __all__ OMPI errors to OPAL or ORTE
>>> ones. This behavior didn't change after my patch: you can still use
>>> opal_strerror to get the error string for all OPAL/ORTE/OMPI errors.
>>>
>>> There is a small "variation" for OMPI_ERR_REQUEST, the only really
>>> OMPI-specific error code today. The OMPI error codes are actually
>>> inserted between the OPAL and the ORTE ones (there is a gap of 100
>>> elements), so there is __no__ possible overlap right now. If at some
>>> point we add more than 100 OMPI-level error codes, we should certainly
>>> revisit this.
>>>
>>> Now, resulting from my patch, there is a difference. One should not
>>> simply forward an ORTE code into the OMPI stack and hope it just
>>> works. Errors should be dealt with where they happen, and if that is
>>> not possible they should be translated into an error code of the
>>> current layer. Error propagation should be compartmentalized: an error
>>> has to be translated into an error code that has a meaning at the OMPI
>>> level. The current patch should not prevent code that mixes error codes
>>> from working, as opal_strerror retains the same behavior as before.
>>> However, this coding practice should be avoided. I tried to clean the
>>> current code of such instances a few days ago in r25230.
>>>
>>> Moreover, this is similar to how we deal with the error codes between
>>> the OMPI and MPI layers, and seems like a sane way to compose libraries.
>>> You deal with a specific layer's error code when you get it (usually
>>> after the call to a function from that specific layer), not later on
>>> when you don't even know exactly what the execution path was.
>>>
>>>   george.
>>>
>>> PS: I'll fix the +/- issue.
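
For readers following the thread, the range-based lookup described in the
message above can be illustrated roughly as follows. This is only a sketch
under stated assumptions, not the actual Open MPI implementation: every name
(demo_register_range, demo_strerror, the two layer converters) and every
numeric range is hypothetical. Rejecting overlapping ranges at registration
time is a choice made for this sketch, to show why overlap matters when the
first enclosing range wins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical converter type: returns a string for a code, or NULL. */
typedef const char *(*demo_converter_fn)(int errnum);

/* One registered range. Codes are negative, so base >= max. */
typedef struct {
    int base;                  /* largest code in the range, e.g. -100  */
    int max;                   /* smallest code in the range, e.g. -199 */
    demo_converter_fn convert;
} demo_range_t;

#define DEMO_MAX_RANGES 8
static demo_range_t demo_ranges[DEMO_MAX_RANGES];
static int demo_num_ranges = 0;

/* Register a converter for every code in [max, base].
 * Returns 0 on success, -1 if the table is full or the range overlaps
 * one that is already registered. */
static int demo_register_range(int base, int max, demo_converter_fn fn)
{
    assert(NULL != fn && max <= base);
    if (demo_num_ranges >= DEMO_MAX_RANGES) {
        return -1;
    }
    for (int i = 0; i < demo_num_ranges; i++) {
        if (max <= demo_ranges[i].base && demo_ranges[i].max <= base) {
            return -1;         /* overlap: the lookup would be ambiguous */
        }
    }
    demo_ranges[demo_num_ranges].base = base;
    demo_ranges[demo_num_ranges].max = max;
    demo_ranges[demo_num_ranges].convert = fn;
    demo_num_ranges++;
    return 0;
}

/* The first registered range that encloses errnum does the translation. */
static const char *demo_strerror(int errnum)
{
    for (int i = 0; i < demo_num_ranges; i++) {
        if (errnum <= demo_ranges[i].base && errnum >= demo_ranges[i].max) {
            const char *msg = demo_ranges[i].convert(errnum);
            return (NULL != msg) ? msg : "unrecognized error";
        }
    }
    return "unrecognized error";
}

/* Hypothetical converters for two layers with disjoint ranges. */
static const char *low_layer_strerror(int errnum)
{
    switch (errnum) {
    case -1: return "low layer: generic error";
    case -2: return "low layer: out of resource";
    default: return NULL;
    }
}

static const char *mid_layer_strerror(int errnum)
{
    switch (errnum) {
    case -101: return "mid layer: bad request";
    default:   return NULL;
    }
}
```

With the low layer registered on [-99, 0] and the mid layer on [-199, -100],
demo_strerror(-101) is answered by the mid-layer converter, while a code that
falls outside every range, or one its converter does not know, comes back as
"unrecognized error".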
>>>
>>> On Oct 19, 2011, at 14:09, Jeff Squyres wrote:
>>>
>>>> Oy, yes, that is bad -- we cannot have overlapping ORTE and OMPI error
>>>> codes. That seems like a very bad idea (in addition to the mixing of +
>>>> and -).
>>>>
>>>> For one thing, that breaks opal_strerror(). That, in itself, seems
>>>> like a dealbreaker.
>>>>
>>>>
>>>> On Oct 19, 2011, at 1:51 PM, Barrett, Brian W wrote:
>>>>
>>>>> I actually think it's worse than that. An ORTE error code can now have
>>>>> the same value as an OMPI error code. OMPI_ERR_REQUEST and
>>>>> ORTE_ERR_RECV_LESS_THAN_POSTED now share the same integer return code.
>>>>> Or, they should, if George hadn't made a mistake (see below). The
>>>>> sharing of return codes seems... bad.
>>>>>
>>>>> Also, there's a bug in George's patch. Error codes are all negative, so
>>>>> OMPI_ERR_REQUEST should be OMPI_ERR_BASE - 1 and OMPI_ERR_MAX should be
>>>>> OMPI_ERR_BASE - 1, not plus 2.
>>>>>
>>>>> Brian
>>>>>
>>>>> On 10/19/11 1:32 PM, "Ralph Castain" <r...@open-mpi.org> wrote:
>>>>>
>>>>>> I've been wrestling with something from this commit, and I'm unsure
>>>>>> of the right answer. So please consider this a general design
>>>>>> question for the community.
>>>>>>
>>>>>> This commit removes all the OMPI <-> ORTE equivalent constants -
>>>>>> i.e., we used to declare OMPI-prefixed equivalents to every
>>>>>> ORTE-prefixed constant. I understand the thinking (or at least, what
>>>>>> I suspect was the thought), but it creates an issue.
>>>>>>
>>>>>> Suppose I have an ompi-level function (A) that calls another
>>>>>> ompi-level function (B). Invisible to A is that B calls an orte-level
>>>>>> function. B dutifully checks the error return from the orte-level
>>>>>> function against an ORTE-prefixed constant.
>>>>>>
>>>>>> However, if that return isn't "success", what does B return up to A?
>>>>>> It cannot return the OMPI equivalent to the orte error constant
>>>>>> because it no longer exists. It could return the orte error code, but
>>>>>> A has no way of knowing it is going to get a non-OMPI constant, and
>>>>>> therefore won't be able to understand it - it will be an
>>>>>> "unrecognized error".
>>>>>>
>>>>>> I guess one option is to require that B "translate" the return code
>>>>>> and pass some OMPI error up the chain, but this prevents anything
>>>>>> upwards from understanding the nature of the problem and potentially
>>>>>> taking corrective and/or alternative action. Seems awfully limiting,
>>>>>> as most of the time the only option will be the vanilla "OMPI_ERROR".
>>>>>>
>>>>>> Thoughts?
>>>>>
>>>>> --
>>>>> Brian W. Barrett
>>>>> Dept. 1423: Scalable System Software
>>>>> Sandia National Laboratories
>>>>>
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
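
The A/B scenario from the original question, combined with the "translate at
the layer boundary" approach argued for later in the thread, might look
roughly like the sketch below. All constants, values, and function names here
are hypothetical stand-ins (LOWER_* for ORTE-style codes, UPPER_* for
OMPI-style codes); none of them are taken from the Open MPI sources. The
default branch shows the limitation raised in the question: any lower-layer
code without a specific mapping collapses into the generic error.

```c
#include <assert.h>

/* Hypothetical lower-layer codes (stand-ins for ORTE-style constants). */
enum {
    LOWER_SUCCESS        =    0,
    LOWER_ERR_OUT_OF_RES = -201,
    LOWER_ERR_TIMEOUT    = -202,
    LOWER_ERR_UNREACH    = -203
};

/* Hypothetical upper-layer codes (stand-ins for OMPI-style constants). */
enum {
    UPPER_SUCCESS             =    0,
    UPPER_ERROR               = -101,  /* generic, "vanilla" error */
    UPPER_ERR_OUT_OF_RESOURCE = -102,
    UPPER_ERR_TIMEOUT         = -103
};

/* Translate a lower-layer return code at the layer boundary. */
static int upper_from_lower(int lower_rc)
{
    assert(lower_rc <= 0);     /* this sketch assumes non-positive codes */
    switch (lower_rc) {
    case LOWER_SUCCESS:        return UPPER_SUCCESS;
    case LOWER_ERR_OUT_OF_RES: return UPPER_ERR_OUT_OF_RESOURCE;
    case LOWER_ERR_TIMEOUT:    return UPPER_ERR_TIMEOUT;
    default:                   return UPPER_ERROR;  /* detail is lost here */
    }
}

/* Stand-in for a lower-layer call; it just returns the simulated code. */
static int lower_send(int simulated_rc)
{
    return simulated_rc;
}

/* "B": upper-layer function that calls into the lower layer and
 * translates the result before returning it. */
static int upper_function_b(int simulated_rc)
{
    int rc = lower_send(simulated_rc);
    if (LOWER_SUCCESS != rc) {
        return upper_from_lower(rc);
    }
    return UPPER_SUCCESS;
}

/* "A": upper-layer caller that only ever sees upper-layer codes. It can
 * still react to a condition that has a specific mapping, such as a
 * timeout, by retrying once (the retry is simulated as succeeding). */
static int upper_function_a(int simulated_rc)
{
    int rc = upper_function_b(simulated_rc);
    if (UPPER_ERR_TIMEOUT == rc) {
        rc = upper_function_b(LOWER_SUCCESS);
    }
    return rc;
}
```

In this sketch A can act on a timeout because a specific upper-layer code
exists for it, whereas LOWER_ERR_UNREACH reaches A only as UPPER_ERROR. How
many specific mappings a layer should carry is exactly the design question
the thread is debating.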