Brian, I thought the OPAL_SOS stuff was supposed to be the way to fix this? https://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages
Has that effort faded or not worked?

On Wed, Nov 2, 2011 at 2:22 PM, Barrett, Brian W <bwba...@sandia.gov> wrote:
> To be honest, I don't care so much today, I'm just fighting so that the
> output doesn't get worse. At some point, we do need to figure out a
> better way of dealing with error messages, but not today :).
>
> Brian
>
> On 11/2/11 11:53 AM, "Ralph Castain" <r...@open-mpi.org> wrote:
>
>> Hmmm....since it was my bug that surfaced the problem, maybe the best
>> answer is to just return an error code. I'll slowly work thru the param
>> registrations in ORTE and make them all check the return code. I'm
>> willing to look at OPAL as I go, but someone else will have to deal with
>> the OMPI layer.
>>
>> I don't know how to entirely avoid the message issue Brian mentions -
>> I'll still have to say -something- when I get an error code, but I have
>> come up with some methods for reducing the clutter.
>>
>> On Nov 2, 2011, at 11:43 AM, Barrett, Brian W wrote:
>>
>>> I really don't like our show_help at every level behavior (look at what
>>> happens when MPI_INIT fails, you get a page per process of the same error
>>> message from each level of the call stack). If you want to show_help and
>>> abort on debug, that makes sense. It doesn't make any sense on a
>>> production build. Return an error code and let the upper layer deal with
>>> it.
>>>
>>> Brian
>>>
>>> On 11/2/11 11:27 AM, "Jeff Squyres" <jsquy...@cisco.com> wrote:
>>>
>>>> Brian: you were the one that had an allergic reaction to #1 on the call.
>>>>
>>>> Thoughts?
>>>>
>>>> On Nov 2, 2011, at 1:23 PM, George Bosilca wrote:
>>>>
>>>>> As it has been said, this is not something supposed to make it in a
>>>>> release. On the unfortunate case where it does, always having a
>>>>> show_help will ensure a quick complaint on one of our mailing lists and
>>>>> increase the probability of a [very] quick fix.
>>>>>
>>>>> george.
>>>>>
>>>>> On Nov 2, 2011, at 06:26 , TERRY DONTJE wrote:
>>>>>
>>>>>> On 11/1/2011 7:48 PM, Jeff Squyres wrote:
>>>>>>> So this was slightly different than the opinion that was discussed on
>>>>>>> the call today, which was 2. The rationale for #2 was to punish
>>>>>>> developers, but if such a bug did make it through to production, users
>>>>>>> wouldn't be annoyed with show_help messages all the time.
>>>>>>>
>>>>>>> Does anyone have strong opinions here? I don't.
>>>>>>>
>>>>>>> I offer the following two points:
>>>>>>>
>>>>>>> - this is a coding error on the OMPI developer
>>>>>>> - it's pretty rare
>>>>>>>
>>>>>> I think a show_help + return is very helpful in this case. I wouldn't
>>>>>> think that we'd run into this case that much and it would seem that it
>>>>>> would be a rare occurrence that one could just fix when they run into
>>>>>> it. However, since there was some opposition to having show_help
>>>>>> messages possibly coming up all over the place I thought a fallback
>>>>>> of only doing the show_help on enable_debug builds was a
>>>>>> reasonable middle ground.
>>>>>>
>>>>>> --td
>>>>>>> On Nov 1, 2011, at 7:30 PM, George Bosilca wrote:
>>>>>>>
>>>>>>>> 1
>>>>>>>>
>>>>>>>> george.
>>>>>>>>
>>>>>>>> On Nov 1, 2011, at 17:23 , Jeff Squyres wrote:
>>>>>>>>
>>>>>>>>> Can you clarify -- I can parse your text multiple ways. Which are
>>>>>>>>> you voting for?
>>>>>>>>>
>>>>>>>>> 1. show_help + return error code in all cases.
>>>>>>>>> 2. if OPAL_ENABLE_DEBUG, show_help + exit(1), else silently return
>>>>>>>>>    error code.
>>>>>>>>> 3. show_help. if OPAL_ENABLE_DEBUG, exit(1), else return error
>>>>>>>>>    code.
>>>>>>>>>
>>>>>>>>> On Nov 1, 2011, at 4:50 PM, George Bosilca wrote:
>>>>>>>>>
>>>>>>>>>> This is a much saner solution. We [mostly] stayed away from
>>>>>>>>>> calling exit deep into our libraries, there is no reason to add it
>>>>>>>>>> now.
>>>>>>>>>> I'll vote in favor of show_help + return code.
>>>>>>>>>>
>>>>>>>>>> george.
>>>>>>>>>>
>>>>>>>>>> On Nov 1, 2011, at 15:14 , Jeff Squyres wrote:
>>>>>>>>>>
>>>>>>>>>>> We talked about this on the call today.
>>>>>>>>>>>
>>>>>>>>>>> A good suggestion was made: call show_help/opal_finalize/exit
>>>>>>>>>>> only when OPAL_ENABLE_DEBUG is true. Otherwise, return an error
>>>>>>>>>>> code.
>>>>>>>>>>>
>>>>>>>>>>> If no one objects to this, I'll commit this tomorrow.
>>>>>>>>>>>
>>>>>>>>>>> On Oct 31, 2011, at 4:16 PM, Jeff Squyres wrote:
>>>>>>>>>>>
>>>>>>>>>>>> WHAT: what to do if registering an MCA param results in an error?
>>>>>>>>>>>>
>>>>>>>>>>>> WHERE: opal/mca/base/mca_base_param.c
>>>>>>>>>>>>
>>>>>>>>>>>> WHY: MCA param re-registration issues should be treated as OMPI
>>>>>>>>>>>> developer errors
>>>>>>>>>>>>
>>>>>>>>>>>> WHEN: COB Friday, 4 Nov 2011
>>>>>>>>>>>>
>>>>>>>>>>>> -----------------
>>>>>>>>>>>>
>>>>>>>>>>>> Short version:
>>>>>>>>>>>>
>>>>>>>>>>>> Re-registering an MCA param to be a different type (e.g., it was
>>>>>>>>>>>> initially registered to be a string, but was later re-registered
>>>>>>>>>>>> to be an int) should be treated as an OMPI developer error, and
>>>>>>>>>>>> should opal_finalize()/exit(1).
>>>>>>>>>>>>
>>>>>>>>>>>> More details:
>>>>>>>>>>>>
>>>>>>>>>>>> A mistaken MCA param re-registration recently caused an orted
>>>>>>>>>>>> segv.
>>>>>>>>>>>>
>>>>>>>>>>>> The MCA param subsystem was fixed to avoid this segv, but
>>>>>>>>>>>> silently convert the MCA param to the newly-registered type.
>>>>>>>>>>>> Upon reflection and some discussion, this seems to be a bad idea.
>>>>>>>>>>>> Instead, we should loudly complain via a show_help message and
>>>>>>>>>>>> then exit(1).
>>>>>>>>>>>>
>>>>>>>>>>>> Specifically: this kind of behavior is clearly an error and
>>>>>>>>>>>> should be fixed.
>>>>>>>>>>>> Unfortunately, in most cases, we don't actually
>>>>>>>>>>>> check the return value from MCA param registration functions, so
>>>>>>>>>>>> if we change the MCA param function to simply return a non
>>>>>>>>>>>> OPAL_SUCCESS status, it's unlikely that anyone will notice until
>>>>>>>>>>>> some code tries to read the param value, likely still resulting
>>>>>>>>>>>> in a segv.
>>>>>>>>>>>>
>>>>>>>>>>>> Does anyone have heartburn if I change the error behavior to
>>>>>>>>>>>> opal_finalize()/exit(1)?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Jeff Squyres
>>>>>>>>>>>> jsquy...@cisco.com
>>>>>>>>>>>> For corporate legal information go to:
>>>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> devel mailing list
>>>>>>>>>>>> de...@open-mpi.org
>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>
>>>>>> --
>>>>>> Terry D. Dontje | Principal Software Engineer
>>>>>> Developer Tools Engineering | +1.781.442.2631
>>>>>> Oracle - Performance Technologies
>>>>>> 95 Network Drive, Burlington, MA 01803
>>>>>> Email terry.don...@oracle.com
>>>
>>> --
>>> Brian W. Barrett
>>> Dept. 1423: Scalable System Software
>>> Sandia National Laboratories

--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
timat...@open-mpi.org || tmat...@gmail.com
I'm a bright... http://www.the-brights.net/
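
For reference, the three options Jeff enumerates in the quoted thread (1: show_help + return in all cases; 2: show_help + exit(1) on debug builds, silent return otherwise; 3: always show_help, exit(1) only on debug builds) could be sketched roughly as below. This is only an illustration under stated assumptions, not the code in opal/mca/base/mca_base_param.c: every name here (demo_policy_t, demo_decide, demo_on_type_mismatch, DEMO_ERROR) is hypothetical, a plain fprintf stands in for show_help, and the debug flag is passed as an argument instead of being taken from OPAL_ENABLE_DEBUG.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a non-success status; NOT an Open MPI constant. */
#define DEMO_ERROR (-1)

/* The three options from the thread, numbered as in Jeff's list. */
typedef enum {
    POLICY_1_HELP_AND_RETURN = 1, /* show_help + return error code, always   */
    POLICY_2_DEBUG_ONLY      = 2, /* debug: show_help + exit(1); else silent */
    POLICY_3_HELP_THEN_SPLIT = 3  /* show_help; debug: exit(1), else return  */
} demo_policy_t;

/* What the library would do when a param is re-registered with a new type. */
typedef struct {
    bool show_help;    /* print a help message                     */
    bool exit_process; /* terminate instead of returning an error  */
} demo_action_t;

/* Pure decision function: no side effects, so the policies are easy to compare. */
demo_action_t demo_decide(demo_policy_t policy, bool debug_build)
{
    demo_action_t action = { false, false };

    switch (policy) {
    case POLICY_1_HELP_AND_RETURN:
        action.show_help = true;
        break;
    case POLICY_2_DEBUG_ONLY:
        action.show_help = debug_build;
        action.exit_process = debug_build;
        break;
    case POLICY_3_HELP_THEN_SPLIT:
        action.show_help = true;
        action.exit_process = debug_build;
        break;
    }
    return action;
}

/* Applies the decision. Returns DEMO_ERROR unless the policy says to exit. */
int demo_on_type_mismatch(demo_policy_t policy, bool debug_build,
                          const char *param_name)
{
    assert(param_name != NULL);

    demo_action_t action = demo_decide(policy, debug_build);
    if (action.show_help) {
        /* Stand-in for a show_help message. */
        fprintf(stderr, "MCA param \"%s\" was re-registered with a different type\n",
                param_name);
    }
    if (action.exit_process) {
        exit(1);
    }
    return DEMO_ERROR;
}
```

Whichever option is chosen, the non-exit paths only help if the caller checks the returned status, which is the concern raised in the original RFC about unchecked registration return values.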