r20275 looks good. I suggest that we CMR that into 1.3 and get rc6 rolled and tested. (Actually, Jeff just did the CMR... so off to rc6.)

--brad
On Wed, Jan 14, 2009 at 1:16 PM, Edgar Gabriel <gabr...@cs.uh.edu> wrote:

> So I am not entirely sure why the bug only happened on trunk; it could in
> theory also appear on v1.3 (is there a difference in how pointer_arrays are
> handled between the two versions?)
>
> Anyway, it passes now on both with changeset 20275. We should probably move
> that over to 1.3 as well; whether for 1.3.0 or 1.3.1, I leave that up to
> others to decide...
>
> Thanks
> Edgar
>
> Edgar Gabriel wrote:
>
>> I'm already debugging it. The good news is that it only seems to appear
>> with trunk; with 1.3 (after copying the new tuned module over), all the
>> tests pass.
>>
>> Now if somebody can tell me a trick for telling mpirun not to kill the
>> debugger under my feet, then I could even see where the problem occurs :-)
>>
>> Thanks
>> Edgar
>>
>> George Bosilca wrote:
>>
>>> All these errors are in MPI_Finalize, so it should not be that hard to
>>> find. I'll take a look later this afternoon.
>>>
>>> george.
>>>
>>> On Jan 14, 2009, at 06:41, Tim Mattox wrote:
>>>
>>>> Unfortunately, although this fixed some problems when enabling hierarch
>>>> coll, there is still a segfault in two of IU's tests that only shows up
>>>> when we set
>>>> -mca coll_hierarch_priority 100
>>>>
>>>> See this MTT summary to see how the failures improved on the trunk,
>>>> but that there are still two that segfault even at 1.4a1r20267:
>>>> http://www.open-mpi.org/mtt/index.php?do_redir=923
>>>>
>>>> This link just has the remaining failures:
>>>> http://www.open-mpi.org/mtt/index.php?do_redir=922
>>>>
>>>> So, I'll vote for applying the CMR for 1.3 since it clearly improved
>>>> things, but there is still more to be done to get coll_hierarch ready
>>>> for regular use.
>>>>
>>>> On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca <bosi...@eecs.utk.edu>
>>>> wrote:
>>>>
>>>>> Here we go by the book :)
>>>>>
>>>>> https://svn.open-mpi.org/trac/ompi/ticket/1749
>>>>>
>>>>> george.
>>>>>
>>>>> On Jan 13, 2009, at 23:40, Jeff Squyres wrote:
>>>>>
>>>>>> Let's debate tomorrow when people are around, but first you have to
>>>>>> file a CMR... :-)
>>>>>>
>>>>>> On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:
>>>>>>
>>>>>>> Unfortunately, this pinpoints the fact that we didn't test the
>>>>>>> collective module mixing enough. I went over the tuned collective
>>>>>>> functions and changed all instances to use the correct module
>>>>>>> information. It is now on the trunk, revision 20267. At the same
>>>>>>> time, I checked that all other collective components do the right
>>>>>>> thing ... and I have to admit tuned was the only faulty one.
>>>>>>>
>>>>>>> This is clearly a bug in tuned, and correcting it will allow people
>>>>>>> to use hierarch. In the current incarnation, 1.3 will mostly/always
>>>>>>> segfault when hierarch is active. I would prefer not to put a broken
>>>>>>> toy out there. How about pushing r20267 into 1.3?
>>>>>>>
>>>>>>> george.
>>>>>>>
>>>>>>> On Jan 13, 2009, at 20:13, Jeff Squyres wrote:
>>>>>>>
>>>>>>>> Thanks for digging into this. Can you file a bug? Let's mark it for
>>>>>>>> v1.3.1.
>>>>>>>>
>>>>>>>> I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch,
>>>>>>>> and since hierarch isn't currently selected by default (you must
>>>>>>>> specifically elevate hierarch's priority to get it to run), there's
>>>>>>>> no danger that users will run into this problem in default runs.
>>>>>>>>
>>>>>>>> But clearly the problem needs to be fixed, and therefore we need a
>>>>>>>> bug to track it.
>>>>>>>>
>>>>>>>> On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:
>>>>>>>>
>>>>>>>>> I just debugged the Reduce_scatter bug mentioned previously. The
>>>>>>>>> bug is unfortunately not in hierarch, but in tuned.
>>>>>>>>>
>>>>>>>>> Here is the code snippet causing the problems:
>>>>>>>>>
>>>>>>>>> int reduce_scatter (..., mca_coll_base_module_t *module)
>>>>>>>>> {
>>>>>>>>>     ...
>>>>>>>>>     err = comm->c_coll.coll_reduce (..., module);
>>>>>>>>>     ...
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> but should be
>>>>>>>>>
>>>>>>>>> {
>>>>>>>>>     ...
>>>>>>>>>     err = comm->c_coll.coll_reduce (...,
>>>>>>>>>                                     comm->c_coll.coll_reduce_module);
>>>>>>>>>     ...
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> The problem as it stands is that when using hierarch, only a
>>>>>>>>> subset of the functions is set, e.g. reduce, allreduce, bcast and
>>>>>>>>> barrier. Thus, reduce_scatter comes from tuned in most scenarios and
>>>>>>>>> calls the subsequent functions with the wrong module. Hierarch of
>>>>>>>>> course does not like that :-)
>>>>>>>>>
>>>>>>>>> Anyway, a quick glance through the tuned code reveals a significant
>>>>>>>>> number of instances where this appears (reduce_scatter, allreduce,
>>>>>>>>> allgather, allgatherv). Basic, hierarch and inter seem to do that
>>>>>>>>> mostly correctly.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Edgar
>>>>>>>>> --
>>>>>>>>> Edgar Gabriel
>>>>>>>>> Assistant Professor
>>>>>>>>> Parallel Software Technologies Lab http://pstl.cs.uh.edu
>>>>>>>>> Department of Computer Science University of Houston
>>>>>>>>> Philip G.
>>>>>>>>> Hoffman Hall, Room 524 Houston, TX-77204, USA
>>>>>>>>> Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
>>>>>>>>> _______________________________________________
>>>>>>>>> devel mailing list
>>>>>>>>> de...@open-mpi.org
>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jeff Squyres
>>>>>>>> Cisco Systems
>>>>
>>>> --
>>>> Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
>>>> tmat...@gmail.com || timat...@open-mpi.org
>>>> I'm a bright... http://www.the-brights.net/
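
For readers following the thread, the module mix-up Edgar describes can be shown in miniature. What follows is only a minimal sketch, not Open MPI code: the types module_t, coll_table_t and comm_t, the "owner" field, and the functions hierarch_reduce, tuned_reduce_scatter_buggy, tuned_reduce_scatter_fixed and demo_reduce_scatter are all hypothetical stand-ins, and only the field names c_coll, coll_reduce and coll_reduce_module are taken from the snippet in the thread. Where the real hierarch code would presumably dereference component-private data and segfault, the sketch returns -1 instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a collective module. */
typedef struct module {
    const char *owner;               /* component that created this module */
} module_t;

struct comm;
typedef int (*reduce_fn_t)(struct comm *comm, module_t *module);

/* Per-communicator table: each function is paired with its own module. */
typedef struct coll_table {
    reduce_fn_t coll_reduce;         /* reduce implementation selected for this comm */
    module_t   *coll_reduce_module;  /* module that belongs to coll_reduce */
} coll_table_t;

typedef struct comm {
    coll_table_t c_coll;
} comm_t;

/* Stand-in for hierarch's reduce: only valid with hierarch's own module.
 * Returns -1 where the real code would misbehave on a foreign module. */
static int hierarch_reduce(comm_t *comm, module_t *module)
{
    (void)comm;
    assert(module != NULL);
    return strcmp(module->owner, "hierarch") == 0 ? 0 : -1;
}

/* Buggy pattern: forwards the caller's (tuned) module to whatever
 * reduce function happens to be installed on the communicator. */
static int tuned_reduce_scatter_buggy(comm_t *comm, module_t *module)
{
    return comm->c_coll.coll_reduce(comm, module);
}

/* Fixed pattern: passes the module registered alongside coll_reduce. */
static int tuned_reduce_scatter_fixed(comm_t *comm, module_t *module)
{
    (void)module;
    return comm->c_coll.coll_reduce(comm, comm->c_coll.coll_reduce_module);
}

/* Builds a communicator whose reduce comes from "hierarch" while
 * reduce_scatter comes from "tuned", then runs one of the two variants. */
int demo_reduce_scatter(int use_fix)
{
    module_t hierarch_module = { "hierarch" };
    module_t tuned_module    = { "tuned" };
    comm_t comm = { { hierarch_reduce, &hierarch_module } };

    return use_fix ? tuned_reduce_scatter_fixed(&comm, &tuned_module)
                   : tuned_reduce_scatter_buggy(&comm, &tuned_module);
}
```

Calling demo_reduce_scatter(0) exercises the buggy forwarding and fails, while demo_reduce_scatter(1) uses the module stored next to the function pointer and succeeds. This also suggests why the problem stayed hidden in default runs: when every collective on a communicator comes from the same component, the forwarded module happens to be the right one.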