Unfortunately, although this fixed some problems when enabling the hierarch coll component, two of IU's tests still segfault, and only when we set -mca coll_hierarch_priority 100.

This MTT summary shows how the failures improved on the trunk, though two tests still segfault even at 1.4a1r20267:
http://www.open-mpi.org/mtt/index.php?do_redir=923

This link has just the remaining failures:
http://www.open-mpi.org/mtt/index.php?do_redir=922

So, I'll vote for applying the CMR for 1.3 since it clearly improved things, but there is still more to be done to get coll_hierarch ready for regular use.

On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca <bosi...@eecs.utk.edu> wrote:
> Here we go by the book :)
>
> https://svn.open-mpi.org/trac/ompi/ticket/1749
>
>  george.
>
> On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:
>
>> Let's debate tomorrow when people are around, but first you have to file a
>> CMR...  :-)
>>
>> On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:
>>
>>> Unfortunately, this pinpoints the fact that we didn't test the
>>> collective module mixing thing enough. I went over the tuned collective
>>> functions and changed all instances to use the correct module
>>> information. It is now on the trunk, revision 20267. Simultaneously, I
>>> checked that all other collective components do the right thing ... and
>>> I have to admit tuned was the only faulty one.
>>>
>>> This is clearly a bug in tuned, and correcting it will allow people
>>> to use hierarch. In the current incarnation 1.3 will mostly/always
>>> segfault when hierarch is active. I would prefer not to give a broken toy
>>> out there. How about pushing r20267 into 1.3?
>>>
>>>  george.
>>>
>>>
>>> On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:
>>>
>>>> Thanks for digging into this.  Can you file a bug?  Let's mark it for
>>>> v1.3.1.
>>>>
>>>> I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch, and
>>>> since hierarch isn't currently selected by default (you must specifically
>>>> elevate hierarch's priority to get it to run), there's no danger that users
>>>> will run into this problem in default runs.
>>>>
>>>> But clearly the problem needs to be fixed, and therefore we need a bug
>>>> to track it.
>>>>
>>>>
>>>>
>>>> On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:
>>>>
>>>>> I just debugged the Reduce_scatter bug mentioned previously. The bug is
>>>>> unfortunately not in hierarch, but in tuned.
>>>>>
>>>>> Here is the code snippet causing the problems:
>>>>>
>>>>> int reduce_scatter (...., mca_coll_base_module_t *module)
>>>>> {
>>>>>    ...
>>>>>    err = comm->c_coll.coll_reduce (...., module);
>>>>>    ...
>>>>> }
>>>>>
>>>>>
>>>>> but should be
>>>>> {
>>>>>    ...
>>>>>    err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
>>>>>    ...
>>>>> }
>>>>>
>>>>> The problem as it is right now is that when using hierarch, only a
>>>>> subset of the functions is set, e.g. reduce, allreduce, bcast and barrier.
>>>>> Thus, reduce_scatter is from tuned in most scenarios, and calls the
>>>>> subsequent functions with the wrong module. Hierarch of course does not
>>>>> like that :-)
>>>>>
>>>>> Anyway, a quick glance through the tuned code reveals a significant
>>>>> number of instances where this appears (reduce_scatter, allreduce,
>>>>> allgather, allgatherv). Basic, hierarch and inter seem to do that
>>>>> mostly correctly.
>>>>>
>>>>> Thanks
>>>>> Edgar
>>>>> --
>>>>> Edgar Gabriel
>>>>> Assistant Professor
>>>>> Parallel Software Technologies Lab      http://pstl.cs.uh.edu
>>>>> Department of Computer Science          University of Houston
>>>>> Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
>>>>> Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>
>>>>
>>>> --
>>>> Jeff Squyres
>>>> Cisco Systems
>>>>
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>
--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmat...@gmail.com || timat...@open-mpi.org
    I'm a bright... http://www.the-brights.net/
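
For readers following the thread, here is a minimal sketch of the module mix-up Edgar describes in the quoted message. Everything in it is a hypothetical stand-in (module_t, comm_t, hierarch_reduce, the two reduce_scatter variants); none of it is the real Open MPI code or data layout. The real failure is a segfault when hierarch receives tuned's module; the sketch models that as an error return of -1 so it stays safe to run.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real Open MPI types. */
typedef struct module { const char *owner; } module_t;

typedef int (*reduce_fn_t)(int value, module_t *module);

typedef struct comm {
    reduce_fn_t coll_reduce;       /* reduce selected for this communicator */
    module_t *coll_reduce_module;  /* module that belongs to coll_reduce */
} comm_t;

static module_t hierarch_module = { "hierarch" };
static module_t tuned_module = { "tuned" };

/* Reduce supplied by "hierarch": succeeds (0) only with its own module.
 * In the real code a foreign module is misinterpreted and crashes. */
static int hierarch_reduce(int value, module_t *module)
{
    assert(module != NULL);
    (void)value;
    return module == &hierarch_module ? 0 : -1;
}

/* Buggy pattern: forwards the caller's own module to another
 * component's function. */
static int reduce_scatter_buggy(comm_t *comm, int value, module_t *module)
{
    return comm->coll_reduce(value, module);
}

/* Fixed pattern: passes the module stored next to the function pointer. */
static int reduce_scatter_fixed(comm_t *comm, int value, module_t *module)
{
    (void)module;
    return comm->coll_reduce(value, comm->coll_reduce_module);
}

/* Communicator where reduce comes from hierarch while reduce_scatter
 * would still come from tuned, as in the scenario from the thread. */
static comm_t make_mixed_comm(void)
{
    comm_t comm = { hierarch_reduce, &hierarch_module };
    return comm;
}
```

Calling reduce_scatter_buggy on the communicator from make_mixed_comm with &tuned_module hands hierarch a module it does not own, while reduce_scatter_fixed always pairs the function with the module registered for it, regardless of which component the caller belongs to.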