Unfortunately, although this fixed some problems when enabling hierarch coll,
there is still a segfault in two of IU's tests that only shows up when we set
-mca coll_hierarch_priority 100

This MTT summary shows how the failures improved on the trunk,
although two tests still segfault even at 1.4a1r20267:
http://www.open-mpi.org/mtt/index.php?do_redir=923

This link just has the remaining failures:
http://www.open-mpi.org/mtt/index.php?do_redir=922

So, I'll vote for applying the CMR for 1.3 since it clearly improved things,
but there is still more to be done to get coll_hierarch ready for regular
use.

On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca <bosi...@eecs.utk.edu> wrote:
> Here we go by the book :)
>
> https://svn.open-mpi.org/trac/ompi/ticket/1749
>
>  george.
>
> On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:
>
>> Let's debate tomorrow when people are around, but first you have to file a
>> CMR... :-)
>>
>> On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:
>>
>>> Unfortunately, this pinpoints the fact that we didn't test the collective
>>> module mixing enough. I went over the tuned collective functions
>>> and changed all instances to use the correct module information. It is now
>>> on the trunk, revision 20267. At the same time, I checked that all other
>>> collective components do the right thing ... and I have to admit tuned was
>>> the only faulty one.
>>>
>>> This is clearly a bug in tuned, and correcting it will allow people
>>> to use hierarch. In its current incarnation, 1.3 will mostly/always
>>> segfault when hierarch is active. I would prefer not to put a broken toy
>>> out there. How about pushing r20267 into 1.3?
>>>
>>> george.
>>>
>>>
>>> On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:
>>>
>>>> Thanks for digging into this.  Can you file a bug?  Let's mark it for
>>>> v1.3.1.
>>>>
>>>> I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch, and
>>>> since hierarch isn't currently selected by default (you must specifically
>>>> elevate hierarch's priority to get it to run), there's no danger that users
>>>> will run into this problem in default runs.
>>>>
>>>> But clearly the problem needs to be fixed, and therefore we need a bug
>>>> to track it.
>>>>
>>>>
>>>>
>>>> On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:
>>>>
>>>>> I just debugged the Reduce_scatter bug mentioned previously. The bug is
>>>>> unfortunately not in hierarch, but in tuned.
>>>>>
>>>>> Here is the code snippet causing the problems:
>>>>>
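>>>>> int reduce_scatter (...., mca_coll_base_module_t *module)
>>>>> {
>>>>> ...
>>>>> /* wrong: hands tuned's own module to whichever reduce is installed */
>>>>> err = comm->c_coll.coll_reduce (...., module);
>>>>> ...
>>>>> }
>>>>>
>>>>>
>>>>> but should be
>>>>>
>>>>> int reduce_scatter (...., mca_coll_base_module_t *module)
>>>>> {
>>>>> ...
>>>>> /* right: hands over the module that belongs to the selected reduce */
>>>>> err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
>>>>> ...
>>>>> }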
>>>>>
>>>>> The problem as it is right now is that, when using hierarch, only a
>>>>> subset of the functions is set, e.g. reduce, allreduce, bcast and barrier.
>>>>> Thus, reduce_scatter comes from tuned in most scenarios and calls the
>>>>> subsequent functions with the wrong module. Hierarch of course does not
>>>>> like that :-)
>>>>>
>>>>> Anyway, a quick glance through the tuned code reveals a significant
>>>>> number of instances where this appears (reduce_scatter, allreduce,
>>>>> allgather, allgatherv). Basic, hierarch and inter seem to do that mostly
>>>>> correctly.
>>>>>
>>>>> Thanks
>>>>> Edgar
>>>>> --
>>>>> Edgar Gabriel
>>>>> Assistant Professor
>>>>> Parallel Software Technologies Lab      http://pstl.cs.uh.edu
>>>>> Department of Computer Science          University of Houston
>>>>> Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
>>>>> Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>
>>>>
>>>> --
>>>> Jeff Squyres
>>>> Cisco Systems
>>>>
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems



-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmat...@gmail.com || timat...@open-mpi.org
    I'm a bright... http://www.the-brights.net/
