Unfortunately, this pinpoints the fact that we didn't sufficiently test
mixing collective modules. I went over the tuned collective
functions and changed all instances to use the correct module
information. It is now on the trunk, revision 20267. At the same time, I
checked that all other collective components do the right thing ...
and I have to admit tuned was the only faulty one.
This is clearly a bug in tuned, and correcting it will allow
people to use hierarch. In its current incarnation, 1.3 will mostly or
always segfault when hierarch is active. I would prefer not to put a
broken toy out there. How about pushing r20267 into 1.3?
george.
On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:
Thanks for digging into this. Can you file a bug? Let's mark it
for v1.3.1.
I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch,
and since hierarch isn't currently selected by default (you must
specifically elevate hierarch's priority to get it to run), there's
no danger that users will run into this problem in default runs.
But clearly the problem needs to be fixed, and therefore we need a
bug to track it.
On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:
I just debugged the Reduce_scatter bug mentioned previously. The
bug is unfortunately not in hierarch, but in tuned.
Here is the code snippet causing the problems:
int reduce_scatter (...., mca_coll_base_module_t *module)
{
...
err = comm->c_coll.coll_reduce (...., module);
...
}
but should be
{
...
err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
...
}
The problem as it stands is that, when using hierarch, only a
subset of the functions is set, e.g. reduce, allreduce, bcast and
barrier. Thus, reduce_scatter comes from tuned in most scenarios, and
calls the subsequent functions with the wrong module. Hierarch of
course does not like that :-)
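To make the module mix-up concrete, here is a minimal, self-contained sketch of the pattern. It does not use the real Open MPI types: module_t, comm_t, hierarch_reduce, the two tuned_reduce_scatter_* functions and demo are all hypothetical stand-ins invented for illustration, and the reduce is collapsed to passing one int through. The only thing taken from the snippet above is the idea that each function pointer on the communicator has a companion module pointer, and that the callee must receive that companion rather than the caller's own module.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a collective module: it only records its owner. */
typedef struct module { const char *owner; } module_t;

typedef int (*reduce_fn_t)(int value, module_t *module);

/* Hypothetical stand-in for the per-communicator collective table. */
typedef struct comm {
    reduce_fn_t coll_reduce;         /* reduce implementation selected for this comm */
    module_t   *coll_reduce_module;  /* module that belongs to coll_reduce */
} comm_t;

/* Stand-in for hierarch's reduce: only valid with hierarch's own module.
 * Returning -1 models the failure; real code would read foreign module
 * data and could segfault instead. */
static int hierarch_reduce(int value, module_t *module)
{
    assert(module != NULL);
    if (strcmp(module->owner, "hierarch") != 0) {
        return -1;
    }
    return value;
}

/* Buggy pattern: forwards tuned's own module to whatever reduce is installed. */
static int tuned_reduce_scatter_buggy(comm_t *comm, int value, module_t *module)
{
    return comm->coll_reduce(value, module);
}

/* Fixed pattern: forwards the module stored next to the function pointer. */
static int tuned_reduce_scatter_fixed(comm_t *comm, int value, module_t *module)
{
    (void)module;  /* tuned's own module is not what the callee expects */
    return comm->coll_reduce(value, comm->coll_reduce_module);
}

/* Sets up a comm where reduce comes from hierarch and reduce_scatter
 * comes from tuned, then runs the buggy or the fixed variant. */
static int demo(int use_fixed)
{
    module_t hierarch = { "hierarch" };
    module_t tuned    = { "tuned" };
    comm_t   comm     = { hierarch_reduce, &hierarch };

    return use_fixed ? tuned_reduce_scatter_fixed(&comm, 42, &tuned)
                     : tuned_reduce_scatter_buggy(&comm, 42, &tuned);
}
```

In this sketch demo(0) takes the buggy path and returns -1, because hierarch_reduce is handed tuned's module, while demo(1) returns 42, because the callee gets the module registered alongside it.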
Anyway, a quick glance through the tuned code reveals a significant
number of instances where this appears (reduce_scatter, allreduce,
allgather, allgatherv). Basic, hierarch and inter seem to do that
mostly correctly.
Thanks
Edgar
--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab http://pstl.cs.uh.edu
Department of Computer Science University of Houston
Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Jeff Squyres
Cisco Systems