I just debugged the Reduce_scatter bug mentioned previously. The bug is
unfortunately not in hierarch, but in tuned.
Here is the code snippet causing the problems:
int reduce_scatter (...., mca_coll_base_module_t *module)
{
...
err = comm->c_coll.coll_reduce (...., module);
...
}
but it should be:
{
...
err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
...
}
The problem as it stands is that when hierarch is used, only a
subset of the functions is set, e.g. reduce, allreduce, bcast and
barrier. Thus, reduce_scatter comes from tuned in most scenarios and
calls the subsequent functions with the wrong module, i.e. with tuned's
own module instead of the one belonging to the function being called.
Hierarch of course does not like that :-)
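
To make the mismatch concrete, here is a minimal, self-contained sketch.
It is not the actual Open MPI code: the types module_t, coll_table_t and
comm_t, the helper make_mixed_comm and the return values 0 / -1 are all
made up for illustration. Only the field names c_coll, coll_reduce and
coll_reduce_module are taken from the snippet above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the real structures. */
typedef struct module { const char *owner; } module_t;

struct comm;
typedef int (*reduce_fn_t)(struct comm *comm, module_t *module);

typedef struct coll_table {
    reduce_fn_t coll_reduce;        /* selected reduce implementation */
    module_t   *coll_reduce_module; /* module that owns coll_reduce   */
} coll_table_t;

typedef struct comm { coll_table_t c_coll; } comm_t;

static module_t hierarch_module = { "hierarch" };
static module_t tuned_module    = { "tuned" };

/* Mock of hierarch's reduce: succeeds only with hierarch's own module. */
static int hierarch_reduce(comm_t *comm, module_t *module)
{
    assert(comm != NULL);
    return (module == &hierarch_module) ? 0 : -1;
}

/* Buggy pattern: forwards the caller's (tuned's) module to reduce. */
static int tuned_reduce_scatter_buggy(comm_t *comm, module_t *module)
{
    return comm->c_coll.coll_reduce(comm, module);
}

/* Fixed pattern: passes the module registered alongside coll_reduce. */
static int tuned_reduce_scatter_fixed(comm_t *comm, module_t *module)
{
    (void)module; /* tuned's own module is not what reduce expects */
    return comm->c_coll.coll_reduce(comm, comm->c_coll.coll_reduce_module);
}

/* Mixed setup: reduce comes from hierarch, reduce_scatter from tuned. */
static comm_t make_mixed_comm(void)
{
    comm_t comm;
    comm.c_coll.coll_reduce        = hierarch_reduce;
    comm.c_coll.coll_reduce_module = &hierarch_module;
    return comm;
}
```

With a communicator built by make_mixed_comm(), calling
tuned_reduce_scatter_buggy(&comm, &tuned_module) hands tuned's module to
the mock hierarch reduce, which rejects it, while
tuned_reduce_scatter_fixed(&comm, &tuned_module) passes the matching
module and succeeds.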
Anyway, a quick glance through the tuned code reveals a significant
number of instances where this pattern appears (reduce_scatter,
allreduce, allgather, allgatherv). Basic, hierarch and inter seem to
handle this mostly correctly.
Thanks
Edgar
--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab http://pstl.cs.uh.edu
Department of Computer Science University of Houston
Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335