I just debugged the Reduce_scatter bug mentioned previously. The bug is unfortunately not in hierarch, but in tuned.

Here is the code snippet causing the problem:

int reduce_scatter (...., mca_coll_base_module_t *module)
{
...
   err = comm->c_coll.coll_reduce (...., module);
...
}


but it should be:
{
...
  err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
...
}

The problem right now is that when hierarch is used, it sets only a subset of the collective functions, e.g. reduce, allreduce, bcast and barrier. Thus, in most scenarios reduce_scatter comes from tuned, which then calls the subsequent functions with its own module, i.e. the wrong one for whichever component actually provides them. Hierarch of course does not like that :-)
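
To make the mismatch concrete, here is a small self-contained toy that mimics the situation. All names in it (toy_module_t, toy_coll_t, hierarch_reduce, and so on) are hypothetical stand-ins, not the real Open MPI definitions; only the pairing of a function pointer with its own module pointer is taken from the snippet above.

```c
#include <assert.h>

/* Toy stand-ins, NOT the real Open MPI types. */
typedef struct toy_module { const char *name; } toy_module_t;
typedef int (*toy_reduce_fn_t)(toy_module_t *module);

/* Each function pointer is stored together with the module that owns it. */
typedef struct toy_coll {
    toy_reduce_fn_t coll_reduce;
    toy_module_t   *coll_reduce_module;
} toy_coll_t;

static toy_module_t hierarch_module = { "hierarch" };
static toy_module_t tuned_module    = { "tuned" };

/* Pretend hierarch's reduce: succeeds only when handed its own module. */
static int hierarch_reduce(toy_module_t *module)
{
    return (module == &hierarch_module) ? 0 : -1;
}

/* Buggy pattern: forwards the caller's (tuned's) module to reduce. */
static int reduce_scatter_buggy(toy_coll_t *coll, toy_module_t *module)
{
    return coll->coll_reduce(module);
}

/* Fixed pattern: passes the module registered alongside coll_reduce. */
static int reduce_scatter_fixed(toy_coll_t *coll, toy_module_t *module)
{
    (void)module; /* tuned's own module is not needed for this call */
    return coll->coll_reduce(coll->coll_reduce_module);
}

/* reduce comes from hierarch, reduce_scatter is invoked with tuned's module. */
static void check_dispatch(void)
{
    toy_coll_t coll = { hierarch_reduce, &hierarch_module };
    assert(reduce_scatter_buggy(&coll, &tuned_module) == -1);
    assert(reduce_scatter_fixed(&coll, &tuned_module) == 0);
}
```

In this toy the buggy variant fails because hierarch_reduce receives tuned's module, while the fixed variant succeeds because it looks up the module that was registered together with the function pointer.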

Anyway, a quick glance through the tuned code reveals a significant number of instances where this pattern appears (reduce_scatter, allreduce, allgather, allgatherv). Basic, hierarch and inter seem to handle this mostly correctly.

Thanks
Edgar
--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab      http://pstl.cs.uh.edu
Department of Computer Science          University of Houston
Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335
