If your installation supports FCA, the collective may be offloaded to the
hardware. In any case, most of the bcast algorithms are tree-based and
behave logarithmically, so the root reads the buffer at most O(log(P))
times, where P is the number of processes in the communicator.
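
To make the O(log(P)) claim concrete, here is a minimal sketch that simulates
a binomial-tree broadcast schedule and counts how many sends the root issues,
compared with n distinct sends. It is a plain-Python illustration, not Open
MPI's code: the function names (binomial_bcast_schedule, root_sends) are made
up for this example, and a real implementation may also segment large
messages or pick a different tree depending on message and communicator size.

```python
def binomial_bcast_schedule(nprocs, root=0):
    """Return the rounds of a binomial-tree broadcast.

    Each round is a list of (sender, receiver) rank pairs. Ranks are
    shifted so that the root behaves as virtual rank 0.
    (Hypothetical helper for illustration only.)
    """
    rounds = []
    mask = 1
    while mask < nprocs:
        pairs = []
        # Virtual ranks 0..mask-1 already hold the data in this round;
        # each of them forwards it to the rank `mask` positions away.
        for vrank in range(mask):
            peer = vrank + mask
            if peer < nprocs:
                sender = (vrank + root) % nprocs
                receiver = (peer + root) % nprocs
                pairs.append((sender, receiver))
        rounds.append(pairs)
        mask <<= 1
    return rounds


def root_sends(nprocs, root=0):
    """Count how many times the root itself sends (reads its buffer)."""
    return sum(
        1
        for rnd in binomial_bcast_schedule(nprocs, root)
        for sender, _ in rnd
        if sender == root
    )


if __name__ == "__main__":
    for p in (2, 8, 16, 100):
        # A naive loop of point-to-point sends makes the root send p - 1 times.
        print(f"P={p}: linear sends={p - 1}, binomial root sends={root_sends(p)}")
```

In this model the root sends once per round, so its send count grows with
the number of rounds (the ceiling of log2(P)) rather than with P itself; the
remaining transfers are done by ranks that already received the data.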

If you want to look at the Open MPI code to understand which function is
called in your specific case, go to ompi/mca/coll/tuned/ and search for
the ompi_coll_tuned_bcast_intra_dec_fixed function in
coll_tuned_decision_fixed.c.


On Wed, Mar 20, 2019 at 4:53 AM marcin.krotkiewski <
marcin.krotkiew...@gmail.com> wrote:

> Hi!
> I'm wondering about the details of Bcast implementation in OpenMPI. I'm
> specifically interested in IB interconnects, but information about other
> architectures (and OpenMPI in general) would also be very useful.
> I am working with a code that sends the same (large) message to a
> bunch of 'neighboring' processes. Somewhat like a ghost-zone exchange,
> but the message is the same for all neighbors. Since memory bandwidth is
> a scarce resource, I'd like to make sure we send the message with fewest
> possible memory accesses.
> Hence the question: what does OpenMPI (and specifically for the IB case
> - the HPCX) do in such case? Does it get the buffer from memory O(1)
> times to send it to n peers, and the broadcast is orchestrated by the
> hardware? Or does it have to read the memory O(n) times? Is it more
> efficient to use Bcast, or is it the same as implementing the operation
> by n distinct send / put operations? Finally, is there any way to use
> the RMA put method with multiple targets, so that I only have to read
> the host memory once, and the switches / HCA take care of the rest?
> Thanks a lot for any insights!
> Marcin