If FCA support is available, the collective may be offloaded to the hardware. In any case, most of the bcast algorithms have logarithmic behavior, so there will be at most O(log(P)) memory accesses on the root.
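To make the O(log(P)) claim concrete, here is a minimal binomial-tree bcast sketch built from point-to-point calls. This is illustrative only, not the code Open MPI actually runs; the tuned component selects among several algorithms (linear, binomial, split-binary, pipeline, ...) depending on message and communicator size. The root appears as a sender in ceil(log2(P)) rounds, so its buffer is read O(log(P)) times, while the rest of the fan-out is paid by the other ranks:

#include <mpi.h>

/* Illustrative binomial-tree broadcast: every non-root rank receives
 * the message exactly once from its parent and then forwards it down
 * its own subtree; the root sends ceil(log2(size)) times. */
static void binomial_bcast(void *buf, int count, MPI_Datatype dtype,
                           int root, MPI_Comm comm)
{
    int rank, size, mask;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int vrank = (rank - root + size) % size;   /* virtual rank: root -> 0 */

    /* Receive once from the parent (never triggers on the root). */
    for (mask = 1; mask < size; mask <<= 1) {
        if (vrank & mask) {
            int parent = ((vrank - mask) + root) % size;
            MPI_Recv(buf, count, dtype, parent, 0, comm, MPI_STATUS_IGNORE);
            break;
        }
    }
    /* Forward to the children below the bit where we received;
     * on the root this loop runs ceil(log2(size)) times. */
    for (mask >>= 1; mask > 0; mask >>= 1) {
        if (vrank + mask < size) {
            int child = ((vrank + mask) + root) % size;
            MPI_Send(buf, count, dtype, child, 0, comm);
        }
    }
}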
If you want to take a look at the code in OMPI to understand what function is called in your specific case, head to ompi/mca/coll/tuned/ and search for the ompi_coll_tuned_bcast_intra_dec_fixed function in coll_tuned_decision_fixed.c.

George.

On Wed, Mar 20, 2019 at 4:53 AM marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:

> Hi!
>
> I'm wondering about the details of the Bcast implementation in Open MPI. I'm
> specifically interested in IB interconnects, but information about other
> architectures (and Open MPI in general) would also be very useful.
>
> I am working with a code which sends the same (large) message to a
> bunch of 'neighboring' processes, somewhat like a ghost-zone exchange,
> but the message is the same for all neighbors. Since memory bandwidth is
> a scarce resource, I'd like to make sure we send the message with the
> fewest possible memory accesses.
>
> Hence the question: what does Open MPI (and specifically for the IB case,
> HPCX) do in such a case? Does it read the buffer from memory O(1)
> times to send it to n peers, with the broadcast orchestrated by the
> hardware? Or does it have to read the memory O(n) times? Is it more
> efficient to use Bcast, or is it the same as implementing the operation
> with n distinct send/put operations? Finally, is there any way to use
> the RMA put method with multiple targets, so that I only have to read
> the host memory once and the switches/HCA take care of the rest?
>
> Thanks a lot for any insights!
>
> Marcin
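As a side note on the question above: one way to get a Bcast tree over a fixed neighbor set is to build a sub-communicator once and broadcast on it, instead of issuing n individual sends. A rough sketch follows; the neighbors array and its length nn are placeholders for the application's own neighbor list, the root is assumed not to appear in its own neighbor list, and error checking is omitted:

#include <mpi.h>
#include <stdlib.h>

/* Build a communicator containing `root` plus its neighbors, so that a
 * single MPI_Bcast can replace n point-to-point sends. */
MPI_Comm make_neighbor_comm(MPI_Comm comm, int root,
                            const int *neighbors, int nn)
{
    MPI_Group world_group, sub_group;
    MPI_Comm  sub_comm;
    int *members = malloc((nn + 1) * sizeof(int));

    members[0] = root;                 /* root becomes rank 0 in sub_comm */
    for (int i = 0; i < nn; i++)
        members[i + 1] = neighbors[i];

    MPI_Comm_group(comm, &world_group);
    MPI_Group_incl(world_group, nn + 1, members, &sub_group);
    /* Collective over `comm`; ranks outside the group get MPI_COMM_NULL. */
    MPI_Comm_create(comm, sub_group, &sub_comm);

    MPI_Group_free(&sub_group);
    MPI_Group_free(&world_group);
    free(members);
    return sub_comm;
}

/* Usage: the sender is rank 0 of sub_comm by construction, e.g.
 *   MPI_Bcast(buf, count, MPI_BYTE, 0, sub_comm);                */

With overlapping neighbor sets each rank ends up in several such communicators, one per sending root; MPI_Comm_create_group (MPI-3), which is collective only over the group, can avoid the full-communicator collectivity of MPI_Comm_create when that becomes a problem.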
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel