Marcin,

HPC-X implements the MPI_Bcast operation by leveraging the hardware multicast capabilities of the fabric. Starting with HPC-X v2.3 we introduced a new multicast-based algorithm for large messages as well. Hardware multicast scales as O(1) (modulo switch hops), and it is the most efficient way to broadcast a message on an IB network.
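From the application side nothing changes: a single MPI_Bcast call is all it takes, and HCOLL decides underneath whether to take the multicast path. A minimal sketch follows; the launch line in the comment reflects the usual HCOLL knobs as I remember them, so please double-check them against your HPC-X release notes.

/* A plain MPI_Bcast of a large buffer. With HCOLL active, HPC-X can map
 * this onto hardware multicast, so the root reads the source buffer O(1)
 * times regardless of the number of receivers.
 *
 * Typical launch (flag and variable names from memory; verify against
 * your HPC-X documentation):
 *   mpirun --mca coll_hcoll_enable 1 -x HCOLL_ENABLE_MCAST_ALL=1 ./bcast_demo
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;              /* ~4 MB payload: the large-message path */
    int *buf = malloc(n * sizeof(int));
    if (rank == 0)
        for (int i = 0; i < n; i++)
            buf[i] = i;

    /* Every rank in the communicator makes the matching call: the root
     * sends, everyone else receives into buf. */
    MPI_Bcast(buf, n, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d: buf[0] = %d\n", rank, buf[0]);

    free(buf);
    MPI_Finalize();
    return 0;
}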
Hope this helps.

Best,
Josh

On Thu, Mar 21, 2019 at 5:01 AM marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:

> Thanks, George! So, the function you mentioned is used when I turn off
> HCOLL and use Open MPI's tuned coll instead. That helps a lot. Another
> thing that makes me think is that in my case the data is sent to the
> targets asynchronously, or rather - it is a 'put' operation in nature, and
> the targets don't know when the data is ready. I guess the tree algorithms
> you mentioned require active participation of all nodes, otherwise the
> algorithm will not progress? Is it enough to call any MPI routine to
> assure progression, or do I have to call the matching Bcast?
>
> Anyone from Mellanox here who knows how HCOLL does this internally,
> especially on the EDR architecture? Is there any hardware aid?
>
> Thanks!
>
> Marcin
>
>
> On 3/20/19 5:10 PM, George Bosilca wrote:
>
> If you have support for FCA then it might happen that the collective will
> use the hardware support. In any case, most of the bcast algorithms have a
> logarithmic behavior, so there will be at most O(log(P)) memory accesses
> on the root.
>
> If you want to take a look at the code in OMPI to understand what
> function is called in your specific case, head to ompi/mca/coll/tuned/
> and search for the ompi_coll_tuned_bcast_intra_dec_fixed function in
> coll_tuned_decision_fixed.c.
>
> George.
>
>
> On Wed, Mar 20, 2019 at 4:53 AM marcin.krotkiewski
> <marcin.krotkiew...@gmail.com> wrote:
>
>> Hi!
>>
>> I'm wondering about the details of the Bcast implementation in Open MPI.
>> I'm specifically interested in IB interconnects, but information about
>> other architectures (and Open MPI in general) would also be very useful.
>>
>> I am working with a code which sends the same (large) message to a
>> bunch of 'neighboring' processes. Somewhat like a ghost-zone exchange,
>> but the message is the same for all neighbors. Since memory bandwidth is
>> a scarce resource, I'd like to make sure we send the message with the
>> fewest possible memory accesses.
>>
>> Hence the question: what does Open MPI (and specifically, for the IB
>> case, HPC-X) do in such a case? Does it read the buffer from memory O(1)
>> times to send it to n peers, with the broadcast orchestrated by the
>> hardware? Or does it have to read the memory O(n) times? Is it more
>> efficient to use Bcast, or is it the same as implementing the operation
>> by n distinct send / put operations? Finally, is there any way to use
>> the RMA put method with multiple targets, so that I only have to read
>> the host memory once, and the switches / HCA take care of the rest?
>>
>> Thanks a lot for any insights!
>>
>> Marcin
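P.S. For the archive, here is a sketch of the point-to-point alternative raised in the quoted question. The root posts one send per neighbor, so it streams the source buffer out of host memory once per peer (O(n) root-side traffic), whereas a single MPI_Bcast lets the library keep that at O(log P) with a tree or O(1) with multicast. The neighbor count and peer array are illustrative only, and each peer would need to post a matching receive:

#include <mpi.h>

#define NNEIGHBORS 8   /* illustrative; not from the thread */

/* Same payload to every neighbor: n distinct sends, hence n reads of buf.
 * Each peer must post a matching MPI_Recv (or MPI_Irecv) with tag 0. */
void ghost_exchange(const double *buf, int count,
                    const int peers[NNEIGHBORS], MPI_Comm comm)
{
    MPI_Request reqs[NNEIGHBORS];

    for (int i = 0; i < NNEIGHBORS; i++)
        MPI_Isend(buf, count, MPI_DOUBLE, peers[i], /* tag */ 0,
                  comm, &reqs[i]);

    MPI_Waitall(NNEIGHBORS, reqs, MPI_STATUSES_IGNORE);
}

And on the progression question: MPI_Bcast is a blocking collective, so every rank of the communicator has to make the matching Bcast call; calling some other MPI routine on the non-root ranks is not enough to complete the broadcast.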
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel