Marcin,

HPC-X implements the MPI_Bcast operation by leveraging the hardware multicast capabilities of the fabric. Starting with HPC-X v2.3 we introduced a new multicast-based algorithm for large messages as well. Hardware multicast scales as O(1) (modulo switch hops), and it is the most efficient way to broadcast a message on an IB network.
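From the application side nothing changes: a single MPI_Bcast call is all it takes, and HCOLL decides underneath whether to take the multicast path. A minimal sketch follows; the launch line in the comment reflects the usual HCOLL knobs as I remember them, so please double-check them against your HPC-X release notes.

/* A plain MPI_Bcast of a large buffer. With HCOLL active, HPC-X can map
 * this onto hardware multicast, so the root reads the source buffer O(1)
 * times regardless of the number of receivers.
 *
 * Typical launch (flag and variable names from memory; verify against
 * your HPC-X documentation):
 *   mpirun --mca coll_hcoll_enable 1 -x HCOLL_ENABLE_MCAST_ALL=1 ./bcast_demo
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;              /* ~4 MB payload: the large-message path */
    int *buf = malloc(n * sizeof(int));
    if (rank == 0)
        for (int i = 0; i < n; i++)
            buf[i] = i;

    /* Every rank in the communicator makes the matching call: the root
     * sends, everyone else receives into buf. */
    MPI_Bcast(buf, n, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d: buf[0] = %d\n", rank, buf[0]);

    free(buf);
    MPI_Finalize();
    return 0;
}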
Hope this helps.

Best,
Josh

On Thu, Mar 21, 2019 at 5:01 AM marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:

> Thanks, George! So, the function you mentioned is used when I turn off
> HCOLL and use Open MPI's tuned coll instead. That helps a lot. Another
> thing that makes me think is that in my case the data is sent to the
> targets asynchronously, or rather - it is a 'put' operation in nature, and
> the targets don't know when the data is ready. I guess the tree algorithms
> you mentioned require active participation of all nodes, otherwise the
> algorithm will not progress? Is it enough to call any MPI routine to
> assure progression, or do I have to call the matching Bcast?
>
> Anyone from Mellanox here who knows how HCOLL does this internally,
> especially on the EDR architecture? Is there any hardware aid?
>
> Thanks!
>
> Marcin
>
>
> On 3/20/19 5:10 PM, George Bosilca wrote:
>
> If you have support for FCA then it might happen that the collective will
> use the hardware support. In any case, most of the bcast algorithms have a
> logarithmic behavior, so there will be at most O(log(P)) memory accesses
> on the root.
>
> If you want to take a look at the code in OMPI to understand what
> function is called in your specific case, head to ompi/mca/coll/tuned/
> and search for the ompi_coll_tuned_bcast_intra_dec_fixed function in
> coll_tuned_decision_fixed.c.
>
> George.
>
>
> On Wed, Mar 20, 2019 at 4:53 AM marcin.krotkiewski
> <marcin.krotkiew...@gmail.com> wrote:
>
>> Hi!
>>
>> I'm wondering about the details of the Bcast implementation in Open MPI.
>> I'm specifically interested in IB interconnects, but information about
>> other architectures (and Open MPI in general) would also be very useful.
>>
>> I am working with a code which sends the same (large) message to a
>> bunch of 'neighboring' processes. Somewhat like a ghost-zone exchange,
>> but the message is the same for all neighbors. Since memory bandwidth is
>> a scarce resource, I'd like to make sure we send the message with the
>> fewest possible memory accesses.
>>
>> Hence the question: what does Open MPI (and specifically, for the IB
>> case, HPC-X) do in such a case? Does it read the buffer from memory O(1)
>> times to send it to n peers, with the broadcast orchestrated by the
>> hardware? Or does it have to read the memory O(n) times? Is it more
>> efficient to use Bcast, or is it the same as implementing the operation
>> by n distinct send / put operations? Finally, is there any way to use
>> the RMA put method with multiple targets, so that I only have to read
>> the host memory once, and the switches / HCA take care of the rest?
>>
>> Thanks a lot for any insights!
>>
>> Marcin
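P.S. For the archive, here is a sketch of the point-to-point alternative raised in the quoted question. The root posts one send per neighbor, so it streams the source buffer out of host memory once per peer (O(n) root-side traffic), whereas a single MPI_Bcast lets the library keep that at O(log P) with a tree or O(1) with multicast. The neighbor count and peer array are illustrative only, and each peer would need to post a matching receive:

#include <mpi.h>

#define NNEIGHBORS 8   /* illustrative; not from the thread */

/* Same payload to every neighbor: n distinct sends, hence n reads of buf.
 * Each peer must post a matching MPI_Recv (or MPI_Irecv) with tag 0. */
void ghost_exchange(const double *buf, int count,
                    const int peers[NNEIGHBORS], MPI_Comm comm)
{
    MPI_Request reqs[NNEIGHBORS];

    for (int i = 0; i < NNEIGHBORS; i++)
        MPI_Isend(buf, count, MPI_DOUBLE, peers[i], /* tag */ 0,
                  comm, &reqs[i]);

    MPI_Waitall(NNEIGHBORS, reqs, MPI_STATUSES_IGNORE);
}

And on the progression question: MPI_Bcast is a blocking collective, so every rank of the communicator has to make the matching Bcast call; calling some other MPI routine on the non-root ranks is not enough to complete the broadcast.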
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel