Based on your description, you might want to investigate non-blocking collectives (e.g. MPI_Ibcast) or even the upcoming persistent collectives (e.g. MPIX_Bcast_init).

If you know the address of the receive buffer, you can call MPI_Ibcast() on the non-root ranks very early, and then use MPI_Test() to poll for completion or MPI_Wait() to block until it completes.

On the root rank, you call MPI_Ibcast() (followed by MPI_Wait()) once the data to be sent is available in the buffer.



On 3/22/2019 5:14 PM, marcin.krotkiewski wrote:

On 3/21/19 5:31 PM, George Bosilca wrote:
I am not sure I understand your question. A bcast is a collective operation that must be posted by all participants. Regardless of the level at which the bcast is serviced, if some of the participants have not posted their participation in the collective, only partial progress can be made.

Of course you're right, and all ranks will post the Bcast eventually, but not necessarily at the same time. My concern is that, with my understanding of the tree algorithm, if a non-leaf tree node is busy (e.g., with computations) at the time the node upstream tries to send it a message, then it will take time until it can propagate the message downstream to its sub-tree. During that time the sub-tree nodes may be done computing and idle, waiting for the Bcasted message. So in a realistic scenario, in which not all the nodes participate in a Bcast at the same time, a tree algorithm might not be perfect. Am I wrong?

I guess this problem can be somehow mitigated by a 'smart' algorithm, which would identify such stalls and re-send the message to the sub-tree in question from another node. However, it seems that in a pessimistic scenario this degrades to a simple one-sends-to-all implementation. And my main reason for asking is to minimize the memory bandwidth overhead associated with sending the message.
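
The trade-off described above can be illustrated with a toy timing model. This is a sketch under strong assumptions and not Open MPI's actual algorithm: every send costs one fixed time unit, a rank can receive and forward only after it has entered the Bcast (no eager buffering, no asynchronous progress), and the tree is a plain binomial tree rooted at rank 0 in which each rank sends to its children one after another. All function names are hypothetical.

```c
#include <stddef.h>

/* Largest power of two that is <= r, for r >= 1. */
static int highest_bit(int r)
{
    int b = 1;
    while (2 * b <= r)
        b *= 2;
    return b;
}

/* Binomial tree rooted at rank 0. ready[r] is the time rank r enters the
 * Bcast; have[r] is the time rank r holds the data and may forward it.
 * The parent of r is r minus its highest set bit. */
void binomial_times(const double *ready, double *have, int nprocs, double cost)
{
    have[0] = ready[0];
    for (int r = 1; r < nprocs; r++) {
        int hb = highest_bit(r);
        int parent = r - hb;
        /* The parent's children are parent + b for b = first, 2*first, ... */
        int first = (parent == 0) ? 1 : 2 * highest_bit(parent);
        int slot = 1;                       /* position of r among them */
        for (int b = first; b < hb; b *= 2)
            slot++;
        double arrival = have[parent] + slot * cost;
        have[r] = (arrival > ready[r]) ? arrival : ready[r];
    }
}

/* Flat (one-sends-to-all) scheme: the root sends to ranks 1..P-1 in order. */
void flat_times(const double *ready, double *have, int nprocs, double cost)
{
    have[0] = ready[0];
    for (int r = 1; r < nprocs; r++) {
        double arrival = ready[0] + r * cost;
        have[r] = (arrival > ready[r]) ? arrival : ready[r];
    }
}

/* Number of times the root reads its buffer to send it. */
int root_sends_binomial(int nprocs)
{
    int count = 0;
    for (int b = 1; b < nprocs; b *= 2)
        count++;
    return count;
}

int root_sends_flat(int nprocs)
{
    return (nprocs > 1) ? nprocs - 1 : 0;
}
```

With 8 ranks, all ready at time 0 except rank 1, which is busy until time 10, this model delays ranks 3, 5 and 7 (the sub-tree of rank 1) until times 11, 12 and 12 in the tree case, while the flat scheme serves every idle rank by time 7. The price is 7 buffer reads on the root instead of 3.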



On Thu, Mar 21, 2019 at 12:24 PM Joshua Ladd wrote:


    HPC-X implements the MPI BCAST operation by leveraging hardware
    multicast capabilities. Starting with HPC-X v2.3 we introduced a
    new multicast based algorithm for large messages as well.
    Hardware multicast scales as O(1) modulo switch hops. It is the
    most efficient way to broadcast a message in an IB network.

    Hope this helps.



    On Thu, Mar 21, 2019 at 5:01 AM marcin.krotkiewski wrote:

        Thanks, George! So, the function you mentioned is used when I
        turn off HCOLL and use OpenMPI's tuned coll instead. That
        helps a lot. Another thing that makes me think is that in my
        case the data is sent to the targets asynchronously, or
        rather, it is a 'put' operation in nature, and the targets
        don't know when the data is ready. I guess the tree
        algorithms you mentioned require active participation of all
        nodes, otherwise the algorithm will not progress? Is it
        enough to call any MPI routine to ensure progress, or do I
        have to call the matching Bcast?

        Is there anyone from Mellanox here who knows how HCOLL does
        this internally, especially on the EDR architecture? Is there
        any hardware aid?



        On 3/20/19 5:10 PM, George Bosilca wrote:
        If you have support for FCA then it might happen that the
        collective will use the hardware support. In any case, most
        of the bcast algorithms have a logarithmic behavior, so
        there will be at most O(log(P)) memory accesses on the root.

        If you want to take a look at the code in OMPI to understand
        what function is called in your specific case, head
        to ompi/mca/coll/tuned/ and search for the
        ompi_coll_tuned_bcast_intra_dec_fixed function
        in coll_tuned_decision_fixed.c.


        On Wed, Mar 20, 2019 at 4:53 AM marcin.krotkiewski wrote:


            I'm wondering about the details of the Bcast
            implementation in OpenMPI. I'm specifically interested
            in IB interconnects, but information about other
            architectures (and OpenMPI in general) would also be
            very useful.

            I am working with a code that sends the same (large)
            message to a bunch of 'neighboring' processes. It is
            somewhat like a ghost-zone exchange, but the message is
            the same for all neighbors. Since memory bandwidth is a
            scarce resource, I'd like to make sure we send the
            message with the fewest possible memory accesses.

            Hence the question: what does OpenMPI (and, for the IB
            case specifically, HPC-X) do in such a case? Does it
            read the buffer from memory O(1) times to send it to n
            peers, with the broadcast orchestrated by the hardware?
            Or does it have to read the memory O(n) times? Is it
            more efficient to use Bcast, or is it the same as
            implementing the operation with n distinct send / put
            operations? Finally, is there any way to use the RMA
            put method with multiple targets, so that I only have
            to read the host memory once, and the switches / HCA
            take care of the rest?

            Thanks a lot for any insights!


            devel mailing list
