Re: [OMPI devel] Memory performance with Bcast

Gilles Gouaillardet Sun, 24 Mar 2019 17:08:20 -0700

Marcin,

Based on your description, you might want to investigate non-blocking collectives (e.g. MPI_Ibcast) or even the upcoming persistent collectives (e.g. MPIX_Bcast_init).

If you know the address of the receive buffer, then you can call MPI_Ibcast() on the non-root ranks very early, and then MPI_Test() or MPI_Wait() to wait for completion.

And you will call MPI_Ibcast() (followed by MPI_Wait()) on the root rank when the data to be sent is available in the buffer.
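
The pattern described above can be sketched as follows. This is only a minimal sketch, assuming Python and a communicator object with mpi4py-style methods (`Ibcast`, `Get_rank`) whose requests offer `Test` and `Wait`. The helper names `receive_early`, `send_when_ready`, `compute_chunk` and `produce` are hypothetical and do not come from this thread. Two things to keep in mind: the receive buffer must not be touched until the request completes, and in a library without asynchronous progress the calls to `Test` are typically what lets the collective advance.

```python
def receive_early(comm, buf, root, compute_chunk):
    """Non-root side: post the broadcast early, then overlap it with work.

    `comm` is assumed to behave like an mpi4py communicator. `buf` must not
    be read or written until the request has completed.
    Returns the number of work chunks executed while waiting.
    """
    request = comm.Ibcast(buf, root=root)  # returns immediately
    chunks = 0
    while not request.Test():              # poll for completion
        compute_chunk()                    # hypothetical unit of application work
        chunks += 1
    return chunks


def send_when_ready(comm, buf, produce):
    """Root side: fill the buffer first, then post the broadcast and wait."""
    produce(buf)                           # hypothetical: writes the payload into buf
    request = comm.Ibcast(buf, root=comm.Get_rank())
    request.Wait()                         # buf may be reused once this returns
```

With mpi4py installed, `comm` would be `MPI.COMM_WORLD` and `buf` a writable buffer such as a `bytearray`; every rank has to pass the same `root`.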



Cheers,


Gilles

On 3/22/2019 5:14 PM, marcin.krotkiewski wrote:



On 3/21/19 5:31 PM, George Bosilca wrote:

I am not sure I understand your question; a bcast is a collective operation that must be posted by all participants. Regardless of the level at which the bcast is serviced, if some of the participants have not posted their participation to the collective, only partial progress can be made.

Of course you're right, and all ranks will post the Bcast eventually, but not necessarily at the same time. My concern is that, with my understanding of the tree algorithm, if a non-leaf tree node is busy (e.g., with computations) at the time the node upstream tries to send it a message, then it will take time until it can propagate the message downstream to its sub-tree. During that time the sub-tree nodes may be done computing and idle, waiting for the broadcast message. So in a realistic scenario, in which not all the nodes participate in a Bcast at the same time, a tree algorithm might not be ideal. Am I wrong?

I guess this problem can be somewhat mitigated by a 'smart' algorithm, which would identify such stalls and re-send the message to the sub-tree in question from another node. However, it seems that in a pessimistic scenario this degrades to a simple one-sends-to-all implementation. And my main reason for asking is to minimize the memory bandwidth overhead associated with sending the message.
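
The stall scenario above can be illustrated with a toy model. This is a sketch under loudly stated assumptions, not a description of what Open MPI does: a binomial tree rooted at rank 0, a fixed `cost` per send, sends from one node issued one after another, and no asynchronous progress (a rank can neither receive nor forward before the time in `ready` at which it enters the Bcast). All function names are hypothetical.

```python
def binomial_children(vrank, size):
    """Children of `vrank` in a binomial broadcast tree rooted at rank 0."""
    step = 1
    while step <= vrank:
        step *= 2
    children = []
    while vrank + step < size:
        children.append(vrank + step)
        step *= 2
    return children


def tree_delivery_times(ready, cost=1.0):
    """Time at which each rank holds the data when using the binomial tree."""
    size = len(ready)
    have = [None] * size
    have[0] = ready[0]                     # the root has the data once it is ready
    queue = [0]
    while queue:
        parent = queue.pop(0)
        for i, child in enumerate(binomial_children(parent, size)):
            sent = have[parent] + (i + 1) * cost
            # A busy child only takes the message once it enters the Bcast.
            have[child] = max(sent, ready[child])
            queue.append(child)
    return have


def flat_delivery_times(ready, cost=1.0):
    """Same model, but the root sends to every other rank directly."""
    have = [ready[0]]
    for rank in range(1, len(ready)):
        have.append(max(ready[0] + rank * cost, ready[rank]))
    return have


# Rank 1 (an interior node of the tree) is busy until t = 10.
ready = [0, 10, 0, 0, 0, 0, 0, 0]
print(tree_delivery_times(ready))   # ranks 3, 5 and 7 wait behind rank 1
print(flat_delivery_times(ready))   # only rank 1 itself is late
```

In this model the ranks below the busy node receive the data at t = 11 or 12 with the tree, versus t = 3, 5 and 7 with direct sends, at the price of the root reading its buffer seven times instead of three.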


Marcin


  George.

On Thu, Mar 21, 2019 at 12:24 PM Joshua Ladd <jladd.m...@gmail.com> wrote:


    Marcin,

    HPC-X implements the MPI BCAST operation by leveraging hardware
    multicast capabilities. Starting with HPC-X v2.3 we introduced a
    new multicast based algorithm for large messages as well.
    Hardware multicast scales as O(1) modulo switch hops. It is the
    most efficient way to broadcast a message in an IB network.

    Hope this helps.

    Best,

    Josh


    On Thu, Mar 21, 2019 at 5:01 AM marcin.krotkiewski
    <marcin.krotkiew...@gmail.com> wrote:

        Thanks, George! So, the function you mentioned is used when I
        turn off HCOLL and use OpenMPI's tuned coll instead. That
        helps a lot. Another thing that gives me pause is that in my
        case the data is sent to the targets asynchronously, or
        rather, it is a 'put' operation in nature, and the targets
        don't know when the data is ready. I guess the tree
        algorithms you mentioned require active participation of all
        nodes, otherwise the algorithm will not progress? Is it
        enough to call any MPI routine to ensure progress, or do I
        have to call the matching Bcast?

        Is anyone from Mellanox here who knows how HCOLL does this
        internally, especially on the EDR architecture? Is there any
        hardware aid?

        Thanks!

        Marcin


        On 3/20/19 5:10 PM, George Bosilca wrote:

        If you have support for FCA then it might happen that the
        collective will use the hardware support. In any case, most
        of the bcast algorithms have a logarithmic behavior, so
        there will be at most O(log(P)) memory accesses on the root.
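
The logarithmic bound can be made concrete with a small count. This is a sketch assuming a textbook binomial tree rooted at rank 0, which is only one of several bcast algorithms and not necessarily the one Open MPI selects for a given message and communicator size. The function names are hypothetical, and each send by the root is counted as one read of its buffer.

```python
def binomial_children(vrank, size):
    """Children of `vrank` in a binomial broadcast tree rooted at rank 0."""
    step = 1
    while step <= vrank:
        step *= 2
    children = []
    while vrank + step < size:
        children.append(vrank + step)
        step *= 2
    return children


def root_reads_binomial(size):
    """Number of times the root sends (reads its buffer) in the binomial tree."""
    return len(binomial_children(0, size))


def root_reads_linear(size):
    """Number of times the root sends when it addresses every peer directly."""
    return size - 1


for p in (8, 64, 1024):
    print(p, root_reads_binomial(p), root_reads_linear(p))
# prints:
# 8 3 7
# 64 6 63
# 1024 10 1023
```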

        If you want to take a look at the code in OMPI to understand
        what function is called in your specific case, head
        to ompi/mca/coll/tuned/ and search for the
        ompi_coll_tuned_bcast_intra_dec_fixed function
        in coll_tuned_decision_fixed.c.

          George.


        On Wed, Mar 20, 2019 at 4:53 AM marcin.krotkiewski
        <marcin.krotkiew...@gmail.com> wrote:

            Hi!

            I'm wondering about the details of the Bcast implementation
            in OpenMPI. I'm specifically interested in IB interconnects,
            but information about other architectures (and OpenMPI in
            general) would also be very useful.

            I am working with a code which sends the same (large)
            message to a bunch of 'neighboring' processes. Somewhat like
            a ghost-zone exchange, but the message is the same for all
            neighbors. Since memory bandwidth is a scarce resource, I'd
            like to make sure we send the message with the fewest
            possible memory accesses.

            Hence the question: what does OpenMPI (and specifically, for
            the IB case, HPC-X) do in such a case? Does it get the
            buffer from memory O(1) times to send it to n peers, with
            the broadcast orchestrated by the hardware? Or does it have
            to read the memory O(n) times? Is it more efficient to use
            Bcast, or is it the same as implementing the operation by n
            distinct send / put operations? Finally, is there any way to
            use the RMA put method with multiple targets, so that I only
            have to read the host memory once, and the switches / HCA
            take care of the rest?

            Thanks a lot for any insights!

            Marcin


            _______________________________________________
            devel mailing list
            devel@lists.open-mpi.org
            https://lists.open-mpi.org/mailman/listinfo/devel


