Josh and Valentin, thanks a lot for your answers! Your understanding of my case is essentially correct, but let me refine it briefly.

In the case at hand I am running a 3D solver, which computes on grids where each subdomain has at least 26 neighbors. Depending on the grid refinement etc., there could be many more than that. The code runs with a single MPI rank per node, with OpenMP for intra-node parallelism, and some of the neighbors may well reside on the same node. Admittedly, those do not need to participate in the Bcast at all, because the threads can access each other's memory directly.

So the size of the communicator might be small in practice, but up to ~20 (?) in extreme cases. I will have to evaluate that by experiment, and it also depends on the problem solved. But my main concern here is not the efficiency of the communication as such, but the local CPU memory bandwidth. If I call Isend, say, 20 times, then with each send the IB hardware has to read the message from main memory. My hope was to hand the message to the hardware once, and be done with it. From this point of view it does not really matter how the hardware handles the message, what the complexity of the Bcast / Mcast algorithm is, or whether the communication will be faster than in the trivial case. I just don't want to pollute the local memory bus with the same data over and over again.
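
To make the comparison concrete, here is roughly what the two variants look like in MPI terms (a minimal sketch; neigh_comm, nbr[], nnbr, count and root are placeholders for my actual setup):

    /* Variant 1: n point-to-point sends. With each Isend the HCA
     * presumably has to read buf from host memory again. */
    MPI_Request req[MAX_NBR];
    for (int i = 0; i < nnbr; i++)
        MPI_Isend(buf, count, MPI_DOUBLE, nbr[i], tag, comm, &req[i]);
    MPI_Waitall(nnbr, req, MPI_STATUSES_IGNORE);

    /* Variant 2: one Bcast on a small communicator containing me and
     * my neighbors; ideally a single host-memory read, if the
     * transport maps it onto hardware multicast. */
    MPI_Bcast(buf, count, MPI_DOUBLE, root, neigh_comm);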

About the RMA Put: I was wondering if it is possible to achieve the same result by calling some *_put function (not necessarily MPI_Put; it could be UCX or anything else) with multiple destinations, which reads the data from host memory once and hands it to the hardware. The hardware then propagates it to all receivers. A kind of RMA broad-write, or multi-write. I am not sure anything like that even exists, but it would be a perfect fit.
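
In standard MPI the closest thing I can see is a loop of Puts inside a single access epoch (a sketch; win, nbr[], nnbr and disp are placeholders), which, I presume, still reads the source buffer once per target, unlike the hypothetical single-read multi-write I am after:

    /* One put per neighbor inside a single passive-target epoch.
     * The source buffer is presumably still read once per target;
     * the hypothetical "multi-put" would read it once and let the
     * hardware fan it out to all destinations. */
    MPI_Win_lock_all(0, win);
    for (int i = 0; i < nnbr; i++)
        MPI_Put(buf, count, MPI_DOUBLE, nbr[i],
                disp, count, MPI_DOUBLE, win);
    MPI_Win_flush_all(win);   /* complete all puts at origin and targets */
    MPI_Win_unlock_all(win);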


On 3/22/19 8:22 AM, Valentin Petrov wrote:


One more comment regarding IB Mcast and its usage. The key advantage of IB Mcast (enabled with HPC-X + hcoll when the user calls the MPI_Bcast collective) is nearly constant scaling, so it gives the most advantage when many nodes participate in a collective at the same time. By default, hcoll will use the mcast-based algorithm when the MPI communicator on which MPI_Bcast is launched spans at least 8 nodes (regardless of the number of ranks per node). As far as I understand your case, you have a neighbor-exchange pattern, i.e. a limited number of neighbors (usually 4 in a regular 2D grid, or 6 in 3D) that you need to send the data to. If so, your application most likely has many small (5-7 rank) communicators, as in the sketch below. IB mcast is unlikely to give much advantage in this case (the scale is small). Moreover, if you run multiple ranks per node, a lot of the communication happens entirely in-node, where the data goes through shared memory without touching IB.
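
For illustration, this is the kind of small per-neighborhood communicator I have in mind (a sketch; my_rank, nbr[] and nnbr are placeholders), which with 5-7 ranks stays well below the 8-node mcast threshold:

    /* Sketch: one small communicator per neighborhood, built from
     * the sender plus its neighbor ranks. hcoll would only consider
     * the mcast algorithm if such a communicator spanned >= 8 nodes,
     * which a 5-7 rank neighborhood typically does not. */
    int members[MAX_NBR + 1];
    members[0] = my_rank;
    for (int i = 0; i < nnbr; i++) members[i + 1] = nbr[i];

    MPI_Group world_grp, neigh_grp;
    MPI_Comm neigh_comm;
    MPI_Comm_group(MPI_COMM_WORLD, &world_grp);
    MPI_Group_incl(world_grp, nnbr + 1, members, &neigh_grp);
    MPI_Comm_create_group(MPI_COMM_WORLD, neigh_grp, 0 /* tag */, &neigh_comm);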

Finally, if you want to do RDMA, it is possible. You could use MPI_Put/MPI_Get. An alternative is OpenSHMEM, which could be a good fit for this case (it is the standard designed around one-sided put/get semantics), but that will require some study if you are not familiar with it.
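
To give a flavor of the OpenSHMEM one-sided semantics, a minimal sketch (dst must live in the symmetric heap; src, count and the nbr_pe[] neighbor list are placeholders):

    #include <shmem.h>

    /* Minimal sketch: one-sided puts into each neighbor's symmetric
     * buffer; no receive call is needed on the targets. */
    shmem_init();
    double *dst = shmem_malloc(count * sizeof *dst);
    for (int i = 0; i < nnbr; i++)
        shmem_double_put(dst, src, count, nbr_pe[i]);
    shmem_quiet();      /* wait for all outstanding puts to complete */
    shmem_finalize();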

From: devel On Behalf Of George Bosilca
Sent: Thursday, March 21, 2019 7:31 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] Memory performance with Bcast


I am not sure I understand your question. A bcast is a collective operation that must be posted by all participants. Independently of the level at which the bcast is serviced, if some of the participants have not posted their participation in the collective, only partial progress can be made.
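
Concretely, every rank of the communicator has to post the matching call (a sketch):

    /* Every rank in comm must post the same call; ranks other than
     * root receive into buf. There is no fire-and-forget variant:
     * a rank that never posts the Bcast stalls (part of) the
     * distribution tree. */
    MPI_Bcast(buf, count, MPI_DOUBLE, root, comm);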


On Thu, Mar 21, 2019 at 12:24 PM Joshua Ladd wrote:


    HPC-X implements the MPI BCAST operation by leveraging hardware
    multicast capabilities. Starting with HPC-X v2.3 we introduced a
    new multicast-based algorithm for large messages as well. Hardware
    multicast scales as O(1) modulo switch hops. It is the most
    efficient way to broadcast a message in an IB network.

    Hope this helps.



    On Thu, Mar 21, 2019 at 5:01 AM marcin.krotkiewski wrote:

        Thanks, George! So, the function you mentioned is used when I
        turn off HCOLL and use Open MPI's tuned coll instead. That
        helps a lot. Another thing that makes me wonder: in my case
        the data is sent to the targets asynchronously, or rather, it
        is a 'put' operation in nature, and the targets don't know
        when the data is ready. I guess the tree algorithms you
        mentioned require active participation of all nodes, otherwise
        the algorithm will not progress? Is it enough to call any MPI
        routine to assure progression, or do I have to call the
        matching Bcast?

        Is anyone from Mellanox here who knows how HCOLL does this
        internally, especially on the EDR architecture? Is there any
        hardware aid?



        On 3/20/19 5:10 PM, George Bosilca wrote:

            If you have support for FCA then it might happen that the
            collective will use the hardware support. In any case,
            most of the bcast algorithms have a logarithmic behavior,
            so there will be at most O(log(P)) memory accesses on the
            root.

            If you want to take a look at the code in OMPI to
            understand what function is called in your specific case
            head to ompi/mca/coll/tuned/ and search for the
            ompi_coll_tuned_bcast_intra_dec_fixed function
            in coll_tuned_decision_fixed.c.


            On Wed, Mar 20, 2019 at 4:53 AM marcin.krotkiewski wrote:


                I'm wondering about the details of the Bcast
                implementation in Open MPI. I'm specifically
                interested in IB interconnects, but information about
                other architectures (and Open MPI in general) would
                also be very useful.

                I am working with a code that sends the same (large)
                message to a bunch of 'neighboring' processes,
                somewhat like a ghost-zone exchange, but with the
                message being the same for all neighbors. Since
                memory bandwidth is a scarce resource, I'd like to
                make sure we send the message with the fewest
                possible memory accesses.

                Hence the question: what does Open MPI (and
                specifically, in the IB case, HPC-X) do here? Does it
                read the buffer from memory O(1) times to send it to
                n peers, with the broadcast orchestrated by the
                hardware? Or does it have to read the memory O(n)
                times? Is it more efficient to use Bcast, or is it
                the same as implementing the operation with n
                distinct send / put operations? Finally, is there any
                way to use the RMA put method with multiple targets,
                so that I only have to read the host memory once, and
                the switches / HCA take care of the rest?

                Thanks a lot for any insights!

