If all write to the same destination at the same time - yes.  On older systems 
you could start to see degradation around 6 procs, but things held up OK 
further out.  My guess is that you want one such queue per n procs, where n 
might be 8 (have to experiment), so polling costs are low and memory contention 
is manageable.

Rich

----- Original Message -----
From: devel-boun...@open-mpi.org <devel-boun...@open-mpi.org>
To: Open MPI Developers <de...@open-mpi.org>
Sent: Tue Jan 20 06:56:53 2009
Subject: Re: [OMPI devel] RFC: sm Latency

Richard Graham wrote:
> First, the performance improvements look really nice.
> A few questions:
> - How much of an abstraction violation does this introduce? This
> looks like the btl needs to start “knowing” about MPI-level semantics.
> Currently, the btl is purposefully ULP-agnostic. I ask for 2 reasons:
> - you mention having the btl look at the match header (if I understood
> correctly)
> - it is not clear to me what you mean by returning the header to the list if
> the irecv does not complete. If it does not complete, why not just
> pass the header back for further processing, if all this is happening
> at the pml level?
> - The measurements seem to be very dual process specific. Have you
> looked at the impact of these changes on other applications at the
> same process count ? “Real” apps would be interesting, but even hpl
> would be a good start.
> The current sm implementation is aimed only at small smp node counts,
> which were really the only relevant type of system when this code was
> written 5 years ago. For large core counts there is a rather simple
> change that could be put in, and it will give
> you flat scaling for the sort of tests you are running. If you replace
> the fifo’s with a single linked list per process in shared memory, with
> senders to this process adding match envelopes atomically, and with each
> process reading its own linked list (multiple writers and a single reader
> in the non-threaded situation), there will be only one place to poll,
> regardless of the number of procs involved in the run. One still needs
> other optimizations to lower the absolute latency – perhaps what you
> have suggested. If one really has all N procs trying to write to the
> same fifo at once, performance will stink because of contention, but
> most apps don’t have that behaviour.
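>
> Roughly, such a multi-writer/single-reader list could be built on an atomic
> head pointer, as in the following sketch (C11 atomics; the names are invented
> for illustration and this is not OMPI code):
>
>   /* One list per receiving process: many senders push "match envelopes"
>    * with a CAS; the owning process detaches the whole list with a single
>    * atomic exchange and then walks it. */
>   #include <stdatomic.h>
>   #include <stddef.h>
>
>   typedef struct envelope {
>       struct envelope *next;  /* link to the previously pushed envelope */
>       int              src;   /* stand-ins for real match-header fields */
>       int              tag;
>   } envelope_t;
>
>   typedef struct {
>       _Atomic(envelope_t *) head;  /* written by all senders, read by owner */
>   } queue_t;
>
>   /* Sender side: atomically prepend an envelope (multiple writers). */
>   static void queue_push(queue_t *q, envelope_t *e)
>   {
>       envelope_t *old = atomic_load_explicit(&q->head, memory_order_relaxed);
>       do {
>           e->next = old;
>       } while (!atomic_compare_exchange_weak_explicit(
>                    &q->head, &old, e,
>                    memory_order_release, memory_order_relaxed));
>   }
>
>   /* Receiver side: detach everything posted so far (single reader). */
>   static envelope_t *queue_drain(queue_t *q)
>   {
>       return atomic_exchange_explicit(&q->head, NULL, memory_order_acquire);
>   }
>
> In real shared memory the links would have to be segment-relative offsets
> rather than raw pointers, and the LIFO order coming back from queue_drain()
> would need reversing to preserve MPI matching order, but the single polling
> location per process is the point.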
If I remember correctly, you can get a slowdown with the method you mention
above even with a handful (4-6 processes) writing to the same destination.

--td

> Rich
>
>
> On 1/17/09 1:48 AM, "Eugene Loh" <eugene....@sun.com> wrote:
>
>
>
>     ------------------------------------------------------------------------
>     *RFC: sm Latency*
>     *WHAT:* Introducing optimizations to reduce ping-pong latencies
>     over the sm BTL.
>
>     *WHY:* This is a visible benchmark of MPI performance. We can
>     improve shared-memory latency by anywhere from 30% (if hardware
>     latency is the limiting factor) to 2× or more (if MPI software
>     overhead is the limiting factor). At high process counts, the
>     improvement can be 10× or more.
>
>     *WHERE:* Somewhat in the sm BTL, but very importantly also in the
>     PML. Changes can be seen in ssh://www.open-mpi.org/~tdd/hg/fastpath.
>
>     *WHEN:* Upon acceptance. In time for OMPI 1.4.
>
>     *TIMEOUT:* February 6, 2009.
>     ------------------------------------------------------------------------
>     This RFC is being submitted by eugene....@sun.com.
>     *WHY (details)*
>     The sm BTL typically has the lowest hardware latencies of any
>     BTL. Therefore, any OMPI software overhead we otherwise tolerate
>     becomes glaringly obvious in sm latency measurements.
>
>     In particular, MPI pingpong latencies are oft-cited performance
>     benchmarks, popular indications of the quality of an MPI
>     implementation. Competitive vendor MPIs optimize this metric
>     aggressively, both for np=2 pingpongs and for pairwise pingpongs
>     for high np (like the popular HPCC performance test suite).
>
>     Performance metrics reported by HPCC include:
>
>         * MPI_Send()/MPI_Recv() pingpong latency.
>         * MPI_Send()/MPI_Recv() pingpong latency as the number of
>           connections grows.
>         * MPI_Sendrecv() latency.
>
>     The slowdown of latency as the number of sm connections grows
>     becomes increasingly important on large SMPs and ever more
>     prevalent many-core nodes.
>
>     Other MPI implementations, such as Scali and Sun HPC ClusterTools
>     6, introduced such optimizations years ago.
>
>     Performance measurements indicate that the speedups we can expect
>     in OMPI with these optimizations range from 30% (np=2 measurements
>     where hardware is the bottleneck) to 2× (np=2 measurements where
>     software is the bottleneck) to over 10× (large np).
>     *WHAT (details)*
>     Introduce an optimized "fast path" for "immediate" sends and
>     receives. Several actions are recommended here.
>     *1. Invoke the sm BTL sendi (send-immediate) function*
>     Each BTL is allowed to define a "send immediate" (sendi)
>     function. A BTL is not required to do so, however, in which case
>     the PML calls the standard BTL send function.
>
>     A sendi function has already been written for sm, but it has not
>     been used due to insufficient testing.
>
>     The function should be reviewed, commented in, tested, and used.
>
>     The changes are:
>
>         * *File*: ompi/mca/btl/sm/btl_sm.c
>         * *Declaration/Definition*: mca_btl_sm
>         * Comment in the mca_btl_sm_sendi symbol instead of the NULL
>           placeholder so that the already existing sendi function will
>           be discovered and used by the PML.
>         * *Function*: mca_btl_sm_sendi()
>         * Review the existing sm sendi code. My suggestions include:
>               o Drop the test against the eager limit since the PML
>                 calls this function only when the eager limit is
>                 respected.
>               o Make sure the function has no side effects in the case
>                 where it does not complete. See Open Issues, the
>                 final section of this document,
>                 for further discussion of "side effects".
>         * Mostly, I have reviewed the code and believe it's already
>           suitable for use.
>
>           *2. Move the sendi call up higher in the PML*
>           Profiling pingpong tests, we find that not so much time is
>           spent in the sm BTL. Rather, the PML consumes a lot of time
>           preparing a "send request". While these complex data
>           structures are needed to track progress of a long message
>           that will be sent in multiple chunks and progressed over
>           multiple entries to and exits from the MPI library, managing
>           this large data structure for an "immediate" send (one
>           chunk, one call) is overkill. Latency can be reduced
>           noticeably if one bypasses this data structure. This means
>           invoking the sendi function as early as possible in the PML.
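>
>           Roughly, the intended control flow is the following sketch (the
>           names are invented for illustration; this is not the actual ob1
>           code):
>
>   /* Sketch: try the BTL's "send immediate" entry point before any send
>    * request is allocated; fall back to the heavy path otherwise. */
>   #include <stddef.h>
>
>   #define SKETCH_SUCCESS  0
>
>   typedef int (*sendi_fn_t)(void *btl, const void *buf, size_t len,
>                             int dst, int tag);
>
>   struct btl_like {
>       sendi_fn_t sendi;            /* NULL if the BTL has no immediate send */
>   };
>
>   /* Stand-in for the existing request-based send path. */
>   static int heavy_send_path(struct btl_like *btl, const void *buf,
>                              size_t len, int dst, int tag)
>   {
>       (void)btl; (void)buf; (void)len; (void)dst; (void)tag;
>       return SKETCH_SUCCESS;
>   }
>
>   int sketch_pml_send(struct btl_like *btl, const void *buf, size_t len,
>                       int dst, int tag, size_t eager_limit)
>   {
>       /* Fast path: a small, one-chunk, blocking send. */
>       if (btl->sendi != NULL && len <= eager_limit &&
>           btl->sendi(btl, buf, len, dst, tag) == SKETCH_SUCCESS) {
>           return SKETCH_SUCCESS;   /* no send request was ever built */
>       }
>       /* sendi leaves no side effects, so falling through is safe. */
>       return heavy_send_path(btl, buf, len, dst, tag);
>   }
>
>           The only requirement this places on sendi is the no-side-effects
>           rule discussed under Open Issues below.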
>
>           The changes are:
>               o *File*: ompi/mca/pml/ob1/pml_ob1_isend.c
>               o *Function*: mca_pml_ob1_send()
>               o As soon as we enter the PML send function, try to call
>                 the BTL sendi function. If this fails for whatever
>                 reason, continue with the traditional PML send code
>                 path. If it succeeds, then exit the PML and return up
>                 to the calling layer without ever having wrestled
>                 with the PML send-request data structure.
>               o For better software management, the attempt to find
>                 and use a BTL sendi function can be organized into a
>                 new mca_pml_ob1_sendi() function.
>               o *File*: ompi/mca/pml/ob1/pml_ob1_sendreq.c
>               o *Function*: mca_pml_ob1_send_request_start_copy()
>               o Remove this attempt to call the BTL sendi function,
>                 since we've already tried to do so higher up in the PML.
>                 *3. Introduce a BTL recvi call*
>                 While optimizing the send side of a pingpong
>                 operation is helpful, it is less than half the job. At
>                 least as many savings are possible on the receive side.
>
>                 Corresponding to what we've done on the send side, on
>                 the receive side we can attempt, as soon as we've
>                 entered the PML, to call a BTL recvi
>                 (receive-immediate) function, bypassing the creation
>                 of a complex "receive request" data structure that is
>                 not needed if the receive can be completed immediately.
>
>                 Further, we can perform directed polling. OMPI
>                 pingpong latencies grow significantly as the number of
>                 sm connections increases, while competitors (Scali, in
>                 any case) show absolutely flat latencies with
>                 increasing np. The recvi function could check one
>                 connection for the specified receive and exit quickly
>                 if that message is found.
>
>                 A BTL is granted considerable latitude in the proposed
>                 recvi function. The principal requirement is that
>                 recvi /either/ completes the specified receive
>                 completely /or else/ behaves as if the function had
>                 not been called at all. (That is, one should be able to
>                 revert to the traditional code path without having to
>                 worry about any recvi side effects. So, for example,
>                 if the recvi function encounters any fragments being
>                 returned to the process, it is permitted to return
>                 those fragments to the free list.)
>
>                 While those are the "hard requirements" for recvi,
>                 there are also some loose guidelines. Mostly, it is
>                 understood that recvi should return "quickly" (a loose
>                 term to be interpreted by the BTL). If recvi can
>                 quickly complete the specified receive, great! If not,
>                 it should return control to the PML, which can then
>                 execute the traditional code path, which can handle
>                 long messages (multiple chunks, multiple entries into
>                 the MPI library) and execute other "progress" functions.
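>
>                 Roughly, a recvi obeying these rules could look like the
>                 following sketch (invented names and a toy FIFO, not the
>                 sm BTL's actual data structures):
>
>   /* Sketch: directed polling with no side effects.  recvi looks only at
>    * the queue coming from the expected sender, peeks at the head entry
>    * without popping it, and either completes the receive or leaves
>    * everything untouched. */
>   #include <string.h>
>   #include <stddef.h>
>
>   typedef struct {
>       int    src, tag;             /* stand-in for the PML match header */
>       size_t len;
>       char   payload[256];
>   } frag_t;
>
>   typedef struct {                 /* one single-reader FIFO */
>       frag_t slots[64];
>       size_t head, tail;           /* head: next to read; tail: next to write */
>   } fifo_t;
>
>   /* Peek without popping -- the analogue of the ompi_fifo_probe_tail()
>    * proposed below. */
>   static frag_t *fifo_probe(fifo_t *f)
>   {
>       return (f->head == f->tail) ? NULL : &f->slots[f->head % 64];
>   }
>
>   static void fifo_pop(fifo_t *f) { f->head++; }
>
>   /* Returns 0 if the receive completed, 1 if the caller must fall back
>    * to the traditional path (nothing has been consumed). */
>   int sketch_recvi(fifo_t *from_peer, int src, int tag,
>                    void *buf, size_t maxlen)
>   {
>       frag_t *frag = fifo_probe(from_peer);
>
>       if (frag == NULL || frag->src != src || frag->tag != tag ||
>           frag->len > maxlen) {
>           return 1;
>       }
>       memcpy(buf, frag->payload, frag->len);  /* complete the receive... */
>       fifo_pop(from_peer);                    /* ...and only then dequeue */
>       return 0;
>   }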
>
>                 The changes are:
>                     + *File*: ompi/mca/btl/btl.h
>                     + In this file, we add a typedef declaration for
>                       what a generic recvi should look like:
>                     + typedef int (*mca_btl_base_module_recvi_fn_t)();
>                     + We also add a btl_recvi field so that a BTL can
>                       register its recvi function, if any.
>                     + *File*:
>                     + ompi/mca/btl/elan/btl_elan.c
>                     + ompi/mca/btl/gm/btl_gm.c
>                     + ompi/mca/btl/mx/btl_mx.c
>                     + ompi/mca/btl/ofud/btl_ofud.c
>                     + ompi/mca/btl/openib/btl_openib.c
>                     + ompi/mca/btl/portals/btl_portals.c
>                     + ompi/mca/btl/sctp/btl_sctp.c
>                     + ompi/mca/btl/self/btl_self.c
>                     + ompi/mca/btl/sm/btl_sm.c
>                     + ompi/mca/btl/tcp/btl_tcp.c
>                     + ompi/mca/btl/template/btl_template.c
>                     + ompi/mca/btl/udapl/btl_udapl.c
>                     + Each BTL must add a recvi field to its module.
>                       In most cases, BTLs will not define a recvi
>                       function, and the field will be set to NULL.
>                     + *File*: ompi/mca/btl/sm/btl_sm.c
>                     + *Function*: mca_btl_sm_recvi()
>                     + For the sm BTL, we set the field to the name of
>                       the BTL's recvi function: mca_btl_sm_recvi. We
>                       also add code to define the behavior of the
>                       function.
>                     + *File*: ompi/mca/btl/sm/btl_sm.h
>                     + *Prototype*: mca_btl_sm_recvi()
>                     + We also add a prototype for the new function.
>                     + *File*: ompi/mca/pml/ob1/pml_ob1_irecv.c
>                     + *Function*: mca_pml_ob1_recv()
>                     + As soon as we enter the PML, we try to find and
>                       use a BTL's recvi function. If we succeed, we
>                       can exit the PML without ever having built the
>                       heavy-duty PML receive-request data structure.
>                       If we fail, we simply revert to the traditional
>                       PML receive code path, without having to worry
>                       about any side effects that the failed recvi
>                       might have left.
>                     + It is helpful to contain the recvi attempt in a
>                       new mca_pml_ob1_recvi() function, which we add.
>                     + *File*: ompi/class/ompi_fifo.h
>                     + *Function*: ompi_fifo_probe_tail()
>                     + We don't want recvi to leave any side effects if
>                       it encounters a message it is not prepared to
>                       handle. We therefore need to be able to see
>                       what is on a FIFO without popping that entry off
>                       the FIFO, so we add this new function, which
>                       probes the FIFO without disturbing it.
>                       *4. Introduce an "immediate" data convertor*
>                       One of our aims here is to reduce latency by
>                       bypassing expensive PML send and receive request
>                       data structures. Again, these structures are
>                       useful when we intend to complete a message over
>                       multiple chunks and multiple MPI library
>                       invocations, but are overkill for a message that
>                       can be completed all at once.
>
>                       The same is true of data convertors. Convertors
>                       pack user data into shared-memory buffers or
>                       unpack them on the receive side. Convertors
>                       allow a message to be sent in multiple chunks,
>                       over the course of multiple unrelated MPI calls,
>                       and for noncontiguous datatypes. These
>                       sophisticated data structures are overkill in
>                       some important cases, such as messages that are
>                       handled in a single chunk and in a single MPI
>                       call and consist of a single contiguous block of
>                       data.
>
>                       While data convertors are not typically too
>                       expensive, for shared-memory latency, where all
>                       other costs have been pared back to a minimum,
>                       convertors become noticeable -- around 10%.
>
>                       Therefore, we recognize special cases where we
>                       can have barebones, minimal, data convertors. In
>                       these cases, we initialize the convertor
>                       structure minimally -- e.g., a buffer address, a
>                       number of bytes to copy, and a flag indicating
>                       that all other fields are uninitialized. If this
>                       is not possible (e.g., because a non-contiguous
>                       user-derived datatype is being used), the
>                       "immediate" send or receive uses data convertors
>                       normally.
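>
>                       A minimal sketch of such an "immediate" convertor
>                       (invented names; the real convertor interface is
>                       much richer than this):
>
>   /* Sketch: an "immediate" convertor records only what a one-shot
>    * contiguous copy needs; anything fancier falls back to the full
>    * convertor machinery. */
>   #include <string.h>
>   #include <stdbool.h>
>   #include <stddef.h>
>
>   typedef struct {
>       const void *addr;            /* user buffer */
>       size_t      bytes;           /* total contiguous size */
>       bool        immediate;       /* true: all other convertor state unset */
>   } mini_convertor_t;
>
>   /* Succeeds only for contiguous data; otherwise the caller must use the
>    * normal convertor path. */
>   static bool mini_convertor_init(mini_convertor_t *cv, const void *addr,
>                                   size_t elem_size, size_t count,
>                                   bool contiguous)
>   {
>       if (!contiguous) {
>           return false;
>       }
>       cv->addr      = addr;
>       cv->bytes     = elem_size * count;
>       cv->immediate = true;
>       return true;
>   }
>
>   /* "Packing" degenerates to a single memcpy into the shared-memory slot. */
>   static size_t mini_convertor_pack(const mini_convertor_t *cv,
>                                     void *dst, size_t max)
>   {
>       size_t n = cv->bytes < max ? cv->bytes : max;
>       memcpy(dst, cv->addr, n);
>       return n;
>   }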
>
>                       The changes are:
>                           # *File*: ompi/datatype/convertor.h
>                           # First, we add a new flag to the convertor
>                             flags,
>                                 #define CONVERTOR_IMMEDIATE 0x10000000
>                             to identify a data convertor that has been
>                             initialized only minimally.
>                           # Further, we add three new functions:
>                                 * ompi_convertor_immediate(): try to
>                                   form an "immediate" convertor
>                                 * ompi_convertor_immediate_pack(): use
>                                   an "immediate" convertor to pack
>                                 * ompi_convertor_immediate_unpack():
>                                   use an "immediate" convertor to unpack
>                           # *File*: ompi/mca/btl/sm/btl_sm.c
>                           # *Function*: mca_btl_sm_sendi and
>                             mca_btl_sm_recvi
>                           # Use the "immediate" convertor routines to
>                             pack/unpack.
>                           # *File*: ompi/mca/pml/ob1/pml_ob1_isend.c
>                             and ompi/mca/pml/ob1/pml_ob1_irecv.c
>                           # Have the PML fast path try to construct an
>                             "immediate" convertor.
>                             *5. Introduce an "immediate" MPI_Sendrecv()*
>                             The optimizations described here should
>                             be extended to MPI_Sendrecv() operations.
>                             In particular, while MPI_Send() and
>                             MPI_Recv() optimizations improve HPCC
>                             "pingpong" latencies, we need
>                             MPI_Sendrecv() optimizations to improve
>                             HPCC "ring" latencies.
>
>                             One challenge is the current OMPI MPI/PML
>                             interface. Today, the OMPI MPI layer
>                             breaks a Sendrecv call up into
>                             Irecv/Send/Wait. This would seem to defeat
>                             fast-path optimizations at least for the
>                             receive; a minimal sketch of that
>                             decomposition appears after the list of
>                             options below. Some options include:
>                                 * allow the MPI layer to call "fast
>                                   path" operations
>                                 * have the PML layer provide a
>                                   Sendrecv interface
>                                 * have the MPI layer emit
>                                   Isend/Recv/Wait and see how
>                                   effectively one can optimize the
>                                   Isend operation in the PML for the
>                                   "immediate" case
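>
>                             For reference, here is a minimal sketch of
>                             that Irecv/Send/Wait decomposition (argument
>                             handling simplified, datatype fixed to
>                             MPI_BYTE; illustrative only, not the OMPI
>                             source). Because the receive is posted as a
>                             nonblocking request before the send, it
>                             cannot take an "immediate" receive fast path:
>
>   #include <mpi.h>
>
>   int sendrecv_as_irecv_send_wait(void *sbuf, int scount, int dst, int stag,
>                                   void *rbuf, int rcount, int src, int rtag,
>                                   MPI_Comm comm, MPI_Status *status)
>   {
>       MPI_Request req;
>
>       MPI_Irecv(rbuf, rcount, MPI_BYTE, src, rtag, comm, &req); /* posted early */
>       MPI_Send(sbuf, scount, MPI_BYTE, dst, stag, comm);
>       return MPI_Wait(&req, status);  /* receive completes here, via the
>                                        * general (non-immediate) path */
>   }
>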
>                             *Performance Measurements: Before Optimization*
>                             More measurements are desirable, but here
>                             is a sampling of data from platforms that
>                             I happened to have access to. This data
>                             characterizes OMPI today, without fast-path
>                             optimizations.
>                             *OMPI versus Other MPIs*
>                             Here are pingpong latencies, in μsec,
>                             measured with the OSU latency test for 0
>                             and 8 bytes.
>
>                                      0-byte  8-byte
>
>                             OMPI       0.74    0.84  μsec
>                             MPICH      0.70    0.77
>
>                             We see OMPI lagging MPICH.
>
>                             Scali and HP MPI are presumably
>                             /considerably/ faster, but I have no
>                             recent data.
>
>                             Among other things, one can see that there
>                             is about a 10% penalty for invoking data
>                             convertors.
>                             *Scaling with Process Count*
>                             Here are HPCC pingpong latencies from a
>                             different, older platform. Though only
>                             two processes participate in the pingpong,
>                             the HPCC test reports the latency for
>                             different numbers of processes in the job.
>                             We see that OMPI performance slows
>                             dramatically as the number of processes is
>                             increased. Scali (data not available) does
>                             not show such a slowdown.
>
>                             np min avg max
>
>                             2 2.688 2.719 2.750 usec
>                             4 2.812 2.875 3.000
>                             6 2.875 3.050 3.250
>                             8 2.875 3.299 3.625
>                             10 2.875 3.447 3.812
>                             12 3.063 3.687 4.375
>                             16 2.687 4.093 5.063
>                             20 2.812 4.492 6.000
>                             24 3.125 5.026 6.562
>                             28 3.250 5.326 7.250
>                             32 3.500 5.830 8.375
>                             36 3.750 6.199 8.938
>                             40 4.062 6.753 10.187
>                             The data show large min-max variations in
>                             latency. These variations happen to depend
>                             on sender and receiver ranks. Here are
>                             latencies (rounded down to the nearest
>                             μsec) for the np=40 case as a function of
>                             sender and receiver rank:
>
>                             [40x40 table omitted: np=40 pingpong
>                             latencies, in μsec, as a function of the
>                             ranks of the two processes ("rank of one
>                             process" across, "rank of the other
>                             process" down). Latencies are roughly 9-10
>                             μsec when both ranks are low, falling to
>                             roughly 4-5 μsec when both ranks are high.]
>                             We see that there is a strong dependence
>                             on process rank. Presumably, this is due
>                             to our polling loop. That is, even if we
>                             receive our message, we still have to poll
>                             the higher numbered ranks before we
>                             complete the receive operation.
>                             *Performance Measurements: After Optimization*
>                             We consider three metrics:
>                                 * HPCC "pingpong" latency
>                                 * OSU latency (0 bytes)
>                                 * OSU latency (8 bytes)
>                             We report data for:
>                                 * OMPI "out of the box"
>                                 * after implementation of steps 1-2
>                                   (send side)
>                                 * after implementation of steps 1-3
>                                   (send and receive sides)
>                                 * after implementation of steps 1-4
>                                   (send and receive sides, plus data
>                                   convertor)
>                             The data are from machines that I just
>                             happened to have available.
>
>                             There is a bit of noise in these results,
>                             but the implications, based on these and
>                             other measurements, are:
>                                 * There is some improvement from the
>                                   send side.
>                                 * There is more improvement from the
>                                   receive side.
>                                 * The data convertor improvements help
>                                   a little more (a few percent) for
>                                   non-null messages.
>                                 * The degree of improvement depends on
>                                   how fast the CPU is relative to the
>                                   memory -- that is, how important
>                                   software overheads are versus
>                                   hardware latency.
>                                       o If the CPU is fast (and
>                                         hardware latency is the
>                                         bottleneck), these
>                                         improvements are less -- say,
>                                         20-30%.
>                                       o If the CPU is slow (and
>                                         software costs are the
>                                         bottleneck), the improvements
>                                         are more dramatic -- nearly a
>                                         factor of 2 for non-null
>                                         messages.
>                                 * As np is increased, latency stays
>                                   flat. This can represent a 10× or
>                                   more improvement over out-of-the-box
>                                   OMPI.
>                             *V20z*
>                             Here are results for a V20z
>                             (burl-ct-v20z-11):
>
>                                           HPCC  OSU0  OSU8
>
>                             out of box     838   770   850  nsec
>                             Steps 1-2      862   770   860
>                             Steps 1-3      670   610   670
>                             Steps 1-4      642   580   610
>                             *F6900*
>                             Here are np=2 results from a 1.05-GHz
>                             (1.2?) UltraSPARC-IV F6900 server:
>
>                                           HPCC  OSU0  OSU8
>
>                             out of box    3430  2770  3340  nsec
>                             Steps 1-2     2940  2660  3090
>                             Steps 1-3     1854  1650  1880
>                             Steps 1-4     1660  1640  1750
>                             Here is the dependence on process count
>                             using HPCC:
>
>                             OMPI
>                             "out of the box" optimized
>                             comm ----------------- -----------------
>                             size min avg max min avg max
>
>                             2 2688 2719 2750 1750 1781 1812 nsec
>                             4 2812 2875 3000 1750 1802 1812
>                             6 2875 3050 3250 1687 1777 1812
>                             8 2875 3299 3625 1687 1773 1812
>                             10 2875 3447 3812 1687 1789 1812
>                             12 3063 3687 4375 1687 1796 1813
>                             16 2687 4093 5063 1500 1784 1875
>                             20 2812 4492 6000 1687 1788 1875
>                             24 3125 5026 6562 1562 1776 1875
>                             28 3250 5326 7250 1500 1764 1813
>                             32 3500 5830 8375 1562 1755 1875
>                             36 3750 6199 8938 1562 1755 1875
>                             40 4062 6753 10187 1500 1742 1812
>                             Note:
>                                 * At np=2, these optimizations lead to
>                                   a 2× improvement in shared-memory
>                                   latency.
>                                 * Non-null messages incur more than a
>                                   10% penalty, which is largely
>                                   addressed by our data-convertor
>                                   optimization.
>                                 * At larger np, we maintain our fast
>                                   performance while OMPI "out of the
>                                   box" keeps slowing down more and
>                                   more.
>                             *M9000*
>                             Here are results for a 128-core M9000. I
>                             think the system has:
>                                 * 2 hardware threads per core (but we
>                                   only use one hardware thread per core)
>                                 * 4 cores per socket
>                                 * 4 sockets per board
>                                 * 4 boards per (half?)
>                                 * 2 (halves?) per system
>                             As one separates the sender and receiver,
>                             hardware latency increases. Here is the
>                             hierarchy:
>
>                                             latency (nsec)       bandwidth (Mbyte/sec)
>                                           out-of-box  fastpath   out-of-box  fastpath
>                             (on-socket?)      810        480        2000       2000
>                             (on-board?)      2050       1820        1900       1900
>                             (half?)          3030       2840        1680       1680
>                                              3150       2960        1660       1660
>                             Note:
>                                 * Latency improves by some hundreds of
>                                   nsec with fastpath.
>                                 * This latency improvement is striking
>                                   when the hardware latency is small,
>                                   but less noticeable as the
>                                   hardware latency increases.
>                                 * Bandwidth is not very sensitive to
>                                   hardware latency (due to prefetch)
>                                   and not at all to fast-path
>                                   optimizations.
>                             Here are HPCC pingpong latencies for
>                             increasing process counts:
>
>                             out-of-box fastpath
>                             np ----------------- -----------------
>                             min avg max min avg max
>
>                             2 812 812 812 499 499 499
>                             4 874 921 999 437 494 562
>                             8 937 1847 2624 437 1249 1874
>                             16 1062 2430 2937 437 1557 1937
>                             32 1562 3850 5437 375 2211 2875
>                             64 2687 8329 15874 437 2535 3062
>                             80 3499 16854 41749 374 2647 3437
>                             96 3812 31159 100812 374 2717 3437
>                             128 5187 125774 335187 437 2793 3499
>                             The improvements are tremendous:
>                                 * At low np, latencies are low since
>                                   sender and receiver can be
>                                   colocated. Nevertheless, fast-path
>                                   optimizations provided a nearly 2×
>                                   improvement.
>                                 * As np increases, fast-path latency
>                                   also increases, but this is due to
>                                   higher hardware latencies. Indeed,
>                                   the "min" numbers even drop a
>                                   little. The "max" fast-path numbers
>                                   basically only represent the
>                                   increase in hardware latency.
>                                 * As np increases, OMPI "out of the
>                                   box" latency suffers
>                                   catastrophically. Not only is there
>                                   the issue of more connections to
>                                   poll, but the polling behaviors of
>                                   non-participating processes wreak
>                                   havoc on the performance of measured
>                                   processes.
>                                 * We can separate the two sources of
>                                   latency degradation by putting the
>                                   np-2 non-participating processes to
>                                   sleep. In that case, latency only
>                                   rises to about 10-20 μsec. So,
>                                   polling of many connections causes a
>                                   substantial rise in latency, while
>                                   the disturbance of hard-poll loops
>                                   on system performance is responsible
>                                   for even more degradation.
>                                   Actually, even bandwidth benefits:
>
>                                   out-of-box fastpath
>                                   np -------------- -------------
>                                   min avg max min avg max
>
>                                   2 2015 2034 2053 2028 2039 2051
>                                   4 2002 2043 2077 1993 2032 2065
>                                   8 1888 1959 2035 1897 1969 2088
>                                   16 1863 1934 2046 1856 1937 2066
>                                   32 1626 1796 2038 1581 1798 2068
>                                   64 1557 1709 1969 1591 1729 2084
>                                   80 1439 1619 1902 1561 1706 2059
>                                   96 1281 1452 1722 1500 1689 2005
>                                   128 677 835 1276 893 1671 1906
>                                   Here, we see that even bandwidth
>                                   suffers "out of the box" as the
>                                   number of hard-spinning processes
>                                   increases. Note the degradation in
>                                   "out-of-box" average bandwidths as
>                                   np increases. In contrast, the
>                                   "fastpath" average holds up well.
>                                   (The np=128 min fastpath number 893
>                                   Mbyte/sec is poor, but analysis
>                                   shows it to be a measurement outlier.)
>                                   *MPI_Sendrecv()*
>                                   We should also get these
>                                   optimizations into MPI_Sendrecv() in
>                                   order to speed up the HPCC "ring"
>                                   results. E.g., here are latencies in
>                                   μsecs for a performance measurement
>                                   based on HPCC "ring" tests.
>
>                                   ==========================================
>                                   np=64 latency (μsec)        natural  random
>
>                                   "out of box"                  11.7    10.9
>                                   fast path                      8.3     6.2
>                                   fast path and 100 warmups      3.5     3.6
>                                   ==========================================
>                                   np=128 latency (μsec)       natural  random
>
>                                   "out of box"                 242.9   226.1
>                                   fast path                     56.6    37.0
>                                   fast path and 100 warmups      4.2     4.1
>                                   ==========================================
>                                   There happen to be two problems here:
>                                       o We need fast-path
>                                         optimizations in
>                                         MPI_Sendrecv() for improved
>                                         performance.
>                                       o The MPI collective operation
>                                         preceding the "ring"
>                                         measurement has "ragged" exit
>                                         times. So, the "ring" timing
>                                         starts well before all of the
>                                         processes have entered that
>                                         measurement. This is a
>                                         separate OMPI performance
>                                         problem that must be handled
>                                         as well for good HPCC results.
>                                   *Open Issues*
>                                   Here are some open issues:
>                                       o *Side effects*. Should the
>                                         sendi and recvi functions
>                                         leave any side effects if they
>                                         do not complete the specified
>                                         operation?
>                                       o To my taste, they should not.
>                                       o Currently, however, the sendi
>                                         function is expected to
>                                         allocate a descriptor if it
>                                         can, even if it cannot
>                                         complete the entire send
>                                         operation.
>                                       o *recvi: BTL and match
>                                         header*. An incoming message
>                                         starts with a "match header",
>                                         with such data as MPI source
>                                         rank, MPI communicator, and
>                                         MPI tag for performing MPI
>                                         message matching. Presumably,
>                                         the BTL knows nothing about
>                                         this header. Message matching
>                                         is performed, for example, via
>                                         PML callback functions. We are
>                                         aggressively trying to
>                                         optimize this code path,
>                                         however, so we should consider
>                                         alternatives to that approach.
>                                       o One alternative is simply for
>                                         the BTL to perform a
>                                         byte-by-byte comparison
>                                         between the received header
>                                         and the specified header. The
>                                         PML already tells the BTL how
>                                         many bytes are in the header.
>                                       o One problem with this approach
>                                         is that the fast path would
>                                         not be able to support the
>                                         wildcard tag MPI_ANY_TAG.
>                                       o Further, it leaves open the
>                                         question of how one extracts
>                                         information (such as source or
>                                         tag) from this header for the
>                                         MPI_Status structure.
>                                       o We can imagine a variety of
>                                         solutions here, but so far
>                                         we've implemented a very
>                                         simple (even if
>                                         architecturally distasteful)
>                                         solution: we hardwire
>                                         information (previously
>                                         private to the PML) about the
>                                         match header into the BTL.
>                                       o That approach can be replaced
>                                         with other solutions.
>                                       o *MPI_Sendrecv() support*. As
>                                         discussed earlier, we should
>                                         support fast-path
>                                         optimizations for "immediate"
>                                         send-receive operations.
>                                         Again, this may entail some
>                                         movement of current OMPI
>                                         architectural boundaries.
>                                         Other optimizations that are
>                                         needed for good HPCC results
>                                         include:
>                                             + reducing the degradation
>                                               due to hard spin waits
>                                             + improving the
>                                               performance of
>                                               collective operations
>                                               (which "artificially"
>                                               degrade HPCC "ring" test
>                                               results)
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>   

