First, the performance improvements look really nice. A few questions:

- How much of an abstraction violation does this introduce? It looks as if the btl needs to start "knowing" about MPI-level semantics, whereas the btl today is purposefully ULP-agnostic. I ask for two reasons:
  - You mention having the btl look at the match header (if I understood correctly).
  - It is not clear to me what you mean by returning the header to the list if the irecv does not complete. If it does not complete, why not just pass the header back for further processing, if all of this is happening at the pml level?
- The measurements seem to be very specific to two-process runs. Have you looked at the impact of these changes on other applications at the same process counts? "Real" apps would be interesting, but even HPL would be a good start.

The current sm implementation is aimed only at small SMP node counts, which were really the only relevant type of system when this code was written five years ago. For large core counts there is a rather simple change, easy to implement, that will give you flat scaling for the sort of tests you are running. If you replace the FIFOs with a single linked list per process in shared memory, with senders adding match envelopes to that list atomically and each process reading only its own list (multiple writers and a single reader in the non-threaded situation), there will be only one place to poll, regardless of the number of procs involved in the run. One still needs other optimizations to lower the absolute latency -- perhaps what you have suggested. If all N procs really do try to write to the same list at once, performance will suffer badly from contention, but most apps don't have that behaviour.
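The following is a minimal, self-contained sketch of that idea, purely for illustration: senders push match envelopes with a compare-and-swap, and only the owning process drains its list. The envelope_t/inbound_list_t types and function names are invented; a real version would live in the shared-memory segment and use offsets plus the OPAL atomics rather than raw pointers and C11 <stdatomic.h>.

/*
 * Toy model of a per-process inbound list: many senders push match
 * envelopes atomically (multiple writers), and only the owning process
 * ever consumes the list (single reader), so there is a single place to
 * poll no matter how many peers exist.
 */
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

typedef struct envelope {
    struct envelope *next;  /* link used only while on the inbound list */
    int              src;   /* sender rank    */
    int              tag;   /* MPI tag        */
    size_t           len;   /* payload length */
} envelope_t;

typedef struct {
    _Atomic(envelope_t *) head;  /* LIFO list of newly arrived envelopes */
} inbound_list_t;

/* Called by any sender: O(1), lock-free, no per-peer FIFO to choose. */
static void inbound_push(inbound_list_t *list, envelope_t *env)
{
    envelope_t *old = atomic_load_explicit(&list->head, memory_order_relaxed);
    do {
        env->next = old;
    } while (!atomic_compare_exchange_weak_explicit(&list->head, &old, env,
                 memory_order_release, memory_order_relaxed));
}

/* Called only by the owning process: detach everything that has arrived
 * in one shot, then reverse it so envelopes come back in arrival order. */
static envelope_t *inbound_drain(inbound_list_t *list)
{
    envelope_t *lifo = atomic_exchange_explicit(&list->head, NULL,
                                                memory_order_acquire);
    envelope_t *fifo = NULL;
    while (NULL != lifo) {
        envelope_t *next = lifo->next;
        lifo->next = fifo;
        fifo = lifo;
        lifo = next;
    }
    return fifo;
}

int main(void)
{
    inbound_list_t list;
    envelope_t a = { NULL, 1, 7, 0 };
    envelope_t b = { NULL, 2, 7, 0 };

    atomic_init(&list.head, NULL);
    inbound_push(&list, &a);   /* in real life: two different senders */
    inbound_push(&list, &b);

    for (envelope_t *e = inbound_drain(&list); NULL != e; e = e->next) {
        printf("envelope from rank %d, tag %d\n", e->src, e->tag);
    }
    return 0;
}

The drain-and-reverse step preserves arrival order, which matters for MPI matching; the receiver pays one atomic exchange per drain rather than one poll per peer.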
Rich

On 1/17/09 1:48 AM, "Eugene Loh" <eugene....@sun.com> wrote:

RFC: sm Latency

WHAT: Introducing optimizations to reduce ping-pong latencies over the sm BTL.

WHY: This is a visible benchmark of MPI performance. We can improve shared-memory latencies by anywhere from 30% (if hardware latency is the limiting factor) to 2× or more (if MPI software overhead is the limiting factor). At high process counts, the improvement can be 10× or more.

WHERE: Somewhat in the sm BTL, but very importantly also in the PML. Changes can be seen in ssh://www.open-mpi.org/~tdd/hg/fastpath.

WHEN: Upon acceptance. In time for OMPI 1.4.

TIMEOUT: February 6, 2009.

This RFC is being submitted by eugene....@sun.com.

WHY (details)

The sm BTL typically has the lowest hardware latencies of any BTL. Therefore, any OMPI software overhead we otherwise tolerate becomes glaringly obvious in sm latency measurements.

In particular, MPI pingpong latencies are oft-cited performance benchmarks and popular indications of the quality of an MPI implementation. Competitive vendor MPIs optimize this metric aggressively, both for np=2 pingpongs and for pairwise pingpongs at high np (as in the popular HPCC performance test suite).

Performance figures reported by HPCC include:
* MPI_Send()/MPI_Recv() pingpong latency.
* MPI_Send()/MPI_Recv() pingpong latency as the number of connections grows.
* MPI_Sendrecv() latency.

The slowdown of latency as the number of sm connections grows becomes increasingly important on large SMPs and on ever more prevalent many-core nodes.

Other MPI implementations, such as Scali and Sun HPC ClusterTools 6, introduced such optimizations years ago.

Performance measurements indicate that the speedups we can expect in OMPI with these optimizations range from 30% (np=2 measurements where hardware is the bottleneck) to 2× (np=2 measurements where software is the bottleneck) to over 10× (large np).

WHAT (details)

Introduce an optimized "fast path" for "immediate" sends and receives. Several actions are recommended here.

1. Invoke the sm BTL sendi (send-immediate) function

Each BTL is allowed to define a "send immediate" (sendi) function. A BTL is not required to do so, however, in which case the PML calls the standard BTL send function.

A sendi function has already been written for sm, but it has not been used due to insufficient testing. The function should be reviewed, commented in, tested, and used.

The changes are:
* File: ompi/mca/btl/sm/btl_sm.c
  * Declaration/definition: mca_btl_sm
    * Comment in the mca_btl_sm_sendi symbol instead of the NULL placeholder so that the already existing sendi function will be discovered and used by the PML.
  * Function: mca_btl_sm_sendi()
    * Review the existing sm sendi code. My suggestions include:
      * Drop the test against the eager limit, since the PML calls this function only when the eager limit is respected.
      * Make sure the function has no side effects in the case where it does not complete. See Open Issues, the final section of this document, for further discussion of "side effects".
    * Mostly, I have reviewed the code and believe it is already suitable for use.
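As a concrete illustration of the contract in step 1 (either complete the whole small send, or leave no trace), here is a self-contained toy. All names (slot_queue_t, toy_sendi, TOY_ERR_RESOURCE) are invented for this sketch; this is not the actual mca_btl_sm_sendi() code.

/*
 * Toy sendi: finish the whole "immediate" send in one shot, or fail
 * without any side effects so the caller can fall back to the regular
 * send path.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_BYTES_MAX 256            /* stand-in for the eager limit */

typedef struct {
    volatile int in_use;              /* 0 = free, 1 = holds a fragment */
    size_t       len;
    uint8_t      data[SLOT_BYTES_MAX];
} slot_queue_t;                       /* one-deep "queue" keeps the toy short */

enum { TOY_SUCCESS = 0, TOY_ERR_RESOURCE = -1 };

/* Copy header + payload into the slot, or do nothing at all. */
int toy_sendi(slot_queue_t *q,
              const void *header, size_t header_len,
              const void *payload, size_t payload_len)
{
    /* The RFC suggests the real sm sendi can drop this first check, since
     * the PML only calls sendi when the eager limit is already respected. */
    if (header_len + payload_len > SLOT_BYTES_MAX) {
        return TOY_ERR_RESOURCE;
    }
    if (q->in_use) {
        return TOY_ERR_RESOURCE;      /* no room: nothing was touched */
    }
    memcpy(q->data, header, header_len);
    memcpy(q->data + header_len, payload, payload_len);
    q->len    = header_len + payload_len;
    q->in_use = 1;                    /* publish only after the copies */
    return TOY_SUCCESS;
}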
2. Move the sendi call up higher in the PML

Profiling pingpong tests, we find that not much time is spent in the sm BTL itself. Rather, the PML consumes a lot of time preparing a "send request". While this complex data structure is needed to track the progress of a long message that will be sent in multiple chunks and progressed over multiple entries to and exits from the MPI library, managing such a large data structure for an "immediate" send (one chunk, one call) is overkill. Latency can be reduced noticeably if one bypasses this data structure, which means invoking the sendi function as early as possible in the PML.

The changes are:
* File: ompi/mca/pml/ob1/pml_ob1_isend.c
  * Function: mca_pml_ob1_send()
    * As soon as we enter the PML send function, try to call the BTL sendi function. If this fails for whatever reason, continue with the traditional PML send code path. If it succeeds, exit the PML and return up to the calling layer without ever having wrestled with the PML send-request data structure.
    * For better software management, the attempt to find and use a BTL sendi function can be organized into a new mca_pml_ob1_sendi() function.
* File: ompi/mca/pml/ob1/pml_ob1_sendreq.c
  * Function: mca_pml_ob1_send_request_start_copy()
    * Remove this attempt to call the BTL sendi function, since we have already tried to do so higher up in the PML.
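A minimal, self-contained model of the control flow proposed for mca_pml_ob1_send(): try the send-immediate entry point first, and fall back to the heavyweight send-request path only if it is absent or declines. The toy_* types and functions are invented stand-ins, not the ob1 code.

#include <stddef.h>

typedef int (*toy_sendi_fn_t)(void *module, const void *buf, size_t len);

typedef struct {
    toy_sendi_fn_t sendi;      /* NULL when the BTL offers no fast path */
    /* ... the rest of the module's function table ... */
} toy_module_t;

enum { TOY_SUCCESS = 0, TOY_ERROR = -1 };

/* Stand-in for the traditional path that builds a full send request. */
static int toy_heavyweight_send(toy_module_t *m, const void *buf, size_t len)
{
    (void)m; (void)buf; (void)len;
    return TOY_SUCCESS;
}

/* The fast path, tried as soon as the PML send routine is entered. */
int toy_pml_send(toy_module_t *m, const void *buf, size_t len)
{
    if (NULL != m->sendi && TOY_SUCCESS == m->sendi(m, buf, len)) {
        return TOY_SUCCESS;    /* done: no send request was ever built */
    }
    /* Anything else: fall back, with no fast-path side effects to undo. */
    return toy_heavyweight_send(m, buf, len);
}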
3. Introduce a BTL recvi call

While optimizing the send side of a pingpong operation is helpful, it is less than half the job. At least as many savings are possible on the receive side.

Corresponding to what we have done on the send side, on the receive side we can attempt, as soon as we have entered the PML, to call a BTL recvi (receive-immediate) function, bypassing the creation of a complex "receive request" data structure that is not needed if the receive can be completed immediately.

Further, we can perform directed polling. OMPI pingpong latencies grow significantly as the number of sm connections increases, while competitors (Scali, in any case) show absolutely flat latencies with increasing np. The recvi function could check one connection for the specified receive and exit quickly if that message is found.

A BTL is granted considerable latitude in the proposed recvi function. The principal requirement is that recvi either complete the specified receive entirely or else behave as if the function had not been called at all. (That is, one should be able to revert to the traditional code path without having to worry about any recvi side effects. So, for example, if the recvi function encounters fragments being returned to the process, it is permitted to return those fragments to the free list.)

While those are the "hard requirements" for recvi, there are also some loose guidelines. Mostly, it is understood that recvi should return "quickly" (a loose term to be interpreted by the BTL). If recvi can quickly complete the specified receive, great. If not, it should return control to the PML, which can then execute the traditional code path, which handles long messages (multiple chunks, multiple entries into the MPI library) and executes other "progress" functions.

The changes are:
* File: ompi/mca/btl/btl.h
  * Add a typedef declaration for what a generic recvi should look like:

      typedef int (*mca_btl_base_module_recvi_fn_t)();

  * Also add a btl_recvi field so that a BTL can register its recvi function, if any.
* Files:
  * ompi/mca/btl/elan/btl_elan.c
  * ompi/mca/btl/gm/btl_gm.c
  * ompi/mca/btl/mx/btl_mx.c
  * ompi/mca/btl/ofud/btl_ofud.c
  * ompi/mca/btl/openib/btl_openib.c
  * ompi/mca/btl/portals/btl_portals.c
  * ompi/mca/btl/sctp/btl_sctp.c
  * ompi/mca/btl/self/btl_self.c
  * ompi/mca/btl/sm/btl_sm.c
  * ompi/mca/btl/tcp/btl_tcp.c
  * ompi/mca/btl/template/btl_template.c
  * ompi/mca/btl/udapl/btl_udapl.c

  Each BTL must add a recvi field to its module. In most cases, a BTL will not define a recvi function, and the field will be set to NULL.
* File: ompi/mca/btl/sm/btl_sm.c
  * Function: mca_btl_sm_recvi()
    * For the sm BTL, set the field to the name of the BTL's recvi function, mca_btl_sm_recvi, and add code to define the behavior of the function.
* File: ompi/mca/btl/sm/btl_sm.h
  * Prototype: mca_btl_sm_recvi()
    * Add a prototype for the new function.
* File: ompi/mca/pml/ob1/pml_ob1_irecv.c
  * Function: mca_pml_ob1_recv()
    * As soon as we enter the PML, try to find and use a BTL's recvi function. If we succeed, we can exit the PML without ever having touched the heavy-duty PML receive-request data structure. If we fail, we simply revert to the traditional PML receive code path, without having to worry about any side effects the failed recvi might have left.
    * It is helpful to contain the recvi attempt in a new mca_pml_ob1_recvi() function, which we add.
* File: ompi/class/ompi_fifo.h
  * Function: ompi_fifo_probe_tail()
    * We do not want recvi to leave any side effects if it encounters a message it is not prepared to handle. We therefore need to be able to see what is on a FIFO without popping that entry off the FIFO, so we add this new function, which probes the FIFO without disturbing it.
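Here is a self-contained toy of the probe-then-complete pattern: peek at the tail of the one expected FIFO without popping it (in the spirit of the proposed ompi_fifo_probe_tail()), complete the receive only if the waiting fragment matches, and otherwise return with the FIFO exactly as it was found. The fifo_t layout and the toy_* names are invented for illustration.

#include <stddef.h>
#include <string.h>

#define FIFO_DEPTH 64

typedef struct {
    int    src, tag;                  /* toy "match header" fields */
    size_t len;
    void  *payload;
} fragment_t;

typedef struct {
    fragment_t *entry[FIFO_DEPTH];
    unsigned    head, tail;           /* head: writer side; tail: reader side */
} fifo_t;

/* Like the proposed ompi_fifo_probe_tail(): look without popping. */
static fragment_t *fifo_probe_tail(fifo_t *f)
{
    return (f->tail == f->head) ? NULL : f->entry[f->tail % FIFO_DEPTH];
}

static void fifo_pop_tail(fifo_t *f) { f->tail++; }

enum { TOY_OK = 0, TOY_NOT_HANDLED = -1 };

/* Complete the receive immediately, or report "not handled" with no
 * side effects so the caller can revert to the usual receive path. */
int toy_recvi(fifo_t *from_peer, int src, int tag,
              void *buf, size_t max_len, size_t *received)
{
    fragment_t *frag = fifo_probe_tail(from_peer);   /* directed polling */

    if (NULL == frag || frag->src != src || frag->tag != tag ||
        frag->len > max_len) {
        return TOY_NOT_HANDLED;       /* FIFO left exactly as found */
    }
    memcpy(buf, frag->payload, frag->len);
    *received = frag->len;
    fifo_pop_tail(from_peer);         /* consume only on success */
    return TOY_OK;
}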
4. Introduce an "immediate" data convertor

One of our aims here is to reduce latency by bypassing the expensive PML send-request and receive-request data structures. Again, these structures are useful when we intend to complete a message over multiple chunks and multiple MPI library invocations, but they are overkill for a message that can be completed all at once.

The same is true of data convertors. Convertors pack user data into shared-memory buffers and unpack them on the receive side. They allow a message to be sent in multiple chunks, over the course of multiple unrelated MPI calls, and for noncontiguous datatypes. These sophisticated data structures are overkill in some important cases, such as messages that are handled in a single chunk, in a single MPI call, and consist of a single contiguous block of data.

While data convertors are not typically too expensive, for shared-memory latency, where all other costs have been pared back to a minimum, they become noticeable -- around 10%.

Therefore, we recognize special cases in which we can use barebones, minimal data convertors. In these cases, we initialize the convertor structure minimally -- e.g., a buffer address, a number of bytes to copy, and a flag indicating that all other fields are uninitialized. If this is not possible (e.g., because a noncontiguous user-derived datatype is being used), the "immediate" send or receive uses data convertors normally.

The changes are:
* File: ompi/datatype/convertor.h
  * First, add a new convertor flag,

      #define CONVERTOR_IMMEDIATE 0x10000000

    to identify a data convertor that has been initialized only minimally.
  * Further, add three new functions:
    * ompi_convertor_immediate(): try to form an "immediate" convertor
    * ompi_convertor_immediate_pack(): use an "immediate" convertor to pack
    * ompi_convertor_immediate_unpack(): use an "immediate" convertor to unpack
* File: ompi/mca/btl/sm/btl_sm.c
  * Functions: mca_btl_sm_sendi() and mca_btl_sm_recvi()
    * Use the "immediate" convertor routines to pack/unpack.
* Files: ompi/mca/pml/ob1/pml_ob1_isend.c and ompi/mca/pml/ob1/pml_ob1_irecv.c
  * Have the PML fast path try to construct an "immediate" convertor.
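A self-contained toy of the "immediate" convertor idea follows. Only the CONVERTOR_IMMEDIATE value is taken from the text above; the toy_convertor_t layout and the toy_* function names are invented, and the real ompi_convertor_immediate*() routines would of course operate on the actual convertor structure.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CONVERTOR_IMMEDIATE 0x10000000

typedef struct {
    uint32_t    flags;
    const void *base;                 /* user buffer            */
    size_t      bytes;                /* total bytes to move    */
    /* ... many more fields in a fully initialized convertor ... */
} toy_convertor_t;

/* Succeeds only for the easy case: contiguous data, one chunk, one call. */
bool toy_convertor_immediate(toy_convertor_t *conv,
                             const void *buf, size_t count,
                             size_t elem_size, bool contiguous)
{
    if (!contiguous) {
        return false;                 /* fall back to the normal convertor */
    }
    conv->flags = CONVERTOR_IMMEDIATE;
    conv->base  = buf;
    conv->bytes = count * elem_size;
    return true;
}

/* Packing with an immediate convertor degenerates to a single memcpy. */
void toy_convertor_immediate_pack(const toy_convertor_t *conv, void *dst)
{
    memcpy(dst, conv->base, conv->bytes);
}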
5. Introduce an "immediate" MPI_Sendrecv()

The optimizations described here should be extended to MPI_Sendrecv() operations. In particular, while the MPI_Send() and MPI_Recv() optimizations improve HPCC "pingpong" latencies, we need MPI_Sendrecv() optimizations to improve HPCC "ring" latencies.

One challenge is the current OMPI MPI/PML interface. Today, the OMPI MPI layer breaks a Sendrecv call up into Irecv/Send/Wait, which would seem to defeat fast-path optimizations, at least for the receive. Some options include:
* allow the MPI layer to call "fast path" operations
* have the PML layer provide a Sendrecv interface
* have the MPI layer emit Isend/Recv/Wait and see how effectively one can optimize the Isend operation in the PML for the "immediate" case

Performance Measurements: Before Optimization

More measurements are desirable, but here is a sampling of data from platforms I happened to have access to. These data characterize OMPI today, without fast-path optimizations.

OMPI versus Other MPIs

Here are pingpong latencies, in μsec, measured with the OSU latency test for 0 and 8 bytes:

            0-byte   8-byte
  OMPI       0.74     0.84   μsec
  MPICH      0.70     0.77

We see OMPI lagging MPICH. Scali and HP MPI are presumably considerably faster, but I have no recent data. Among other things, one can see that there is about a 10% penalty for invoking data convertors.

Scaling with Process Count

Here are HPCC pingpong latencies from a different, older platform. Though only two processes participate in the pingpong, the HPCC test reports that latency for different numbers of processes in the job. We see that OMPI performance slows dramatically as the number of processes is increased. Scali (data not available) does not show such a slowdown.

  np     min     avg     max
   2    2.688   2.719   2.750   μsec
   4    2.812   2.875   3.000
   6    2.875   3.050   3.250
   8    2.875   3.299   3.625
  10    2.875   3.447   3.812
  12    3.063   3.687   4.375
  16    2.687   4.093   5.063
  20    2.812   4.492   6.000
  24    3.125   5.026   6.562
  28    3.250   5.326   7.250
  32    3.500   5.830   8.375
  36    3.750   6.199   8.938
  40    4.062   6.753  10.187

The data show large min-max variations in latency. These variations happen to depend on sender and receiver ranks. Here are latencies (rounded down to the nearest μsec) for the np=40 case as a function of sender and receiver rank:

  [A 40x40 table of pairwise latencies appeared here, with one process's rank along the columns and the other's along the rows. The values run from roughly 9-10 μsec between low-numbered ranks down to roughly 4 μsec between high-numbered ranks.]
We see that there is a strong dependence on process rank. Presumably, this is due to our polling loop: even if we have already received our message, we still have to poll the higher-numbered ranks before we complete the receive operation.

Performance Measurements: After Optimization

We consider three metrics:
* HPCC "pingpong" latency
* OSU latency (0 bytes)
* OSU latency (8 bytes)

We report data for:
* OMPI "out of the box"
* after implementation of steps 1-2 (send side)
* after implementation of steps 1-3 (send and receive sides)
* after implementation of steps 1-4 (send and receive sides, plus data convertor)

The data are from machines that I just happened to have available. There is a bit of noise in these results, but the implications, based on these and other measurements, are:
* There is some improvement from the send side.
* There is more improvement from the receive side.
* The data-convertor improvements help a little more (a few percent) for non-null messages.
* The degree of improvement depends on how fast the CPU is relative to the memory -- that is, how important software overheads are versus hardware latency.
  * If the CPU is fast (and hardware latency is the bottleneck), the improvements are smaller -- say, 20-30%.
  * If the CPU is slow (and software costs are the bottleneck), the improvements are more dramatic -- nearly a factor of 2 for non-null messages.
* As np is increased, latency stays flat. This can represent a 10× or more improvement over out-of-the-box OMPI.

V20z

Here are results for a V20z (burl-ct-v20z-11):

                HPCC   OSU0   OSU8
  out of box     838    770    850   nsec
  Steps 1-2      862    770    860
  Steps 1-3      670    610    670
  Steps 1-4      642    580    610

F6900

Here are np=2 results from a 1.05-GHz (1.2?) UltraSPARC-IV F6900 server:

                HPCC   OSU0   OSU8
  out of box    3430   2770   3340   nsec
  Steps 1-2     2940   2660   3090
  Steps 1-3     1854   1650   1880
  Steps 1-4     1660   1640   1750

Here is the dependence on process count using HPCC:

            OMPI "out of the box"          optimized
  comm      ---------------------    ---------------------
  size       min     avg     max      min     avg     max
    2       2688    2719    2750     1750    1781    1812   nsec
    4       2812    2875    3000     1750    1802    1812
    6       2875    3050    3250     1687    1777    1812
    8       2875    3299    3625     1687    1773    1812
   10       2875    3447    3812     1687    1789    1812
   12       3063    3687    4375     1687    1796    1813
   16       2687    4093    5063     1500    1784    1875
   20       2812    4492    6000     1687    1788    1875
   24       3125    5026    6562     1562    1776    1875
   28       3250    5326    7250     1500    1764    1813
   32       3500    5830    8375     1562    1755    1875
   36       3750    6199    8938     1562    1755    1875
   40       4062    6753   10187     1500    1742    1812

Note:
* At np=2, these optimizations lead to a 2× improvement in shared-memory latency.
* Non-null messages incur more than a 10% penalty, which is largely addressed by our data-convertor optimization.
* At larger np, we maintain our fast performance while OMPI "out of the box" keeps slowing down further and further.

M9000

Here are results for a 128-core M9000. I think the system has:
* 2 hardware threads per core (but we use only one hardware thread per core)
* 4 cores per socket
* 4 sockets per board
* 4 boards per (half?)
* 2 (halves?) per system

As one separates the sender and receiver, hardware latency increases. Here is the hierarchy:

                     latency (nsec)          bandwidth (Mbyte/sec)
                 out-of-box   fastpath      out-of-box   fastpath
  (on-socket?)       810         480           2000        2000
  (on-board?)       2050        1820           1900        1900
  (half?)           3030        2840           1680        1680
                    3150        2960           1660        1660

Note:
* Latency improves by some hundreds of nsec with fastpath.
* This latency improvement is striking when the hardware latency is small, but less noticeable as the hardware latency increases.
* Bandwidth is not very sensitive to hardware latency (due to prefetch) and not at all to fast-path optimizations.

Here are HPCC pingpong latencies for increasing process counts:

             out-of-box                      fastpath
   np      min      avg      max       min     avg     max
    2       812      812      812      499     499     499
    4       874      921      999      437     494     562
    8       937     1847     2624      437    1249    1874
   16      1062     2430     2937      437    1557    1937
   32      1562     3850     5437      375    2211    2875
   64      2687     8329    15874      437    2535    3062
   80      3499    16854    41749      374    2647    3437
   96      3812    31159   100812      374    2717    3437
  128      5187   125774   335187      437    2793    3499

The improvements are tremendous:
* At low np, latencies are low since sender and receiver can be colocated. Nevertheless, fast-path optimizations provide a nearly 2× improvement.
* As np increases, fast-path latency also increases, but this is due to higher hardware latencies. Indeed, the "min" numbers even drop a little. The "max" fast-path numbers basically only reflect the increase in hardware latency.
* As np increases, OMPI "out of the box" latency suffers catastrophically. Not only is there the issue of more connections to poll, but the polling behavior of non-participating processes wreaks havoc on the performance of the measured processes.
* We can separate the two sources of latency degradation by putting the np-2 non-participating processes to sleep. In that case, latency rises only to about 10-20 μsec. So, polling of many connections causes a substantial rise in latency, while the disturbance of hard-poll loops on system performance is responsible for even more degradation.

Actually, even bandwidth benefits:

           out-of-box               fastpath
   np     min    avg    max      min    avg    max
    2    2015   2034   2053     2028   2039   2051
    4    2002   2043   2077     1993   2032   2065
    8    1888   1959   2035     1897   1969   2088
   16    1863   1934   2046     1856   1937   2066
   32    1626   1796   2038     1581   1798   2068
   64    1557   1709   1969     1591   1729   2084
   80    1439   1619   1902     1561   1706   2059
   96    1281   1452   1722     1500   1689   2005
  128     677    835   1276      893   1671   1906

Here, we see that even bandwidth suffers "out of the box" as the number of hard-spinning processes increases. Note the degradation in "out-of-box" average bandwidths as np increases. In contrast, the "fastpath" average holds up well.
(The np=128 min fastpath number of 893 Mbyte/sec is poor, but analysis shows it to be a measurement outlier.)

MPI_Sendrecv()

We should also get these optimizations into MPI_Sendrecv() in order to speed up the HPCC "ring" results. For example, here are latencies in μsec for a performance measurement based on HPCC "ring" tests:

  np=64                        natural   random
  "out of box"                   11.7     10.9
  fast path                       8.3      6.2
  fast path and 100 warmups       3.5      3.6

  np=128                       natural   random
  "out of box"                  242.9    226.1
  fast path                      56.6     37.0
  fast path and 100 warmups       4.2      4.1

There happen to be two problems here:
* We need fast-path optimizations in MPI_Sendrecv() for improved performance.
* The MPI collective operation preceding the "ring" measurement has "ragged" exit times, so the "ring" timing starts well before all of the processes have entered that measurement. This is a separate OMPI performance problem that must also be handled for good HPCC results.

Open Issues

Here are some open issues:
* Side effects. Should the sendi and recvi functions leave any side effects if they do not complete the specified operation?
  * To my taste, they should not.
  * Currently, however, the sendi function is expected to allocate a descriptor if it can, even if it cannot complete the entire send operation.
* recvi: BTL and match header. An incoming message starts with a "match header" carrying such data as the MPI source rank, MPI communicator, and MPI tag for performing MPI message matching. Presumably, the BTL knows nothing about this header; message matching is performed, for example, via PML callback functions. We are aggressively trying to optimize this code path, however, so we should consider alternatives to that approach.
  * One alternative is simply for the BTL to perform a byte-by-byte comparison between the received header and the specified header. The PML already tells the BTL how many bytes are in the header.
  * One problem with this approach is that the fast path would not be able to support the wildcard tag MPI_ANY_TAG.
  * Further, it leaves open the question of how one extracts information (such as source or tag) from this header for the MPI_Status structure.
  * We can imagine a variety of solutions here, but so far we have implemented a very simple (even if architecturally distasteful) solution: we hardwire information (previously private to the PML) about the match header into the BTL. That approach can be replaced with other solutions.
* MPI_Sendrecv() support. As discussed earlier, we should support fast-path optimizations for "immediate" send-receive operations. Again, this may entail some movement of current OMPI architectural boundaries.

Other optimizations that are needed for good HPCC results include:
* reducing the degradation due to hard spin waits
* improving the performance of collective operations (which "artificially" degrade HPCC "ring" test results)
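As a closing illustration of the byte-by-byte comparison alternative mentioned in the open issues, here is a self-contained toy. The header layout is invented (the real ob1 match header is a PML structure whose size the PML communicates to the BTL), and, as noted above, an exact byte comparison cannot express MPI_ANY_TAG or MPI_ANY_SOURCE wildcards.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy match header: field widths chosen so the struct has no padding,
 * which keeps a raw memcmp() well defined. */
typedef struct {
    uint32_t ctx;                     /* communicator context id */
    int32_t  src;                     /* sender rank             */
    int32_t  tag;                     /* message tag             */
} toy_match_hdr_t;

/* The receiver builds the header it expects; the BTL just compares bytes.
 * In the proposal, the PML tells the BTL how many bytes the header occupies. */
bool toy_header_matches(const void *received_hdr,
                        const toy_match_hdr_t *expected)
{
    return 0 == memcmp(received_hdr, expected, sizeof(*expected));
}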