[OMPI devel] RFC: sm Latency

Eugene Loh Sat, 17 Jan 2009 01:43:57 -0500

Title: RFC: sm Latency

RFC: `sm` Latency

WHAT: Introducing optimizations to reduce ping-pong latencies over the sm BTL.

WHY: This is a visible benchmark of MPI performance. We can improve shared-memory latencies from 30% (if hardware latency is the limiting factor) to 2× or more (if MPI software overhead is the limiting factor). At high process counts, the improvement can be 10× or more.

WHERE: Somewhat in the sm BTL, but very importantly also in the PML. Changes can be seen in ssh://www.open-mpi.org/~tdd/hg/fastpath.

WHEN: Upon acceptance. In time for OMPI 1.4.

TIMEOUT: February 6, 2009.

This RFC is being submitted by eugene....@sun.com.

WHY (details)

The sm BTL typically has the lowest hardware latencies of any BTL. Therefore, any OMPI software overhead we otherwise tolerate becomes glaringly obvious in sm latency measurements.

In particular, MPI pingpong latencies are oft-cited performance benchmarks, popular indications of the quality of an MPI implementation. Competitive vendor MPIs optimize this metric aggressively, both for np=2 pingpongs and for pairwise pingpongs for high np (like the popular HPCC performance test suite).

Performance reported by HPCC include:

MPI_Send()/MPI_Recv() pingpong latency.
MPI_Send()/MPI_Recv() pingpong latency as the number of connections grows.
MPI_Sendrecv() latency.

The slowdown of latency as the number of sm connections grows becomes increasingly important on large SMPs and ever more prevalent many-core nodes.

Other MPI implementations, such as Scali and Sun HPC ClusterTools 6, introduced such optimizations years ago.

Performance measurements indicate that the speedups we can expect in OMPI with these optimizations range from 30% (np=2 measurements where hardware is the bottleneck) to 2× (np=2 measurements where software is the bottleneck) to over 10× (large np).

WHAT (details)

Introduce an optimized "fast path" for "immediate" sends and receives. Several actions are recommended here.

1. Invoke the `sm` BTL `sendi` (send-immediate) function

Each BTL is allowed to define a "send immediate" (sendi) function. A BTL is not required to do so, however, in which case the PML calls the standard BTL send function.

A sendi function has already been written for sm, but it has not been used due to insufficient testing.

The function should be reviewed, commented in, tested, and used.

The changes are:

File: ompi/mca/btl/sm/btl_sm.c
Declaration/Definition: mca_btl_sm

Comment in the mca_btl_sm_sendi symbol instead of the NULL placeholder so that the already existing sendi function will be discovered and used by the PML.
Function: mca_btl_sm_sendi()

Review the existing sm sendi code. My suggestions include:
- Drop the test against the eager limit since the PML calls this function only when the eager limit is respected.
- Make sure the function has no side effects in the case where it does not complete. See Open Issues, the final section of this document, for further discussion of "side effects".
Mostly, I have reviewed the code and believe it's already suitable for use.

2. Move the `sendi` call up higher in the PML

Profiling pingpong tests, we find that not so much time is spent in the sm BTL. Rather, the PML consumes a lot of time preparing a "send request". While these complex data structures are needed to track progress of a long message that will be sent in multiple chunks and progressed over multiple entries to and exits from the MPI library, managing this large data structure for an "immediate" send (one chunk, one call) is overkill. Latency can be reduced noticeably if one bypasses this data structure. This means invoking the sendi function as early as possible in the PML.

The changes are:

File: ompi/mca/pml/ob1/pml_ob1_isend.c
Function: mca_pml_ob1_send()

As soon as we enter the PML send function, try to call the BTL sendi function. If this fails for whatever reason, continue with the traditional PML send code path. If it succeeds, then exit the PML and return up to the calling layer without having to have wrestled with the PML send-request data structure.
For better software management, the attempt to find and use a BTL sendi function can be organized into a new mca_pml_ob1_sendi() function.
File: ompi/mca/pml/ob1/pml_ob1_sendreq.c
Function: mca_pml_ob1_send_request_start_copy()

Remove this attempt to call the BTL sendi function, since we've already tried to do so higher up in the PML.

3. Introduce a BTL `recvi` call

While optimizing the send side of a pingpong operation is helpful, it is less than half the job. At least as many savings are possible on the receive side.

Corresponding to what we've done on the send side, on the receive side we can attempt, as soon as we've entered the PML, to call a BTL recvi (receive-immediate) function, bypassing the creation of a complex "receive request" data structure that is not needed if the receive can be completed immediately.

Further, we can perform directed polling. OMPI pingpong latencies grow significantly as the number of sm connections increases, while competitors (Scali, in any case) show absolutely flat latencies with increasing np. The recvi function could check one connection for the specified receive and exit quickly if that message if found.

A BTL is granted considerable latitude in the proposed recvi functions. The principle requirement is that the recvi either completes the specified receive completely or else behaves as if the function was not called at all. (That is, one should be able to revert to the traditional code path without having to worry about any recvi side effects. So, for example, if the recvi function encounters any fragments being returned to the process, it is permitted to return those fragments to the free list.)

While those are the "hard requirements" for recvi, there are also some loose guidelines. Mostly, it is understood that recvi should return "quickly" (a loose term to be interpreted by the BTL). If recvi can quickly complete the specified receive, great! If not, it should return control to the PML, who can then execute the traditional code path, which can handle long messages (multiple chunks, multiple entries into the MPI library) and execute other "progress" functions.

The changes are:

File: ompi/mca/btl/btl.h

In this file, we add a typedef declaration for what a generic recvi should look like:
```
     typedef int (*mca_btl_base_module_recvi_fn_t)();
     
```
We also add a btl_recvi field so that a BTL can register its recvi function, if any.
File:
ompi/mca/btl/elan/btl_elan.c
ompi/mca/btl/gm/btl_gm.c
ompi/mca/btl/mx/btl_mx.c
ompi/mca/btl/ofud/btl_ofud.c
ompi/mca/btl/openib/btl_openib.c
ompi/mca/btl/portals/btl_portals.c
ompi/mca/btl/sctp/btl_sctp.c
ompi/mca/btl/self/btl_self.c
ompi/mca/btl/sm/btl_sm.c
ompi/mca/btl/tcp/btl_tcp.c
ompi/mca/btl/template/btl_template.c
ompi/mca/btl/udapl/btl_udapl.c

Each BTL must add a recvi field to its module. In most cases, BTLs will not define a recvi function, and the field will be set to NULL.
File: ompi/mca/btl/sm/btl_sm.c
Function: mca_btl_sm_recvi()

For the sm BTL, we set the field to the name of the BTL's recvi function: mca_btl_sm_recvi. We also add code to define the behavior of the function.
File: ompi/mca/btl/sm/btl_sm.h
Prototype: mca_btl_sm_recvi()

We also add a prototype for the new function.
File: ompi/mca/pml/ob1/pml_ob1_irecv.c
Function: mca_pml_ob1_recv()

As soon as we enter the PML, we try to find and use a BTL's recvi function. If we succeed, we can exit the PML without having had invoked the heavy-duty PML receive-request data structure. If we fail, we simply revert to the traditional PML receive code path, without having to worry about any side effects that the failed recvi might have left.
It is helpful to contain the recvi attempt in a new mca_pml_ob1_recvi() function, which we add.
File: ompi/class/ompi_fifo.h
Function: ompi_fifo_probe_tail()

We don't want recvi to leave any side effects if it encounters a message it is not prepared to handle. Therefore, we need to be able to see what is on a FIFO without popping that entry off the FIFO. Therefore, we add this new function that probes the FIFO without disturbing it.

4. Introduce an "immediate" data convertor

One of our aims here is to reduce latency by bypassing expensive PML send and receive request data structures. Again, these structures are useful when we intend to complete a message over multiple chunks and multiple MPI library invocations, but are overkill for a message that can be completed all at once.

The same is true of data convertors. Convertors pack user data into shared-memory buffers or unpack them on the receive side. Convertors allow a message to be sent in multiple chunks, over the course of multiple unrelated MPI calls, and for noncontiguous datatypes. These sophisticated data structures are overkill in some important cases, such as messages that are handled in a single chunk and in a single MPI call and consist of a single contiguous block of data.

While data convertors are not typically too expensive, for shared-memory latency, where all other costs have been pared back to a minimum, convertors become noticeable -- around 10%.

Therefore, we recognize special cases where we can have barebones, minimal, data convertors. In these cases, we initialize the convertor structure minimally -- e.g., a buffer address, a number of bytes to copy, and a flag indicating that all other fields are uninitialized. If this is not possible (e.g., because a non-contiguous user-derived datatype is being used), the "immediate" send or receive uses data convertors normally.

The changes are:

File: ompi/datatype/convertor.h

First, we add to the convertor flags a new flag
```
     #define CONVERTOR_IMMEDIATE        0x10000000
     
```
to identify a data convertor that has been initialized only minimally.
Further, we add three new functions:
- ompi_convertor_immediate(): try to form an "immediate" convertor
- ompi_convertor_immediate_pack(): use an "immediate" convertor to pack
- ompi_convertor_immediate_unpack(): use an "immediate" convertor to unpack
File: ompi/mca/btl/sm/btl_sm.c
Function: mca_btl_sm_sendi and mca_btl_sm_recvi

Use the "immediate" convertor routines to pack/unpack.
File: ompi/mca/pml/ob1/pml_ob1_isend.c and ompi/mca/pml/ob1/pml_ob1_irecv.c

Have the PML fast path try to construct an "immediate" convertor.

5. Introduce an "immediate" `MPI_Sendrecv()`

The optimizations described here should be extended to MPI_Sendrecv() operations. In particular, while MPI_Send() and MPI_Recv() optimizations improve HPCC "pingpong" latencies, we need MPI_Sendrecv() optimizations to improve HPCC "ring" latencies.

One challenge is the current OMPI MPI/PML interface. Today, the OMPI MPI layer breaks a Sendrecv call up into Irecv/Send/Wait. This would seem to defeat fast-path optimizations at least for the receive. Some options include:

allow the MPI layer to call "fast path" operations
have the PML layer provide a Sendrecv interface
have the MPI layer emit Isend/Recv/Wait and see how effectively one can optimize the Isend operation in the PML for the "immediate" case

Performance Measurements: Before Optimization

More measurements are desirable, but here is a sampling of data that I happen to have from platforms that I happened to have access to. This data characterizes OMPI today, without fast-path optimizations.

OMPI versus Other MPIs

Here are pingpong latencies, in µsec, measured with the OSU latency test for 0 and 8 bytes.

                       0-byte  8-byte

     OMPI              0.74    0.84  µsec
     MPICH             0.70    0.77

We see OMPI lagging MPICH.

Scali and HP MPI are presumably considerably faster, but I have no recent data.

Among other things, one can see that there is about a 10% penalty for invoking data convertors.

Scaling with Process Count

Here are HPCC pingpong latencies from a different, older, platform. Though only two processes participate in the pingpong, the HPCC test reports that latency for different numbers of processes in the job. We see that OMPI performance slows dramatically as the number of processes is increased. Scali (data not available) does not show such a slowdown.

     np    min    avg    max

      2   2.688  2.719  2.750 usec
      4   2.812  2.875  3.000
      6   2.875  3.050  3.250
      8   2.875  3.299  3.625
     10   2.875  3.447  3.812
     12   3.063  3.687  4.375
     16   2.687  4.093  5.063
     20   2.812  4.492  6.000
     24   3.125  5.026  6.562
     28   3.250  5.326  7.250
     32   3.500  5.830  8.375
     36   3.750  6.199  8.938
     40   4.062  6.753 10.187

The data show large min-max variations in latency. These variations happen to depend on sender and receiver ranks. Here are latencies (rounded down to the nearst µsec) for the np=40 case as a function of sender and receiver rank:

                                    ---------   rank of one process   ----------->

                     - 9 9 9 9 9 9 9 9 9 9 9 9 9 8 8 7 7 7 7 7 6 7 8 7 7 7 7 7 6 7 7 7 6 7 7 7 7 6 7
                     9 - 9 9 9 9 9 9 9 9 8 8 8 8 8 8 7 7 7 7 8 7 7 7 7 7 6 7 7 7 7 7 6 7 6 7 7 7 7 7
                     9 9 - 9 9 9 9 9 9 9 8 9 7 7 7 8 9 7 7 7 7 7 7 7 7 7 6 7 8 6 7 7 7 7 7 7 6 7 7 6
                     9 910 - 9 9 9 8 8 8 7 9 7 8 7 7 7 8 8 7 7 8 7 7 6 7 7 7 7 7 6 6 7 6 7 7 7 7 7 7
                     9 9 9 9 - 9 9 9 8 8 8 7 7 8 7 8 8 8 7 7 7 8 8 7 6 6 7 8 7 7 6 6 7 7 6 7 7 6 7 7
                     9 9 9 9 9 - 9 9 9 8 7 7 8 8 8 7 8 7 7 8 8 6 7 7 6 7 7 7 7 6 6 6 7 7 7 7 6 6 6 6
                     9 9 9 9 9 9 - 9 9 8 9 8 8 8 7 8 8 7 8 6 7 7 7 7 7 7 6 6 7 7 6 7 6 7 6 7 7 6 7 6
                     9 9 9 9 9 9 9 - 9 8 8 8 8 9 8 7 8 7 8 7 7 6 7 7 7 7 7 6 7 7 7 7 7 7 7 7 6 7 7 7
                     9 9 8 9 9 8 8 9 - 7 9 9 9 7 7 7 8 8 8 7 7 7 6 7 7 7 6 7 6 6 6 6 7 6 7 6 6 6 7 6
                     9 9 9 9 7 7 8 8 8 - 8 9 8 7 7 7 8 7 7 7 7 7 7 7 7 7 6 6 7 6 7 6 7 7 6 7 7 6 6 6
                     9 9 9 9 9 8 9 9 7 9 - 8 7 8 7 7 6 8 7 7 7 6 7 7 7 7 7 7 6 6 6 6 7 7 7 6 6 7 7 6
           |         9 8 8 9 8 7 8 8 8 8 7 - 9 7 7 8 7 7 7 7 7 7 7 6 6 6 7 6 7 6 6 6 7 7 6 6 7 6 7 5
           |         8 8 9 8 9 7 7 8 8 7 7 8 - 7 8 9 8 7 7 7 6 6 7 7 7 7 7 6 7 6 7 7 7 6 7 6 6 6 6 6
           |         8 8 8 8 8 9 7 8 8 7 7 7 7 - 8 8 8 8 7 7 7 6 7 7 7 6 6 6 6 7 7 7 7 6 6 6 6 6 5 6
           |         6 7 9 9 9 7 7 8 7 7 8 7 8 7 - 6 8 7 7 7 8 7 7 7 7 6 6 7 7 7 6 7 6 7 7 6 6 6 4 5
           |         7 7 6 8 7 8 8 8 7 7 8 7 8 9 7 - 7 7 7 8 7 7 6 7 7 7 7 6 7 6 7 6 6 6 6 6 6 5 5 5
                     7 9 7 8 9 7 8 7 8 8 8 7 7 7 7 7 - 7 8 7 8 7 7 7 7 7 6 7 7 6 6 7 6 6 6 4 5 5 5 5
         rank        8 8 8 7 9 7 8 7 7 7 8 7 7 7 7 7 8 - 7 7 7 7 7 7 7 6 7 7 7 6 6 7 7 6 6 6 6 5 4 5
          of         7 7 7 8 6 8 6 7 8 7 6 7 7 7 7 7 7 7 - 7 7 7 7 7 7 6 6 7 6 6 6 6 6 6 6 6 6 5 5 4
         the         8 7 8 8 7 8 8 7 7 7 7 7 7 7 7 7 7 7 8 - 7 7 7 7 7 7 7 6 7 6 6 6 6 5 5 5 5 5 4 4
        other        8 7 7 8 7 7 7 7 8 7 7 7 8 7 7 7 7 7 7 7 - 7 7 6 7 7 7 7 6 6 7 6 6 6 5 5 5 5 5 5
       process       7 6 6 7 7 7 8 7 7 6 6 7 6 7 6 7 8 7 7 8 7 - 7 7 7 7 7 7 6 6 6 6 6 5 5 5 4 4 4 4
                     7 8 7 7 7 7 7 7 8 8 7 7 7 7 7 6 7 6 7 7 7 7 - 7 7 7 7 6 6 6 4 5 5 6 4 4 4 6 5 5
           |         7 6 7 7 7 6 6 7 6 8 7 8 7 7 7 7 7 7 7 7 7 7 7 - 7 6 6 6 6 5 5 5 6 5 4 4 5 5 4 4
           |         7 7 7 6 7 7 7 7 8 7 6 7 6 6 7 6 6 6 6 7 6 7 7 7 - 6 6 6 5 5 5 5 5 4 4 5 6 4 5 4
           |         6 7 7 7 7 7 7 7 8 8 8 7 7 7 6 7 7 7 6 6 7 7 7 6 5 - 6 5 6 6 5 5 5 4 5 5 5 4 4 4
           |         7 7 6 7 7 7 7 8 7 7 7 7 6 7 7 7 7 7 6 7 6 6 6 5 5 4 - 5 5 5 4 5 5 5 4 5 5 4 4 4
           |         7 7 7 8 7 6 7 6 7 7 7 7 7 6 7 7 7 7 6 6 6 6 6 4 6 4 5 - 5 4 4 5 4 4 5 5 5 4 4 4
           V         7 6 8 7 7 6 6 7 6 7 7 7 7 7 6 7 7 6 6 6 7 6 6 5 6 5 5 4 - 4 5 5 4 4 4 4 4 4 4 5
                     6 6 6 6 6 6 7 8 7 6 7 7 7 7 6 6 7 6 6 5 5 6 6 5 5 6 5 5 4 - 5 4 4 4 4 4 4 6 4 4
                     6 6 6 7 6 7 7 7 7 6 7 7 6 6 7 7 7 6 6 6 6 6 5 4 4 4 5 4 4 4 - 5 5 4 4 4 4 4 4 4
                     7 6 7 6 6 6 7 7 7 6 7 7 6 6 6 7 6 6 6 5 6 5 5 5 5 4 4 4 5 5 6 - 4 4 4 4 4 4 4 4
                     7 7 6 6 6 6 6 7 7 7 6 7 6 7 7 7 6 6 5 5 4 5 5 4 4 4 4 5 4 4 5 4 - 4 4 4 5 4 4 4
                     7 6 7 6 6 6 6 6 7 7 7 7 6 7 6 6 6 6 6 5 5 4 5 4 4 4 4 4 4 4 4 4 4 - 5 4 4 4 4 5
                     7 6 7 7 7 8 7 7 6 6 6 7 6 6 6 6 5 5 4 5 5 5 4 4 5 4 4 4 4 4 4 4 4 4 - 4 4 4 4 4
                     7 6 7 6 7 6 6 6 6 6 7 7 6 6 6 6 5 5 5 4 4 4 4 4 5 4 4 4 4 4 4 4 4 4 4 - 4 4 4 4
                     8 6 6 7 7 7 7 8 7 6 6 7 6 6 6 6 5 4 5 4 5 5 4 5 4 4 5 4 4 4 4 5 5 4 4 4 - 4 4 4
                     7 7 7 6 7 7 6 7 6 6 7 6 6 6 6 5 4 5 4 5 4 4 4 4 4 4 4 4 4 5 4 4 4 4 4 4 4 - 4 4
                     7 7 7 7 7 6 7 7 6 7 7 7 7 5 4 5 5 4 5 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 - 4
                     7 6 7 7 6 7 6 6 6 6 6 6 6 6 5 5 6 4 4 5 4 5 4 5 4 4 4 4 5 4 4 4 5 4 4 4 4 4 4 -

We see that there is a strong dependence on process rank. Presumably, this is due to our polling loop. That is, even if we receive our message, we still have to poll the higher numbered ranks before we complete the receive operation.

Performance Measurements: After Optimization

We consider three metrics:

HPCC "pingpong" latency
OSU latency (0 bytes)
OSU latency (8 bytes)

We report data for:

OMPI "out of the box"
after implementation of steps 1-2 (send side)
after implementation of steps 1-3 (send and receive sides)
after implementation of steps 1-4 (send and receive sides, plus data convertor)

The data are from machines that I just happened to have available.

There is a bit of noise in these results, but the implications, based on these and other measurements, are:

There is some improvement from the send side.
There is more improvement from the receive side.
The data convertor improvements help a little more (a few percent) for non-null messages.
The degree of improvement depends on how fast the CPU is relative to the memory -- that is, how important software overheads are versus hardware latency.
- If the CPU is fast (and hardware latency is the bottleneck), these improvements are less -- say, 20-30%.
- If the CPU is slow (and software costs are the bottleneck), the improvements are more dramatic -- nearly a factor of 2 for non-null messages.
As np is increased, latency stays flat. This can represent a 10× or more improvement over out-of-the-box OMPI.

V20z

Here are results for a V20z (burl-ct-v20z-11):

                  HPCC OSU0 OSU8

   out of box      838  770  850   nsec
   Steps 1-2       862  770  860
   Steps 1-3       670  610  670
   Steps 1-4       642  580  610

F6900

Here are np=2 results from a 1.05-GHz (1.2?) UltraSPARC-IV F6900 server:

                  HPCC OSU0 OSU8

   out of box     3430 2770 3340   nsec
   Steps 1-2      2940 2660 3090
   Steps 1-3      1854 1650 1880
   Steps 1-4      1660 1640 1750

Here is the dependence on process count using HPCC:

                   OMPI
             "out of the box"            optimized
 comm       -----------------       -----------------
 size         min   avg   max         min   avg   max

    2        2688  2719  2750        1750  1781  1812   nsec
    4        2812  2875  3000        1750  1802  1812
    6        2875  3050  3250        1687  1777  1812
    8        2875  3299  3625        1687  1773  1812
   10        2875  3447  3812        1687  1789  1812
   12        3063  3687  4375        1687  1796  1813
   16        2687  4093  5063        1500  1784  1875
   20        2812  4492  6000        1687  1788  1875
   24        3125  5026  6562        1562  1776  1875
   28        3250  5326  7250        1500  1764  1813
   32        3500  5830  8375        1562  1755  1875
   36        3750  6199  8938        1562  1755  1875
   40        4062  6753 10187        1500  1742  1812

Note:

At np=2, these optimizations lead to a 2× improvement in shared-memory latency.
Non-null messages incur more than a 10% penalty, which is largely addressed by our data-convertor optimization.
At larger np, we maintain our fast performance while OMPI "out of the box" keeps slowing down more and more and more.

M9000

Here are results for a 128-core M9000. I think the system has:

2 hardware threads per core (but we only use one hardware thread per core)
4 cores per socket
4 sockets per board
4 boards per (half?)
2 (halves?) per system

As one separates the sender and receiver, hardware latency increases. Here is the hierarchy:

                     latency (nsec)       bandwidth (Mbyte/sec)
                 out-of-box  fastpath     out-of-box  fastpath
   (on-socket?)      810        480          2000       2000
    (on-board?)     2050       1820          1900       1900
        (half?)     3030       2840          1680       1680
                    3150       2960          1660       1660

Note:

Latency benefits some hundreds of nsecs with fastpath.
This latency improvement is striking when the hardware latency is small, but less noticeable as as the hardware latency increases.
Bandwidth is not very sensitive to hardware latency (due to prefetch) and not at all to fast-path optimizations.

Here are HPCC pingpong latencies for increasing process counts:

             out-of-box             fastpath
     np  -----------------     -----------------
         min    avg    max     min    avg    max

      2  812    812    812     499    499    499
      4  874    921    999     437    494    562
      8  937   1847   2624     437   1249   1874
     16 1062   2430   2937     437   1557   1937
     32 1562   3850   5437     375   2211   2875
     64 2687   8329  15874     437   2535   3062
     80 3499  16854  41749     374   2647   3437
     96 3812  31159 100812     374   2717   3437
    128 5187 125774 335187     437   2793   3499

The improvements are tremendous:

At low np, latencies are low since sender and receiver can be colocated. Nevertheless, fast-path optimizations provided a nearly 2× improvement.
As np increases, fast-path latency also increases, but this is due to higher hardware latencies. Indeed, the "min" numbers even drop a little. The "max" fast-path numbers basically only represent the increase in hardware latency.
As np increases, OMPI "out of the box" latency suffers catastrophically. Not only is there the issue of more connections to poll, but the polling behaviors of non-participating processes wreak havoc on the performance of measured processes.
We can separate the two sources of latency degradation by putting the np-2 non-participating processes to sleep. In that case, latency only rises to about 10-20 µsec. So, polling of many connections causes a substantial rise in latency, while the disturbance of hard-poll loops on system performance is responsible for even more degradation.

Actually, even bandwidth benefits:

            out-of-box          fastpath
     np   --------------      -------------
           min  avg  max      min  avg  max

      2   2015 2034 2053     2028 2039 2051
      4   2002 2043 2077     1993 2032 2065
      8   1888 1959 2035     1897 1969 2088
     16   1863 1934 2046     1856 1937 2066
     32   1626 1796 2038     1581 1798 2068
     64   1557 1709 1969     1591 1729 2084
     80   1439 1619 1902     1561 1706 2059
     96   1281 1452 1722     1500 1689 2005
    128    677  835 1276      893 1671 1906

Here, we see that even bandwidth suffers "out of the box" as the number of hard-spinning processes increases. Note the degradation in "out-of-box" average bandwidths as np increases. In contrast, the "fastpath" average holds up well. (The np=128 min fastpath number 893 Mbyte/sec is poor, but analysis shows it to be a measurement outlier.)

`MPI_Sendrecv()`

We should also get these optimizations into MPI_Sendrecv() in order to speed up the HPCC "ring" results. E.g., here are latencies in µsecs for a performance measurement based on HPCC "ring" tests.

==================================================
np=64
                                            natural random

"out of box"                                   11.7   10.9
fast path                                       8.3    6.2
fast path and 100 warmups                       3.5    3.6
==================================================
np=128 latency
                                            natural random

"out of box"                                  242.9  226.1
fast path                                      56.6   37.0
fast path and 100 warmups                       4.2    4.1
==================================================

There happen to be two problems here:

We need fast-path optimizations in MPI_Sendrecv() for improved performance.
The MPI collective operation preceding the "ring" measurement has "ragged" exit times. So, the "ring" timing starts well before all of the processes have entered that measurement. This is a separate OMPI performance problem that must be handled as well for good HPCC results.

Open Issues

Here are some open issues:

Side effects. Should the sendi and recvi functions leave any side effects if they do not complete the specified operation?
To my taste, they should not.
Currently, however, the sendi function is expected to allocate a descriptor if it can, even if it cannot complete the entire send operation.
recvi: BTL and match header. An in-coming message starts with a "match header", with such data as MPI source rank, MPI communicator, and MPI tag for performing MPI message matching. Presumably, the BTL knows nothing about this header. Message matching is performed, for example, via PML callback functions. We are aggressively trying to optimize this code path, however, so we should consider alternatives to that approach.
One alternative is simply for the BTL to perform a byte-by-byte comparison between the received header and the specified header. The PML already tells the BTL how many bytes are in the header.
One problem with this approach is that the fast path would not be able to support the wildcard tag MPI_ANY_TAG.
Further, it leaves open the question how one extracts information (such as source or tag) from this header for the MPI_Status structure.
We can imagine a variety of solutions here, but so far we've implemented a very simple (even if architecturally distasteful) solution: we hardwire information (previously private to the PML) about the match header into the BTL.
That approach can be replaced with other solutions.
MPI_Sendrecv() support. As discussed earlier, we should support fast-path optimizations for "immediate" send-receive operations. Again, this may entail some movement of current OMPI architectural boundaries.

Other optimizations that are needed for good HPCC results include:

reducing the degradation due to hard spin waits
improving the performance of collective operations (which "artificially" degrade HPCC "ring" test results)

[OMPI devel] RFC: sm Latency

RFC: sm Latency

WHY (details)

WHAT (details)

1. Invoke the sm BTL sendi (send-immediate) function

2. Move the sendi call up higher in the PML

3. Introduce a BTL recvi call

4. Introduce an "immediate" data convertor

5. Introduce an "immediate" MPI_Sendrecv()

Performance Measurements: Before Optimization

OMPI versus Other MPIs

Scaling with Process Count

Performance Measurements: After Optimization

V20z

F6900

M9000

MPI_Sendrecv()

Open Issues

Reply via email to

RFC: `sm` Latency

1. Invoke the `sm` BTL `sendi` (send-immediate) function

2. Move the `sendi` call up higher in the PML

3. Introduce a BTL `recvi` call

5. Introduce an "immediate" `MPI_Sendrecv()`

`MPI_Sendrecv()`