Title: RFC: sm Latency
The growth in latency as the number of sm connections increases becomes ever more important on large SMPs and on increasingly prevalent many-core nodes. Other MPI implementations, such as Scali and Sun HPC ClusterTools 6, introduced such optimizations years ago. Performance measurements indicate that the speedups we can expect in OMPI with these optimizations range from 30% (np=2 measurements where hardware is the bottleneck) to 2× (np=2 measurements where software is the bottleneck) to over 10× (large np).

WHAT (details)

Introduce an optimized "fast path" for "immediate" sends and receives. Several actions are recommended here.

1. Invoke the sm BTL sendi (send-immediate) function

Each BTL is allowed to define a "send immediate" (sendi) function. A BTL is not required to do so, however, in which case the PML calls the standard BTL send function. A sendi function has already been written for sm, but it has not been used due to insufficient testing. The function should be reviewed, commented back in (re-enabled), tested, and used; a rough sketch of what such a function does is given below. The changes are:
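To illustrate the idea, here is a minimal, hypothetical sketch of a shared-memory send-immediate function. It is not the actual OMPI source: the fragment and FIFO helpers (sm_alloc_fragment, sm_fifo_write, and so on) are stand-ins for the real sm machinery, and the eager limit is an assumed constant. The key point is that sendi either delivers the whole message in one shot or declines, in which case the PML falls back to the ordinary send path.

    #include <stddef.h>
    #include <string.h>

    #define SM_SUCCESS        0
    #define SM_ERR_NOT_SENT  -1    /* "declined": PML should fall back to btl_send() */
    #define SM_EAGER_LIMIT  4096   /* assumed one-fragment cutoff, for illustration  */

    typedef struct { void *payload; size_t size; int tag; } sm_fragment_t;

    /* Hypothetical stand-ins for the real shared-memory machinery. */
    extern sm_fragment_t *sm_alloc_fragment(void);                      /* pop from free list  */
    extern void           sm_free_fragment(sm_fragment_t *frag);        /* return to free list */
    extern int             sm_fifo_write(int peer, sm_fragment_t *frag); /* post to peer FIFO   */

    int sm_btl_sendi(int peer, const void *buf, size_t size, int tag)
    {
        sm_fragment_t *frag;

        /* Only a message that fits in a single fragment qualifies. */
        if (size > SM_EAGER_LIMIT) {
            return SM_ERR_NOT_SENT;
        }

        /* No free fragment?  Decline and let the traditional path handle it. */
        if (NULL == (frag = sm_alloc_fragment())) {
            return SM_ERR_NOT_SENT;
        }

        /* Pack the user data and hand the fragment to the receiver's FIFO. */
        memcpy(frag->payload, buf, size);
        frag->size = size;
        frag->tag  = tag;
        if (0 != sm_fifo_write(peer, frag)) {
            sm_free_fragment(frag);
            return SM_ERR_NOT_SENT;
        }
        return SM_SUCCESS;
    }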
2. Move the sendi call up higher in the PML

Profiling pingpong tests shows that not much time is spent in the sm BTL itself. Rather, the PML consumes a lot of time preparing a "send request". While this complex data structure is needed to track the progress of a long message that will be sent in multiple chunks and progressed over multiple entries to and exits from the MPI library, managing this large data structure for an "immediate" send (one chunk, one call) is overkill. Latency can be reduced noticeably if one bypasses this data structure. This means invoking the sendi function as early as possible in the PML, as sketched below. The changes are:
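The following sketch, with hypothetical names (pml_send, EAGER_LIMIT, the request helpers), shows where the early sendi attempt would sit: the immediate path is tried on entry to the PML, and the send-request machinery is touched only if that attempt declines.

    #include <stddef.h>

    #define SUCCESS      0
    #define EAGER_LIMIT  4096   /* assumed cutoff for "immediate" sends */

    /* Hypothetical stand-ins for the PML's send-request machinery. */
    typedef struct send_request send_request_t;
    extern send_request_t *alloc_send_request(void);
    extern void             init_send_request(send_request_t *req, const void *buf,
                                              size_t bytes, int dst, int tag);
    extern int              schedule_send(send_request_t *req);

    /* Hypothetical immediate-send hook; returns SUCCESS only if the message
     * was delivered completely within this call. */
    extern int btl_sendi(int dst, const void *buf, size_t bytes, int tag);

    int pml_send(const void *buf, size_t bytes, int dst, int tag)
    {
        /* Fast path: try to complete the send right away, before paying for
         * the send-request data structure. */
        if (bytes <= EAGER_LIMIT && SUCCESS == btl_sendi(dst, buf, bytes, tag)) {
            return SUCCESS;
        }

        /* Traditional path: build the full send request and let the PML
         * progress the message over multiple chunks / library entries. */
        send_request_t *req = alloc_send_request();
        init_send_request(req, buf, bytes, dst, tag);
        return schedule_send(req);
    }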
3. Introduce a BTL recvi call

While optimizing the send side of a pingpong operation is helpful, it is less than half the job. At least as many savings are possible on the receive side. Corresponding to what we've done on the send side, on the receive side we can attempt, as soon as we've entered the PML, to call a BTL recvi (receive-immediate) function, bypassing the creation of a complex "receive request" data structure that is not needed if the receive can be completed immediately. Further, we can perform directed polling. OMPI pingpong latencies grow significantly as the number of sm connections increases, while competitors (Scali, in any case) show absolutely flat latencies with increasing np. The recvi function could check one connection for the specified receive and exit quickly if that message is found. A BTL is granted considerable latitude in the proposed recvi function. The principal requirement is that recvi either completes the specified receive entirely or else behaves as if the function had not been called at all. (That is, one should be able to revert to the traditional code path without having to worry about any recvi side effects. So, for example, if the recvi function encounters any fragments being returned to the process, it is permitted to return those fragments to the free list.) While those are the "hard requirements" for recvi, there are also some loose guidelines. Mostly, it is understood that recvi should return "quickly" (a loose term to be interpreted by the BTL). If recvi can quickly complete the specified receive, great! If not, it should return control to the PML, which can then execute the traditional code path, which can handle long messages (multiple chunks, multiple entries into the MPI library) and execute other "progress" functions. A sketch of such a function appears below. The changes are:
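Here is a minimal, hypothetical sketch of a recvi with directed polling. The FIFO helpers and structures are stand-ins, not the real sm code; the point is the contract described above: either the posted receive completes entirely inside the call, or the function leaves no visible side effects so the PML can fall back to the traditional path.

    #include <stddef.h>
    #include <string.h>

    #define SM_SUCCESS         0
    #define SM_ERR_NOT_RECVD  -1   /* PML falls back to the traditional path */

    typedef struct { size_t size; int tag; char payload[]; } sm_fragment_t;

    /* Hypothetical stand-ins for the real shared-memory machinery. */
    extern sm_fragment_t *sm_fifo_peek(int peer);   /* look at, but do not consume */
    extern void           sm_fifo_pop(int peer);    /* consume the peeked fragment */
    extern void           sm_free_fragment(sm_fragment_t *frag);

    int sm_btl_recvi(int peer, int tag, void *buf, size_t max_size)
    {
        /* Directed polling: check only the one connection named by the receive,
         * rather than sweeping all np-1 incoming FIFOs. */
        sm_fragment_t *frag = sm_fifo_peek(peer);

        /* Nothing there, wrong message, or not completable in one shot:
         * return without side effects. */
        if (NULL == frag || frag->tag != tag || frag->size > max_size) {
            return SM_ERR_NOT_RECVD;
        }

        /* Complete the receive entirely within this call. */
        memcpy(buf, frag->payload, frag->size);
        sm_fifo_pop(peer);
        sm_free_fragment(frag);
        return SM_SUCCESS;
    }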
4. Introduce an "immediate" data convertor

One of our aims here is to reduce latency by bypassing expensive PML send and receive request data structures. Again, these structures are useful when we intend to complete a message over multiple chunks and multiple MPI library invocations, but they are overkill for a message that can be completed all at once. The same is true of data convertors. Convertors pack user data into shared-memory buffers on the send side and unpack them on the receive side. Convertors allow a message to be sent in multiple chunks, over the course of multiple unrelated MPI calls, and for noncontiguous datatypes. These sophisticated data structures are overkill in some important cases, such as messages that are handled in a single chunk, in a single MPI call, and consist of a single contiguous block of data. While data convertors are not typically too expensive, for shared-memory latency, where all other costs have been pared back to a minimum, convertors become noticeable -- around 10%. Therefore, we recognize special cases where we can use bare-bones, minimal data convertors. In these cases, we initialize the convertor structure minimally -- e.g., a buffer address, a number of bytes to copy, and a flag indicating that all other fields are uninitialized (see the sketch below). If this is not possible (e.g., because a noncontiguous user-derived datatype is being used), the "immediate" send or receive uses data convertors normally. The changes are:
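A minimal sketch of the idea, with a hypothetical structure and field names: for a single contiguous block handled in one chunk, only an address, a byte count, and a flag are filled in, and packing degenerates to a plain memcpy.

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical, heavily simplified convertor.  In the "immediate" case only
     * the first three fields are valid; everything a full convertor would need
     * for noncontiguous or multi-chunk data is left untouched. */
    typedef struct {
        bool        immediate;   /* true: only base/bytes below are meaningful */
        const void *base;        /* user buffer                                */
        size_t      bytes;       /* total bytes to copy                        */
        /* ... many more fields in a full convertor, deliberately left
         *     uninitialized on the fast path ... */
    } convertor_t;

    static inline void convertor_init_immediate(convertor_t *cv,
                                                const void *buf, size_t bytes)
    {
        cv->immediate = true;
        cv->base      = buf;
        cv->bytes     = bytes;
    }

    /* Pack the message into a shared-memory destination buffer. */
    static inline void convertor_pack_immediate(const convertor_t *cv, void *dst)
    {
        /* The immediate case degenerates to a straight copy. */
        memcpy(dst, cv->base, cv->bytes);
    }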
5. Introduce an "immediate" MPI_Sendrecv()

The optimizations described here should be extended to MPI_Sendrecv() operations. In particular, while MPI_Send() and MPI_Recv() optimizations improve HPCC "pingpong" latencies, we need MPI_Sendrecv() optimizations to improve HPCC "ring" latencies. One challenge is the current OMPI MPI/PML interface. Today, the OMPI MPI layer breaks a Sendrecv call up into Irecv/Send/Wait (sketched below). This would seem to defeat fast-path optimizations, at least for the receive. Some options include:
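For reference, here is a simplified sketch (error handling omitted) of the Irecv/Send/Wait decomposition mentioned above. Because the receive is posted as a nonblocking request up front, a full receive request is always created, which is exactly what a recvi-style fast path tries to avoid.

    #include <mpi.h>

    /* How MPI_Sendrecv is effectively decomposed today (simplified). */
    static int sendrecv_decomposed(const void *sbuf, int scount, MPI_Datatype stype,
                                   int dst, int stag,
                                   void *rbuf, int rcount, MPI_Datatype rtype,
                                   int src, int rtag,
                                   MPI_Comm comm, MPI_Status *status)
    {
        MPI_Request req;

        MPI_Irecv(rbuf, rcount, rtype, src, rtag, comm, &req); /* always builds a recv request  */
        MPI_Send(sbuf, scount, stype, dst, stag, comm);        /* send side may still fast-path */
        return MPI_Wait(&req, status);                         /* receive completes only here   */
    }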
Performance Measurements: Before Optimization

More measurements are desirable, but here is a sampling of data from platforms I happened to have access to. This data characterizes OMPI today, without fast-path optimizations.

OMPI versus Other MPIs

Here are pingpong latencies, in µsec, measured with the OSU latency test for 0 and 8 bytes:

            0-byte   8-byte
  OMPI       0.74     0.84   µsec
  MPICH      0.70     0.77

We see OMPI lagging MPICH. Scali and HP MPI are presumably considerably faster, but I have no recent data. Among other things, one can see that there is about a 10% penalty for invoking data convertors.

Scaling with Process Count

Here are HPCC pingpong latencies from a different, older platform. Though only two processes participate in the pingpong, the HPCC test reports the latency for different numbers of processes in the job. We see that OMPI performance slows dramatically as the number of processes is increased. Scali (data not available) does not show such a slowdown.

  np    min     avg     max
   2   2.688   2.719   2.750   µsec
   4   2.812   2.875   3.000
   6   2.875   3.050   3.250
   8   2.875   3.299   3.625
  10   2.875   3.447   3.812
  12   3.063   3.687   4.375
  16   2.687   4.093   5.063
  20   2.812   4.492   6.000
  24   3.125   5.026   6.562
  28   3.250   5.326   7.250
  32   3.500   5.830   8.375
  36   3.750   6.199   8.938
  40   4.062   6.753  10.187

The data show large min-max variations in latency. These variations happen to depend on sender and receiver ranks. Here are latencies (rounded down to the nearest µsec) for the np=40 case as a function of sender and receiver rank:

  [40x40 matrix of pairwise latencies: roughly 9-10 µsec when both ranks are small, falling to roughly 4-5 µsec when both ranks are large]

We see that there is a strong dependence on process rank. Presumably, this is due to our polling loop. That is, even if we receive our message, we still have to poll the higher-numbered ranks before we complete the receive operation.

Performance Measurements: After Optimization

We consider three metrics:
We report data for:
The data are from machines that I just happened to have available. There is a bit of noise in these results, but the implications, based on these and other measurements, are:
V20z

Here are results for a V20z (burl-ct-v20z-11):

                HPCC   OSU0   OSU8
  out of box     838    770    850   nsec
  Steps 1-2      862    770    860
  Steps 1-3      670    610    670
  Steps 1-4      642    580    610

F6900

Here are np=2 results from a 1.05-GHz (1.2?) UltraSPARC-IV F6900 server:

                HPCC   OSU0   OSU8
  out of box    3430   2770   3340   nsec
  Steps 1-2     2940   2660   3090
  Steps 1-3     1854   1650   1880
  Steps 1-4     1660   1640   1750

Here is the dependence on process count using HPCC:

         OMPI "out of the box"      optimized comm
         ---------------------    ------------------
  size    min     avg     max      min    avg    max
    2    2688    2719    2750     1750   1781   1812   nsec
    4    2812    2875    3000     1750   1802   1812
    6    2875    3050    3250     1687   1777   1812
    8    2875    3299    3625     1687   1773   1812
   10    2875    3447    3812     1687   1789   1812
   12    3063    3687    4375     1687   1796   1813
   16    2687    4093    5063     1500   1784   1875
   20    2812    4492    6000     1687   1788   1875
   24    3125    5026    6562     1562   1776   1875
   28    3250    5326    7250     1500   1764   1813
   32    3500    5830    8375     1562   1755   1875
   36    3750    6199    8938     1562   1755   1875
   40    4062    6753   10187     1500   1742   1812

Note:
M9000

Here are results for a 128-core M9000. I think the system has:
As one separates the sender and receiver, hardware latency increases. Here is the hierarchy:

                      latency (nsec)          bandwidth (Mbyte/sec)
                  out-of-box   fastpath      out-of-box   fastpath
  (on-socket?)        810         480           2000        2000
  (on-board?)        2050        1820           1900        1900
  (half?)            3030        2840           1680        1680
                     3150        2960           1660        1660

Note:
Here are HPCC pingpong latencies for increasing process counts:

             out-of-box                  fastpath
   np    min      avg      max      min     avg     max
    2     812      812      812      499     499     499
    4     874      921      999      437     494     562
    8     937     1847     2624      437    1249    1874
   16    1062     2430     2937      437    1557    1937
   32    1562     3850     5437      375    2211    2875
   64    2687     8329    15874      437    2535    3062
   80    3499    16854    41749      374    2647    3437
   96    3812    31159   100812      374    2717    3437
  128    5187   125774   335187      437    2793    3499

The improvements are tremendous:
Actually, even bandwidth benefits:

            out-of-box            fastpath
   np    min    avg    max     min    avg    max
    2   2015   2034   2053    2028   2039   2051
    4   2002   2043   2077    1993   2032   2065
    8   1888   1959   2035    1897   1969   2088
   16   1863   1934   2046    1856   1937   2066
   32   1626   1796   2038    1581   1798   2068
   64   1557   1709   1969    1591   1729   2084
   80   1439   1619   1902    1561   1706   2059
   96   1281   1452   1722    1500   1689   2005
  128    677    835   1276     893   1671   1906

Here, we see that even bandwidth suffers "out of the box" as the number of hard-spinning processes increases. Note the degradation in "out-of-box" average bandwidths as np increases. In contrast, the "fastpath" average holds up well. (The np=128 min fastpath number of 893 Mbyte/sec is poor, but analysis shows it to be a measurement outlier.)

MPI_Sendrecv()

We should also get these optimizations into MPI_Sendrecv() in order to speed up the HPCC "ring" results. E.g., here are latencies in µsec for a performance measurement based on HPCC "ring" tests:

  np=64                        natural   random
  "out of box"                   11.7     10.9
  fast path                       8.3      6.2
  fast path and 100 warmups       3.5      3.6

  np=128                       natural   random
  "out of box"                  242.9    226.1
  fast path                      56.6     37.0
  fast path and 100 warmups       4.2      4.1

There happen to be two problems here:
Open Issues

Here are some open issues:
Other optimizations that are needed for good HPCC results include: