If all write to the same destination at the same time, yes. On older systems you could start to see degradation around 6 procs, but things held up OK further out. My guess is that you want one such queue per n procs, where n might be 8 (one would have to experiment), so that polling costs stay low and memory contention remains manageable.

Rich
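For reference, here is a minimal sketch of the per-destination list described in the quoted message below: senders push match envelopes with an atomic compare-and-swap, and the single owning process drains its own list with an atomic exchange. This is illustrative only, not Open MPI code. The names and the use of C11 atomics are assumptions, a real shared-memory version would store offsets rather than raw pointers (the segment maps at different addresses in each process), and the drained chain comes back newest-first, so the reader may need to reverse it to preserve ordering.

    /* Illustrative sketch: one inbound list per receiving process.
     * Many senders push concurrently; only the owner drains. */
    #include <stdatomic.h>
    #include <stddef.h>

    typedef struct envelope {
        struct envelope *next;   /* link used only while on the inbound list */
        int              src;    /* sender rank (placeholder payload) */
        int              tag;
    } envelope_t;

    typedef struct {
        _Atomic(envelope_t *) head;   /* lives in shared memory, one per process */
    } inbound_list_t;

    /* Any sender may call this concurrently. */
    static void inbound_push(inbound_list_t *q, envelope_t *e)
    {
        envelope_t *old = atomic_load_explicit(&q->head, memory_order_relaxed);
        do {
            e->next = old;
        } while (!atomic_compare_exchange_weak_explicit(
                     &q->head, &old, e,
                     memory_order_release, memory_order_relaxed));
    }

    /* Only the owning (receiving) process calls this: detach everything in
     * one atomic exchange, then walk the detached chain privately. */
    static envelope_t *inbound_drain(inbound_list_t *q)
    {
        return atomic_exchange_explicit(&q->head, NULL, memory_order_acquire);
    }

Each receiver polls only its own head, so the number of places to poll stays constant as np grows; the contention question above is about how many senders hit that one head word at once.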
----- Original Message -----
From: devel-boun...@open-mpi.org <devel-boun...@open-mpi.org>
To: Open MPI Developers <de...@open-mpi.org>
Sent: Tue Jan 20 06:56:53 2009
Subject: Re: [OMPI devel] RFC: sm Latency

Richard Graham wrote:
> First, the performance improvements look really nice. A few questions:
>
> - How much of an abstraction violation does this introduce? It looks like the btl needs to start "knowing" about MPI-level semantics; currently, the btl is purposefully ULP-agnostic. I ask for two reasons:
>   - you mention having the btl look at the match header (if I understood correctly);
>   - it is not clear to me what you mean by returning the header to the list if the irecv does not complete. If it does not complete, why not just pass the header back for further processing, if all this is happening at the pml level?
>
> - The measurements seem to be very dual-process specific. Have you looked at the impact of these changes on other applications at the same process count? "Real" apps would be interesting, but even HPL would be a good start.
>
> The current sm implementation is aimed only at small SMP node counts, which were really the only relevant systems when this code was written 5 years ago. For large core counts there is a rather simple change that is easy to implement and will give you flat scaling for the sort of tests you are running. If you replace the fifos with a single linked list per process in shared memory, with senders to this process adding match envelopes atomically and each process reading its own linked list (multiple writers and a single reader in the non-threaded situation), there will be only one place to poll, regardless of the number of procs involved in the run. One still needs other optimizations to lower the absolute latency - perhaps what you have suggested. If one really has all N procs trying to write to the same fifo at once, performance will stink because of contention, but most apps don't have that behaviour.

If I remember correctly, you can get a slowdown with the method you mention above even with a handful (4-6) of processes writing to the same destination.

--td

> Rich
>
> On 1/17/09 1:48 AM, "Eugene Loh" <eugene....@sun.com> wrote:

------------------------------------------------------------------------
RFC: sm Latency

WHAT: Introducing optimizations to reduce ping-pong latencies over the sm BTL.

WHY: This is a visible benchmark of MPI performance. We can improve shared-memory latencies from 30% (if hardware latency is the limiting factor) to 2× or more (if MPI software overhead is the limiting factor). At high process counts, the improvement can be 10× or more.

WHERE: Somewhat in the sm BTL, but very importantly also in the PML. Changes can be seen in ssh://www.open-mpi.org/~tdd/hg/fastpath.

WHEN: Upon acceptance. In time for OMPI 1.4.

TIMEOUT: February 6, 2009.
------------------------------------------------------------------------
This RFC is being submitted by eugene....@sun.com.

WHY (details)

The sm BTL typically has the lowest hardware latencies of any BTL. Therefore, any OMPI software overhead we otherwise tolerate becomes glaringly obvious in sm latency measurements.

In particular, MPI pingpong latencies are oft-cited performance benchmarks, popular indications of the quality of an MPI implementation.
Competitive vendor MPIs optimize this metric aggressively, both for np=2 pingpongs and for pairwise pingpongs at high np (as in the popular HPCC performance test suite).

Performance reported by HPCC includes:

  * MPI_Send()/MPI_Recv() pingpong latency.
  * MPI_Send()/MPI_Recv() pingpong latency as the number of connections grows.
  * MPI_Sendrecv() latency.

The slowdown in latency as the number of sm connections grows becomes increasingly important on large SMPs and on ever more prevalent many-core nodes.

Other MPI implementations, such as Scali and Sun HPC ClusterTools 6, introduced such optimizations years ago.

Performance measurements indicate that the speedups we can expect in OMPI with these optimizations range from 30% (np=2 measurements where hardware is the bottleneck) to 2× (np=2 measurements where software is the bottleneck) to over 10× (large np).

WHAT (details)

Introduce an optimized "fast path" for "immediate" sends and receives. Several actions are recommended here.

1. Invoke the sm BTL sendi (send-immediate) function

Each BTL is allowed to define a "send immediate" (sendi) function. A BTL is not required to do so, however, in which case the PML calls the standard BTL send function.

A sendi function has already been written for sm, but it has not been used due to insufficient testing. The function should be reviewed, commented in, tested, and used.

The changes are:

  * File: ompi/mca/btl/sm/btl_sm.c
    Declaration/Definition: mca_btl_sm
    Comment in the mca_btl_sm_sendi symbol instead of the NULL placeholder so that the already existing sendi function will be discovered and used by the PML.

  * Function: mca_btl_sm_sendi()
    Review the existing sm sendi code. My suggestions include:
      o Drop the test against the eager limit, since the PML calls this function only when the eager limit is respected.
      o Make sure the function has no side effects in the case where it does not complete. See "Open Issues", the final section of this document, for further discussion of "side effects".
    Mostly, I have reviewed the code and believe it is already suitable for use.

2. Move the sendi call up higher in the PML

Profiling pingpong tests, we find that not much time is spent in the sm BTL. Rather, the PML consumes a lot of time preparing a "send request". While this complex data structure is needed to track the progress of a long message that will be sent in multiple chunks and progressed over multiple entries to and exits from the MPI library, managing it for an "immediate" send (one chunk, one call) is overkill. Latency can be reduced noticeably if one bypasses this data structure, which means invoking the sendi function as early as possible in the PML.

The changes are:

  * File: ompi/mca/pml/ob1/pml_ob1_isend.c
    Function: mca_pml_ob1_send()
    As soon as we enter the PML send function, try to call the BTL sendi function. If this fails for whatever reason, continue with the traditional PML send code path. If it succeeds, exit the PML and return to the calling layer without ever having wrestled with the PML send-request data structure. For better software management, the attempt to find and use a BTL sendi function can be organized into a new mca_pml_ob1_sendi() function (a sketch of the intended control flow follows this list).

  * File: ompi/mca/pml/ob1/pml_ob1_sendreq.c
    Function: mca_pml_ob1_send_request_start_copy()
    Remove this function's attempt to call the BTL sendi function, since we have already tried to do so higher up in the PML.
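The following is a rough sketch of that control flow, not the actual ob1 or BTL code: the argument lists are simplified (the real sendi also takes an endpoint, a convertor, the match header, and ordering/flag arguments), and every name here other than mca_pml_ob1_sendi() is invented for illustration.

    /* Illustrative sketch only -- simplified types and signatures. */
    #include <stddef.h>

    #define SKETCH_SUCCESS   0
    #define SKETCH_FALLBACK (-1)

    typedef int (*sendi_fn_t)(const void *buf, size_t size, int tag);

    typedef struct {
        sendi_fn_t btl_sendi;   /* NULL when the BTL offers no immediate send */
    } btl_module_t;

    /* Stand-in for the traditional, send-request-based path. */
    static int send_request_start(btl_module_t *btl, const void *buf,
                                  size_t size, int tag)
    {
        (void)btl; (void)buf; (void)size; (void)tag;
        return SKETCH_SUCCESS;   /* full send-request machinery goes here */
    }

    /* Try the immediate path; it must leave no side effects if it fails. */
    static int mca_pml_ob1_sendi(btl_module_t *btl, const void *buf,
                                 size_t size, int tag)
    {
        if (NULL == btl->btl_sendi) {
            return SKETCH_FALLBACK;              /* no fast path registered */
        }
        return btl->btl_sendi(buf, size, tag);   /* success or fall back */
    }

    int pml_send(btl_module_t *btl, const void *buf, size_t size, int tag)
    {
        /* Fast path first: one chunk, one call, no send request allocated. */
        if (SKETCH_SUCCESS == mca_pml_ob1_sendi(btl, buf, size, tag)) {
            return SKETCH_SUCCESS;
        }
        /* Otherwise fall back to the traditional code path. */
        return send_request_start(btl, buf, size, tag);
    }

The key design point is that the fast-path attempt sits at the very top of the PML send routine, before any send request is allocated.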
3. Introduce a BTL recvi call

While optimizing the send side of a pingpong operation is helpful, it is less than half the job. At least as many savings are possible on the receive side.

Corresponding to what we have done on the send side, on the receive side we can attempt, as soon as we have entered the PML, to call a BTL recvi (receive-immediate) function, bypassing the creation of a complex "receive request" data structure that is not needed if the receive can be completed immediately.

Further, we can perform directed polling. OMPI pingpong latencies grow significantly as the number of sm connections increases, while competitors (Scali, in any case) show absolutely flat latencies with increasing np. The recvi function could check one connection for the specified receive and exit quickly if that message is found.

A BTL is granted considerable latitude in the proposed recvi functions. The principal requirement is that recvi either completes the specified receive entirely or else behaves as if the function were not called at all. (That is, one should be able to revert to the traditional code path without having to worry about any recvi side effects. So, for example, if the recvi function encounters any fragments being returned to the process, it is permitted to return those fragments to the free list.)

While those are the "hard requirements" for recvi, there are also some loose guidelines. Mostly, it is understood that recvi should return "quickly" (a loose term to be interpreted by the BTL). If recvi can quickly complete the specified receive, great! If not, it should return control to the PML, which can then execute the traditional code path, which handles long messages (multiple chunks, multiple entries into the MPI library) and executes other "progress" functions.

The changes are (a sketch of the resulting receive-side flow follows this list):

  + File: ompi/mca/btl/btl.h
    In this file, we add a typedef declaration for what a generic recvi should look like:

        typedef int (*mca_btl_base_module_recvi_fn_t)();

    We also add a btl_recvi field so that a BTL can register its recvi function, if any.

  + Files:
      ompi/mca/btl/elan/btl_elan.c
      ompi/mca/btl/gm/btl_gm.c
      ompi/mca/btl/mx/btl_mx.c
      ompi/mca/btl/ofud/btl_ofud.c
      ompi/mca/btl/openib/btl_openib.c
      ompi/mca/btl/portals/btl_portals.c
      ompi/mca/btl/sctp/btl_sctp.c
      ompi/mca/btl/self/btl_self.c
      ompi/mca/btl/sm/btl_sm.c
      ompi/mca/btl/tcp/btl_tcp.c
      ompi/mca/btl/template/btl_template.c
      ompi/mca/btl/udapl/btl_udapl.c
    Each BTL must add a recvi field to its module. In most cases, a BTL will not define a recvi function, and the field will be set to NULL.

  + File: ompi/mca/btl/sm/btl_sm.c
    Function: mca_btl_sm_recvi()
    For the sm BTL, we set the field to the name of the BTL's recvi function, mca_btl_sm_recvi. We also add code to define the behavior of the function.

  + File: ompi/mca/btl/sm/btl_sm.h
    Prototype: mca_btl_sm_recvi()
    We also add a prototype for the new function.

  + File: ompi/mca/pml/ob1/pml_ob1_irecv.c
    Function: mca_pml_ob1_recv()
    As soon as we enter the PML, we try to find and use a BTL's recvi function. If we succeed, we can exit the PML without ever having invoked the heavy-duty PML receive-request data structure. If we fail, we simply revert to the traditional PML receive code path, without having to worry about any side effects the failed recvi might have left. It is helpful to contain the recvi attempt in a new mca_pml_ob1_recvi() function, which we add.

  + File: ompi/class/ompi_fifo.h
    Function: ompi_fifo_probe_tail()
    We don't want recvi to leave any side effects if it encounters a message it is not prepared to handle. Therefore, we need to be able to see what is on a FIFO without popping that entry off the FIFO, so we add this new function, which probes the FIFO without disturbing it.
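As an editorial illustration of the probe-then-consume idea (not the actual sm FIFO or ob1 code; the circular-buffer layout, the names, and the byte-wise header comparison are all assumptions made for the sketch): the receive-immediate path peeks at the next fragment, consumes it only if it matches, and otherwise leaves the FIFO exactly as it found it.

    /* Illustrative sketch: a single-writer/single-reader circular FIFO with
     * a non-destructive probe, plus a recvi-style helper that consumes an
     * entry only when it matches the expected header. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    #define FIFO_DEPTH 128   /* power of two, for the index mask below */

    typedef struct {
        void *slot[FIFO_DEPTH];
        volatile size_t head;    /* advanced by the sender   */
        volatile size_t tail;    /* advanced by the receiver */
    } fifo_t;

    /* Look at the next entry without removing it; NULL means "empty". */
    static void *fifo_probe_tail(fifo_t *f)
    {
        if (f->tail == f->head) {
            return NULL;
        }
        return f->slot[f->tail & (FIFO_DEPTH - 1)];
    }

    /* Remove the entry previously returned by fifo_probe_tail(). */
    static void fifo_pop_tail(fifo_t *f)
    {
        f->tail++;
    }

    /* recvi-style helper: complete the receive only if the next fragment's
     * match header equals the expected one; otherwise touch nothing, so the
     * traditional code path later sees exactly the same FIFO state. */
    static bool recvi_try(fifo_t *f, const void *expected_hdr, size_t hdr_len,
                          void *user_buf, size_t buf_len)
    {
        unsigned char *frag = fifo_probe_tail(f);
        if (NULL == frag) {
            return false;                          /* nothing to receive yet */
        }
        if (0 != memcmp(frag, expected_hdr, hdr_len)) {
            return false;                          /* not ours: no side effects */
        }
        memcpy(user_buf, frag + hdr_len, buf_len); /* unpack the payload */
        fifo_pop_tail(f);                          /* consume only on success */
        return true;
    }

A byte-wise header comparison like this is the simple matching option discussed under "Open Issues" below; note that it cannot handle wildcards such as MPI_ANY_TAG.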
4. Introduce an "immediate" data convertor

One of our aims here is to reduce latency by bypassing expensive PML send and receive request data structures. Again, these structures are useful when we intend to complete a message over multiple chunks and multiple MPI library invocations, but they are overkill for a message that can be completed all at once.

The same is true of data convertors. Convertors pack user data into shared-memory buffers, or unpack it on the receive side. Convertors allow a message to be sent in multiple chunks, over the course of multiple unrelated MPI calls, and for noncontiguous datatypes. These sophisticated data structures are overkill in some important cases, such as messages that are handled in a single chunk, in a single MPI call, and consist of a single contiguous block of data.

While data convertors are not typically too expensive, for shared-memory latency, where all other costs have been pared back to a minimum, convertors become noticeable -- around 10%.

Therefore, we recognize special cases where we can have barebones, minimal data convertors. In these cases, we initialize the convertor structure minimally -- e.g., a buffer address, a number of bytes to copy, and a flag indicating that all other fields are uninitialized. If this is not possible (e.g., because a noncontiguous user-derived datatype is being used), the "immediate" send or receive uses data convertors normally.

The changes are (a sketch of the minimal-convertor idea follows this list):

  # File: ompi/datatype/convertor.h
    First, we add a new flag to the convertor flags,

        #define CONVERTOR_IMMEDIATE 0x10000000

    to identify a data convertor that has been initialized only minimally. Further, we add three new functions:
      * ompi_convertor_immediate(): try to form an "immediate" convertor
      * ompi_convertor_immediate_pack(): use an "immediate" convertor to pack
      * ompi_convertor_immediate_unpack(): use an "immediate" convertor to unpack

  # File: ompi/mca/btl/sm/btl_sm.c
    Functions: mca_btl_sm_sendi() and mca_btl_sm_recvi()
    Use the "immediate" convertor routines to pack/unpack.

  # Files: ompi/mca/pml/ob1/pml_ob1_isend.c and ompi/mca/pml/ob1/pml_ob1_irecv.c
    Have the PML fast path try to construct an "immediate" convertor.
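Here is a hedged sketch of what a minimally initialized convertor might look like. This is not the ompi_convertor_t structure or its API; apart from the flag name and value quoted above, the struct layout and function names are assumptions made purely for illustration. The point is that for a contiguous, single-chunk message the "convertor" reduces to an address, a length, and a memcpy.

    /* Illustrative sketch only -- not the real ompi_convertor_t interface. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    #define CONVERTOR_IMMEDIATE 0x10000000   /* flag value taken from the RFC text */

    typedef struct {
        unsigned int flags;    /* CONVERTOR_IMMEDIATE => only addr/bytes are valid */
        void        *addr;     /* start of the contiguous user buffer              */
        size_t       bytes;    /* total bytes to copy                              */
    } immediate_convertor_t;

    /* Succeeds only for the easy case: contiguous data, single chunk. */
    static bool convertor_immediate_init(immediate_convertor_t *conv,
                                         void *addr, size_t bytes, bool contiguous)
    {
        if (!contiguous) {
            return false;      /* caller falls back to the full convertor machinery */
        }
        conv->flags = CONVERTOR_IMMEDIATE;
        conv->addr  = addr;
        conv->bytes = bytes;
        return true;
    }

    /* Packing with an "immediate" convertor is a single memcpy. */
    static void convertor_immediate_pack(const immediate_convertor_t *conv, void *dst)
    {
        memcpy(dst, conv->addr, conv->bytes);
    }

    /* Unpacking is the mirror image. */
    static void convertor_immediate_unpack(const immediate_convertor_t *conv, const void *src)
    {
        memcpy(conv->addr, src, conv->bytes);
    }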
5. Introduce an "immediate" MPI_Sendrecv()

The optimizations described here should be extended to MPI_Sendrecv() operations. In particular, while the MPI_Send() and MPI_Recv() optimizations improve HPCC "pingpong" latencies, we need MPI_Sendrecv() optimizations to improve HPCC "ring" latencies.

One challenge is the current OMPI MPI/PML interface. Today, the OMPI MPI layer breaks a Sendrecv call up into Irecv/Send/Wait (sketched schematically below). This would seem to defeat fast-path optimizations, at least for the receive. Some options include:

  * allow the MPI layer to call "fast path" operations
  * have the PML layer provide a Sendrecv interface
  * have the MPI layer emit Isend/Recv/Wait and see how effectively one can optimize the Isend operation in the PML for the "immediate" case
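To make the challenge concrete, here is a schematic of the Irecv/Send/Wait decomposition described above. It is an editorial simplification (no error handling, statuses ignored, and the real OMPI implementation sits below the MPI API rather than on top of it). Because the receive is posted as a nonblocking MPI_Irecv, it goes through the request machinery and never reaches a blocking-receive fast path; the third option above would flip which side pays that cost.

    /* Schematic of the Irecv/Send/Wait decomposition of MPI_Sendrecv. */
    #include <mpi.h>

    int sendrecv_schematic(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                           int dest, int sendtag,
                           void *recvbuf, int recvcount, MPI_Datatype recvtype,
                           int source, int recvtag, MPI_Comm comm)
    {
        MPI_Request req;

        /* The receive is posted first, nonblocking, so it cannot use a
         * blocking-receive fast path. */
        MPI_Irecv(recvbuf, recvcount, recvtype, source, recvtag, comm, &req);

        /* The send is blocking and can use the sendi fast path. */
        MPI_Send(sendbuf, sendcount, sendtype, dest, sendtag, comm);

        /* Completion of the receive goes through the request machinery. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        return MPI_SUCCESS;
    }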
Performance Measurements: Before Optimization

More measurements are desirable, but here is a sampling of data from platforms I happened to have access to. This data characterizes OMPI today, without fast-path optimizations.

OMPI versus Other MPIs

Here are pingpong latencies, in μsec, measured with the OSU latency test for 0 and 8 bytes:

            0-byte   8-byte
  OMPI       0.74     0.84   μsec
  MPICH      0.70     0.77

We see OMPI lagging MPICH. Scali and HP MPI are presumably considerably faster, but I have no recent data. Among other things, one can see that there is about a 10% penalty for invoking data convertors.

Scaling with Process Count

Here are HPCC pingpong latencies from a different, older platform. Though only two processes participate in the pingpong, the HPCC test reports that latency for different numbers of processes in the job. We see that OMPI performance slows dramatically as the number of processes is increased. Scali (data not available) does not show such a slowdown.

  np     min     avg     max
   2   2.688   2.719   2.750  usec
   4   2.812   2.875   3.000
   6   2.875   3.050   3.250
   8   2.875   3.299   3.625
  10   2.875   3.447   3.812
  12   3.063   3.687   4.375
  16   2.687   4.093   5.063
  20   2.812   4.492   6.000
  24   3.125   5.026   6.562
  28   3.250   5.326   7.250
  32   3.500   5.830   8.375
  36   3.750   6.199   8.938
  40   4.062   6.753  10.187

The data show large min-max variations in latency. These variations happen to depend on sender and receiver ranks. Here are latencies (rounded down to the nearest μsec) for the np=40 case as a function of sender and receiver rank:

  [40x40 table of per-rank pingpong latencies not reproduced here; values run from roughly 9-10 μsec between low-numbered ranks down to roughly 4 μsec between high-numbered ranks.]
We see that there is a strong dependence on process rank. Presumably, this is due to our polling loop: even after we receive our message, we still have to poll the higher-numbered ranks before we complete the receive operation.

Performance Measurements: After Optimization

We consider three metrics:

  * HPCC "pingpong" latency
  * OSU latency (0 bytes)
  * OSU latency (8 bytes)

We report data for:

  * OMPI "out of the box"
  * after implementation of steps 1-2 (send side)
  * after implementation of steps 1-3 (send and receive sides)
  * after implementation of steps 1-4 (send and receive sides, plus data convertor)

The data are from machines that I just happened to have available. There is a bit of noise in these results, but the implications, based on these and other measurements, are:

  * There is some improvement from the send side.
  * There is more improvement from the receive side.
  * The data-convertor improvements help a little more (a few percent) for non-null messages.
  * The degree of improvement depends on how fast the CPU is relative to the memory -- that is, how important software overheads are versus hardware latency.
      o If the CPU is fast (and hardware latency is the bottleneck), the improvements are smaller -- say, 20-30%.
      o If the CPU is slow (and software costs are the bottleneck), the improvements are more dramatic -- nearly a factor of 2 for non-null messages.
  * As np is increased, latency stays flat. This can represent a 10× or more improvement over out-of-the-box OMPI.

V20z

Here are results for a V20z (burl-ct-v20z-11):

                HPCC   OSU0   OSU8
  out of box     838    770    850   nsec
  Steps 1-2      862    770    860
  Steps 1-3      670    610    670
  Steps 1-4      642    580    610

F6900

Here are np=2 results from a 1.05-GHz (1.2?) UltraSPARC-IV F6900 server:

                HPCC   OSU0   OSU8
  out of box    3430   2770   3340   nsec
  Steps 1-2     2940   2660   3090
  Steps 1-3     1854   1650   1880
  Steps 1-4     1660   1640   1750

Here is the dependence on process count using HPCC:

         OMPI "out of the box"       optimized
  comm   --------------------   --------------------
  size    min     avg    max     min     avg    max
    2    2688    2719   2750    1750    1781   1812   nsec
    4    2812    2875   3000    1750    1802   1812
    6    2875    3050   3250    1687    1777   1812
    8    2875    3299   3625    1687    1773   1812
   10    2875    3447   3812    1687    1789   1812
   12    3063    3687   4375    1687    1796   1813
   16    2687    4093   5063    1500    1784   1875
   20    2812    4492   6000    1687    1788   1875
   24    3125    5026   6562    1562    1776   1875
   28    3250    5326   7250    1500    1764   1813
   32    3500    5830   8375    1562    1755   1875
   36    3750    6199   8938    1562    1755   1875
   40    4062    6753  10187    1500    1742   1812

Note:

  * At np=2, these optimizations lead to a 2× improvement in shared-memory latency.
  * Non-null messages incur more than a 10% penalty, which is largely addressed by our data-convertor optimization.
  * At larger np, we maintain our fast performance while OMPI "out of the box" keeps slowing down more and more.

M9000

Here are results for a 128-core M9000. I think the system has:

  * 2 hardware threads per core (but we use only one hardware thread per core)
  * 4 cores per socket
  * 4 sockets per board
  * 4 boards per (half?)
  * 2 (halves?) per system

As one separates the sender and receiver, hardware latency increases. Here is the hierarchy:

                   latency (nsec)          bandwidth (Mbyte/sec)
                out-of-box  fastpath      out-of-box  fastpath
  (on-socket?)      810        480           2000       2000
  (on-board?)      2050       1820           1900       1900
  (half?)          3030       2840           1680       1680
                   3150       2960           1660       1660
Note:

  * Latency benefits by some hundreds of nsec with fastpath.
  * This latency improvement is striking when the hardware latency is small, but less noticeable as the hardware latency increases.
  * Bandwidth is not very sensitive to hardware latency (due to prefetch) and not at all to fast-path optimizations.

Here are HPCC pingpong latencies for increasing process counts:

             out-of-box                   fastpath
   np   ----------------------    ---------------------
          min     avg     max      min     avg     max
    2     812     812     812      499     499     499
    4     874     921     999      437     494     562
    8     937    1847    2624      437    1249    1874
   16    1062    2430    2937      437    1557    1937
   32    1562    3850    5437      375    2211    2875
   64    2687    8329   15874      437    2535    3062
   80    3499   16854   41749      374    2647    3437
   96    3812   31159  100812      374    2717    3437
  128    5187  125774  335187      437    2793    3499

The improvements are tremendous:

  * At low np, latencies are low since sender and receiver can be colocated. Nevertheless, fast-path optimizations provided a nearly 2× improvement.
  * As np increases, fast-path latency also increases, but this is due to higher hardware latencies. Indeed, the "min" numbers even drop a little. The "max" fast-path numbers basically represent only the increase in hardware latency.
  * As np increases, OMPI "out of the box" latency suffers catastrophically. Not only is there the issue of more connections to poll, but the polling behavior of non-participating processes wreaks havoc on the performance of the measured processes.
  * We can separate the two sources of latency degradation by putting the np-2 non-participating processes to sleep. In that case, latency rises only to about 10-20 μsec. So, polling of many connections causes a substantial rise in latency, while the disturbance that hard-poll loops inflict on system performance is responsible for even more degradation.

Actually, even bandwidth benefits:

           out-of-box              fastpath
   np   ------------------    ------------------
         min    avg    max     min    avg    max
    2   2015   2034   2053    2028   2039   2051
    4   2002   2043   2077    1993   2032   2065
    8   1888   1959   2035    1897   1969   2088
   16   1863   1934   2046    1856   1937   2066
   32   1626   1796   2038    1581   1798   2068
   64   1557   1709   1969    1591   1729   2084
   80   1439   1619   1902    1561   1706   2059
   96   1281   1452   1722    1500   1689   2005
  128    677    835   1276     893   1671   1906

Here we see that even bandwidth suffers "out of the box" as the number of hard-spinning processes increases. Note the degradation in "out-of-box" average bandwidths as np increases. In contrast, the "fastpath" average holds up well. (The np=128 min fastpath number of 893 Mbyte/sec is poor, but analysis shows it to be a measurement outlier.)

MPI_Sendrecv()

We should also get these optimizations into MPI_Sendrecv() in order to speed up the HPCC "ring" results. E.g., here are latencies in μsec for a performance measurement based on HPCC "ring" tests:

  ==================================================
  np=64                        natural   random
  "out of box"                   11.7     10.9
  fast path                       8.3      6.2
  fast path and 100 warmups       3.5      3.6
  ==================================================
  np=128                       natural   random
  "out of box"                  242.9    226.1
  fast path                      56.6     37.0
  fast path and 100 warmups       4.2      4.1
  ==================================================

There happen to be two problems here:

  o We need fast-path optimizations in MPI_Sendrecv() for improved performance.
  o The MPI collective operation preceding the "ring" measurement has "ragged" exit times.
So, the "ring" timing > starts well before all of the > processes have entered that > measurement. This is a > separate OMPI performance > problem that must be handled > as well for good HPCC results. > *Open Issues > *Here are some open issues: > o *Side effects*. Should the > sendi and recvi functions > leave any side effects if they > do not complete the specified > operation? > o > > o To my taste, they should not. > o > > o Currently, however, the sendi > function is expected to > allocate a descriptor if it > can, even if it cannot > complete the entire send > operation. > o *recvi**: BTL and match > header*. An in-coming message > starts with a "match header", > with such data as MPI source > rank, MPI communicator, and > MPI tag for performing MPI > message matching. Presumably, > the BTL knows nothing about > this header. Message matching > is performed, for example, via > PML callback functions. We are > aggressively trying to > optimize this code path, > however, so we should consider > alternatives to that approach. > o > > o One alternative is simply for > the BTL to perform a > byte-by-byte comparison > between the received header > and the specified header. The > PML already tells the BTL how > many bytes are in the header. > o > > o One problem with this approach > is that the fast path would > not be able to support the > wildcard tag MPI_ANY_TAG. > o > > o Further, it leaves open the > question how one extracts > information (such as source or > tag) from this header for the > MPI_Status structure. > o > > o We can imagine a variety of > solutions here, but so far > we've implemented a very > simple (even if > architecturally distasteful) > solution: we hardwire > information (previously > private to the PML) about the > match header into the BTL. > o > > o That approach can be replaced > with other solutions. > o *MPI_Sendrecv()** support*. As > discussed earlier, we should > support fast-path > optimizations for "immediate" > send-receive operations. > Again, this may entail some > movement of current OMPI > architectural boundaries. > Other optimizations that are > needed for good HPCC results > include: > + reducing the degradation > due to hard spin waits > + improving the > performance of > collective operations > (which "artificially" > degrade HPCC "ring" test > results) > > ------------------------------------------------------------------------ > > _______________________________________________ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel >