Looks like it slowed down by about 20ns from the original patch. That is to be expected when latencies are this low. Results for the following are attached:
- Trunk r30215 sm and vader results for osu_latency.
- Trunk r30215 + patch take3 for both sm and vader.
- Trunk r30215 + patch + forced 16 byte match header for vader.

The last one is not completely surprising. The current match header is 14
bytes, which means the memcpy for the data is not aligned for a 64-bit
architecture. Might be worth looking at bumping the match header size up as
another optimization (see the alignment sketch at the end of this message).

-Nathan

On Fri, Jan 10, 2014 at 02:24:19PM +0100, George Bosilca wrote:
> Nathan,
>
> When you get access to the machine it might be interesting to show not only
> the after-patch performance but also what the trunk is getting on the same
> architecture.
>
> George.
>
> On Jan 8, 2014, at 18:09 , Nathan Hjelm <hje...@lanl.gov> wrote:
>
> > Yeah. It's hard to say what the results will look like on Haswell. I
> > expect they should show some improvement from George's change, but we
> > won't know until I can get to a Haswell node. Hopefully one becomes
> > available today.
> >
> > -Nathan
> >
> > On Wed, Jan 08, 2014 at 08:59:34AM -0800, Paul Hargrove wrote:
> >> Nevermind, since Nathan just clarified that the results are not
> >> comparable.
> >>
> >> -Paul [Sent from my phone]
> >>
> >> On Jan 8, 2014 8:58 AM, "Paul Hargrove" <phhargr...@lbl.gov> wrote:
> >>
> >> Interestingly enough, the 4MB latency actually improved significantly
> >> relative to the initial numbers.
> >>
> >> -Paul [Sent from my phone]
> >>
> >> On Jan 8, 2014 8:50 AM, "George Bosilca" <bosi...@icl.utk.edu> wrote:
> >>
> >> These results are way worse than the ones you sent in your previous
> >> email? What is the reason?
> >>
> >> George.
> >>
> >> On Jan 8, 2014, at 17:33 , Nathan Hjelm <hje...@lanl.gov> wrote:
> >>
> >>> Ah, good catch. A new version is attached that should eliminate the
> >>> race window for the multi-threaded case. Performance numbers are still
> >>> looking really good. We beat mvapich2 in the small-message ping-pong
> >>> by a good margin. See the results below. The latency difference for
> >>> large messages is probably due to a difference in the max send size
> >>> for vader vs. mvapich.
> >>>
> >>> To answer Pasha's question: I don't see a noticeable difference in
> >>> performance for btls with no sendi function (this includes ugni).
> >>> OpenIB should get a boost. I will test that once I get an allocation.
> >>>
> >>> CPU: Xeon E5-2670 @ 2.60 GHz
> >>>
> >>> Open MPI (-mca btl vader,self):
> >>> # OSU MPI Latency Test v4.1
> >>> # Size          Latency (us)
> >>> 0                       0.17
> >>> 1                       0.19
> >>> 2                       0.19
> >>> 4                       0.19
> >>> 8                       0.19
> >>> 16                      0.19
> >>> 32                      0.19
> >>> 64                      0.40
> >>> 128                     0.40
> >>> 256                     0.43
> >>> 512                     0.52
> >>> 1024                    0.67
> >>> 2048                    0.94
> >>> 4096                    1.44
> >>> 8192                    2.04
> >>> 16384                   3.47
> >>> 32768                   6.10
> >>> 65536                   9.38
> >>> 131072                 16.47
> >>> 262144                 29.63
> >>> 524288                 54.81
> >>> 1048576               106.63
> >>> 2097152               206.84
> >>> 4194304               421.26
> >>>
> >>> mvapich2 1.9:
> >>> # OSU MPI Latency Test
> >>> # Size          Latency (us)
> >>> 0                       0.23
> >>> 1                       0.23
> >>> 2                       0.23
> >>> 4                       0.23
> >>> 8                       0.23
> >>> 16                      0.28
> >>> 32                      0.28
> >>> 64                      0.39
> >>> 128                     0.40
> >>> 256                     0.40
> >>> 512                     0.42
> >>> 1024                    0.51
> >>> 2048                    0.71
> >>> 4096                    1.02
> >>> 8192                    1.60
> >>> 16384                   3.47
> >>> 32768                   5.05
> >>> 65536                   8.06
> >>> 131072                 14.82
> >>> 262144                 28.15
> >>> 524288                 53.69
> >>> 1048576               127.47
> >>> 2097152               235.58
> >>> 4194304               683.90
> >>>
> >>> -Nathan
> >>>
> >>> On Tue, Jan 07, 2014 at 06:23:13PM -0700, George Bosilca wrote:
> >>>> The local request is not correctly released, leading to an assert in
> >>>> debug mode. This is because you avoid calling
> >>>> MCA_PML_BASE_RECV_REQUEST_FINI, a fact that leaves the request in an
> >>>> ACTIVE state, a condition carefully checked during the call to the
> >>>> destructor.
> >>>>
> >>>> I attached a second patch that fixes the issue above and implements a
> >>>> similar optimization for the blocking send.
> >>>>
> >>>> Unfortunately, this is not enough. The mca_pml_ob1_send_inline
> >>>> optimization is horribly wrong in the multithreaded case, as it alters
> >>>> the send_sequence without storing it. If you create a gap in the
> >>>> send_sequence a deadlock will __definitively__ occur. I strongly
> >>>> suggest you turn off the mca_pml_ob1_send_inline optimization in the
> >>>> multithreaded case. All the other optimizations should be safe in all
> >>>> cases.
> >>>>
> >>>> George.
> >>>>
> >>>> On Jan 8, 2014, at 01:15 , Shamis, Pavel <sham...@ornl.gov> wrote:
> >>>>
> >>>>> Overall it looks good. It would be helpful to validate performance
> >>>>> numbers for other interconnects as well.
> >>>>>
> >>>>> -Pasha
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Nathan
> >>>>>> Hjelm
> >>>>>> Sent: Tuesday, January 07, 2014 6:45 PM
> >>>>>> To: Open MPI Developers List
> >>>>>> Subject: [OMPI devel] RFC: OB1 optimizations
> >>>>>>
> >>>>>> What: Push some ob1 optimizations to the trunk and 1.7.5.
> >>>>>>
> >>>>>> What: This patch contains two optimizations:
> >>>>>>
> >>>>>> - Introduce a fast send path for blocking send calls. This path uses
> >>>>>>   the btl sendi function to put the data on the wire without the
> >>>>>>   need for setting up a send request. In the case of btl/vader this
> >>>>>>   can also avoid allocating/initializing a new fragment. With
> >>>>>>   btl/vader this optimization improves small-message latency by
> >>>>>>   50-200ns in ping-pong type benchmarks. Larger messages may take a
> >>>>>>   small hit in the range of 10-20ns.
> >>>>>>
> >>>>>> - Use a stack-allocated receive request for blocking receives. This
> >>>>>>   optimization saves the extra instructions associated with
> >>>>>>   accessing the receive request free list. I was able to get another
> >>>>>>   50-200ns improvement in the small-message ping-pong with this
> >>>>>>   optimization.
> >>>>>>   I see no hit for larger messages.
> >>>>>>
> >>>>>> When: These changes touch the critical path in ob1 and are targeted
> >>>>>> for 1.7.5. As such I will set a moderately long timeout. Timeout set
> >>>>>> for next Friday (Jan 17).
> >>>>>>
> >>>>>> Some results from osu_latency on haswell:
> >>>>>>
> >>>>>> [hjelmn@cn143 pt2pt]$ mpirun -n 2 --bind-to core -mca btl vader,self
> >>>>>> ./osu_latency
> >>>>>> # OSU MPI Latency Test v4.0.1
> >>>>>> # Size          Latency (us)
> >>>>>> 0                       0.11
> >>>>>> 1                       0.14
> >>>>>> 2                       0.14
> >>>>>> 4                       0.14
> >>>>>> 8                       0.14
> >>>>>> 16                      0.14
> >>>>>> 32                      0.15
> >>>>>> 64                      0.18
> >>>>>> 128                     0.36
> >>>>>> 256                     0.37
> >>>>>> 512                     0.46
> >>>>>> 1024                    0.56
> >>>>>> 2048                    0.80
> >>>>>> 4096                    1.12
> >>>>>> 8192                    1.68
> >>>>>> 16384                   2.98
> >>>>>> 32768                   5.10
> >>>>>> 65536                   8.12
> >>>>>> 131072                 14.07
> >>>>>> 262144                 25.30
> >>>>>> 524288                 47.40
> >>>>>> 1048576                91.71
> >>>>>> 2097152               195.56
> >>>>>> 4194304               487.05
> >>>>>>
> >>>>>> Patch Attached.
> >>>>>>
> >>>>>> -Nathan
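To make the sendi fast path and George's send_sequence warning above concrete, here is a minimal sketch in C. Every name in it (peer_t, btl_sendi_try, send_with_request) is a hypothetical placeholder rather than an actual mca_pml_ob1 symbol; the only point is to show where a sequence-number gap can arise when the inline path has to fall back.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-peer state: ob1 matches a peer's messages on the
 * receive side strictly in send_sequence order. */
typedef struct {
    int32_t send_sequence;
} peer_t;

/* Hypothetical stand-ins for the btl sendi fast path and the regular
 * request-based send path. */
bool btl_sendi_try(peer_t *peer, const void *buf, size_t len, uint16_t seq);
int  send_with_request(peer_t *peer, const void *buf, size_t len,
                       uint16_t seq);

int blocking_send(peer_t *peer, const void *buf, size_t len)
{
    /* Claim the next sequence number for this peer.  In a multithreaded
     * build this counter is shared with other senders to the same peer. */
    uint16_t seq = (uint16_t) __atomic_add_fetch(&peer->send_sequence, 1,
                                                 __ATOMIC_RELAXED);

    /* Fast path: hand the data to the btl's sendi function, skipping the
     * allocation and setup of a full send request. */
    if (btl_sendi_try(peer, buf, len, seq)) {
        return 0;
    }

    /* Fallback: the claimed sequence number must still go on the wire.
     * If the fallback allocated a fresh number instead, `seq` would never
     * be sent, the receiver would wait for it forever, and matching for
     * this peer would deadlock; that is the gap described above. */
    return send_with_request(peer, buf, len, seq);
}

Either reusing the claimed number in the fallback, as sketched, or deferring the increment until the inline send is known to succeed avoids the gap; which approach the actual patch takes is not shown here.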
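In the same spirit, a rough sketch of the stack-allocated receive request optimization described in the RFC above. The request type and the init/start/wait/fini helpers are illustrative only; the real code goes through the ob1 receive-request machinery, including the MCA_PML_BASE_RECV_REQUEST_FINI step George points out.

#include <stddef.h>

/* Illustrative stand-in for an ob1 receive request. */
typedef struct {
    void  *buffer;
    size_t length;
    int    source;
    int    tag;
    int    status;
} recv_request_t;

/* Hypothetical helpers mirroring the init/start/wait/fini steps of a
 * receive request. */
void recv_request_init(recv_request_t *req, void *buf, size_t len,
                       int src, int tag);
void recv_request_start(recv_request_t *req);
void recv_request_wait(recv_request_t *req);
void recv_request_fini(recv_request_t *req);

int blocking_recv(void *buf, size_t len, int src, int tag)
{
    /* A blocking receive never outlives the call, so the request can live
     * on the stack: no free-list dequeue/enqueue and none of the atomics
     * that go with it on the critical path. */
    recv_request_t req;

    recv_request_init(&req, buf, len, src, tag);
    recv_request_start(&req);
    recv_request_wait(&req);   /* returns once the message has arrived */

    /* The request must still be moved out of its ACTIVE state before it
     * disappears (cf. MCA_PML_BASE_RECV_REQUEST_FINI above). */
    recv_request_fini(&req);

    return req.status;
}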
Attached osu_latency results (configurations listed at the top of the message):

# OSU MPI Latency Test v4.0.1
# Size          Latency (us)
0                       0.25
1                       0.27
2                       0.28
4                       0.27
8                       0.28
16                      0.27
32                      0.29
64                      0.29
128                     0.33
256                     0.34
512                     0.39
1024                    0.45
2048                    0.60
4096                    1.00
8192                    1.43
16384                   2.68
32768                   4.63
65536                   7.23
131072                 13.33
262144                 23.96
524288                 44.36
1048576                85.01
2097152               180.63
4194304               456.07
# OSU MPI Latency Test v4.0.1
# Size          Latency (us)
0                       0.18
1                       0.23
2                       0.23
4                       0.24
8                       0.24
16                      0.24
32                      0.24
64                      0.26
128                     0.40
256                     0.42
512                     0.49
1024                    0.58
2048                    0.82
4096                    1.13
8192                    1.68
16384                   2.97
32768                   5.08
65536                   7.98
131072                 14.05
262144                 25.16
524288                 47.46
1048576                91.99
2097152               192.97
4194304               493.23
# OSU MPI Latency Test v4.0.1
# Size          Latency (us)
0                       0.14
1                       0.15
2                       0.15
4                       0.15
8                       0.15
16                      0.16
32                      0.16
64                      0.19
128                     0.33
256                     0.35
512                     0.48
1024                    0.60
2048                    0.84
4096                    1.13
8192                    1.71
16384                   2.95
32768                   5.08
65536                   8.07
131072                 14.06
262144                 25.37
524288                 47.58
1048576                91.82
2097152               201.21
4194304               549.99
# OSU MPI Latency Test v4.0.1
# Size          Latency (us)
0                       0.21
1                       0.23
2                       0.23
4                       0.23
8                       0.23
16                      0.23
32                      0.24
64                      0.24
128                     0.28
256                     0.30
512                     0.38
1024                    0.43
2048                    0.60
4096                    0.99
8192                    1.42
16384                   2.68
32768                   4.60
65536                   7.43
131072                 13.27
262144                 24.07
524288                 44.47
1048576                85.09
2097152               180.43
4194304               457.18
# OSU MPI Latency Test v4.0.1
# Size          Latency (us)
0                       0.13
1                       0.16
2                       0.16
4                       0.16
8                       0.16
16                      0.16
32                      0.17
64                      0.20
128                     0.36
256                     0.35
512                     0.43
1024                    0.47
2048                    0.64
4096                    1.18
8192                    1.72
16384                   3.04
32768                   5.20
65536                   8.13
131072                 14.17
262144                 25.34
524288                 47.57
1048576                91.64
2097152               201.91
4194304               547.44
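Back to the match-header alignment point at the top of this message: a minimal sketch, not the actual ob1/vader code, of why a 14-byte header leaves the payload copy misaligned while a 16-byte header keeps it 8-byte aligned. The constants and the function are illustrative.

#include <stddef.h>
#include <string.h>

#define HDR_LEN_CURRENT 14   /* current match header length            */
#define HDR_LEN_PADDED  16   /* header padded to a multiple of 8 bytes */

/* Copy user data into a fragment buffer directly behind the header.
 * The fragment itself is assumed to be at least 8-byte aligned. */
static void pack_payload(unsigned char *frag, size_t hdr_len,
                         const void *payload, size_t len)
{
    /* With hdr_len == 14 the destination frag + 14 is never 8-byte
     * aligned; with hdr_len == 16 the payload copy starts on an 8-byte
     * boundary, which memcpy can exploit on 64-bit architectures. */
    memcpy(frag + hdr_len, payload, len);
}

The cost of the padding is two extra bytes per message; the benefit is an aligned payload copy, which is presumably what the forced-16-byte-header results above are showing.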