Interestingly enough, the 4 MB latency actually improved significantly relative to the initial numbers.
-Paul

[Sent from my phone]

On Jan 8, 2014 8:50 AM, "George Bosilca" <bosi...@icl.utk.edu> wrote:
> These results are way worse than the ones you sent in your previous
> email. What is the reason?
>
>   George.
>
> On Jan 8, 2014, at 17:33 , Nathan Hjelm <hje...@lanl.gov> wrote:
>
> > Ah, good catch. A new version is attached that should eliminate the
> > race window for the multi-threaded case. Performance numbers are
> > still looking really good. We beat mvapich2 in the small-message
> > ping-pong by a good margin. See the results below. The latency
> > difference for large messages is probably due to a difference in the
> > max send size for vader vs. mvapich.
> >
> > To answer Pasha's question: I don't see a noticeable difference in
> > performance for btls with no sendi function (this includes ugni).
> > OpenIB should get a boost. I will test that once I get an allocation.
> >
> > CPU: Xeon E5-2670 @ 2.60 GHz
> >
> > Open MPI (-mca btl vader,self):
> > # OSU MPI Latency Test v4.1
> > # Size          Latency (us)
> > 0                       0.17
> > 1                       0.19
> > 2                       0.19
> > 4                       0.19
> > 8                       0.19
> > 16                      0.19
> > 32                      0.19
> > 64                      0.40
> > 128                     0.40
> > 256                     0.43
> > 512                     0.52
> > 1024                    0.67
> > 2048                    0.94
> > 4096                    1.44
> > 8192                    2.04
> > 16384                   3.47
> > 32768                   6.10
> > 65536                   9.38
> > 131072                 16.47
> > 262144                 29.63
> > 524288                 54.81
> > 1048576               106.63
> > 2097152               206.84
> > 4194304               421.26
> >
> > mvapich2 1.9:
> > # OSU MPI Latency Test
> > # Size          Latency (us)
> > 0                       0.23
> > 1                       0.23
> > 2                       0.23
> > 4                       0.23
> > 8                       0.23
> > 16                      0.28
> > 32                      0.28
> > 64                      0.39
> > 128                     0.40
> > 256                     0.40
> > 512                     0.42
> > 1024                    0.51
> > 2048                    0.71
> > 4096                    1.02
> > 8192                    1.60
> > 16384                   3.47
> > 32768                   5.05
> > 65536                   8.06
> > 131072                 14.82
> > 262144                 28.15
> > 524288                 53.69
> > 1048576               127.47
> > 2097152               235.58
> > 4194304               683.90
> >
> > -Nathan
> >
> > On Tue, Jan 07, 2014 at 06:23:13PM -0700, George Bosilca wrote:
> >> The local request is not correctly released, leading to an assert in
> >> debug mode. This is because you avoid calling
> >> MCA_PML_BASE_RECV_REQUEST_FINI, which leaves the request in an ACTIVE
> >> state, a condition carefully checked during the call to the
> >> destructor.
> >>
> >> I attached a second patch that fixes the issue above and implements
> >> a similar optimization for the blocking send.
> >>
> >> Unfortunately, this is not enough. The mca_pml_ob1_send_inline
> >> optimization is horribly wrong in the multithreaded case, as it
> >> alters the send_sequence without storing it. If you create a gap in
> >> the send_sequence, a deadlock will definitely occur. I strongly
> >> suggest you turn off the mca_pml_ob1_send_inline optimization in the
> >> multithreaded case. All the other optimizations should be safe in
> >> all cases.
> >>
> >>   George.
> >>
> >> On Jan 8, 2014, at 01:15 , Shamis, Pavel <sham...@ornl.gov> wrote:
> >>
> >>> Overall it looks good. It would be helpful to validate performance
> >>> numbers for other interconnects as well.
> >>>
> >>> -Pasha
> >>>
> >>>> -----Original Message-----
> >>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Nathan Hjelm
> >>>> Sent: Tuesday, January 07, 2014 6:45 PM
> >>>> To: Open MPI Developers List
> >>>> Subject: [OMPI devel] RFC: OB1 optimizations
> >>>>
> >>>> What: Push some ob1 optimizations to the trunk and 1.7.5.
> >>>>
> >>>> Why: This patch contains two optimizations:
> >>>>
> >>>> - Introduce a fast send path for blocking send calls. This path
> >>>>   uses the btl sendi function to put the data on the wire without
> >>>>   the need for setting up a send request. In the case of btl/vader
> >>>>   this can also avoid allocating/initializing a new fragment. With
> >>>>   btl/vader this optimization improves small-message latency by
> >>>>   50-200 ns in ping-pong type benchmarks. Larger messages may take
> >>>>   a small hit in the range of 10-20 ns.
> >>>>
> >>>> - Use a stack-allocated receive request for blocking receives. This
> >>>>   optimization saves the extra instructions associated with
> >>>>   accessing the receive-request free list. I was able to get
> >>>>   another 50-200 ns improvement in the small-message ping-pong with
> >>>>   this optimization. I see no hit for larger messages.
> >>>>
> >>>> When: These changes touch the critical path in ob1 and are targeted
> >>>> for 1.7.5. As such I will set a moderately long timeout: next
> >>>> Friday (Jan 17).
> >>>>
> >>>> Some results from osu_latency on Haswell:
> >>>>
> >>>> [hjelmn@cn143 pt2pt]$ mpirun -n 2 --bind-to core -mca btl vader,self ./osu_latency
> >>>> # OSU MPI Latency Test v4.0.1
> >>>> # Size          Latency (us)
> >>>> 0                       0.11
> >>>> 1                       0.14
> >>>> 2                       0.14
> >>>> 4                       0.14
> >>>> 8                       0.14
> >>>> 16                      0.14
> >>>> 32                      0.15
> >>>> 64                      0.18
> >>>> 128                     0.36
> >>>> 256                     0.37
> >>>> 512                     0.46
> >>>> 1024                    0.56
> >>>> 2048                    0.80
> >>>> 4096                    1.12
> >>>> 8192                    1.68
> >>>> 16384                   2.98
> >>>> 32768                   5.10
> >>>> 65536                   8.12
> >>>> 131072                 14.07
> >>>> 262144                 25.30
> >>>> 524288                 47.40
> >>>> 1048576                91.71
> >>>> 2097152               195.56
> >>>> 4194304               487.05
> >>>>
> >>>> Patch attached.
> >>>>
> >>>> -Nathan
> >
> > <ob1_optimization_take3.patch>
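
For context on the fast-path discussion above, a minimal sketch of the idea follows. This is not the attached ob1 patch: the types and names (peer_t, blocking_send, send_via_request) are invented for illustration, and the real code in ompi/mca/pml/ob1 also handles matching, datatype packing, BTL descriptor flags, and atomic sequence updates that are elided here.

    /*
     * Minimal sketch (not the attached patch) of a blocking-send fast
     * path that tries the btl's send-immediate (sendi) hook before
     * falling back to a full send request.  All names are hypothetical.
     */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct endpoint endpoint_t;

    /* sendi hook: returns true if the btl accepted the data immediately. */
    typedef bool (*sendi_fn_t)(endpoint_t *ep, const void *buf, size_t len,
                               uint16_t seq);

    typedef struct {
        sendi_fn_t  btl_sendi;      /* NULL if the btl provides no sendi   */
        endpoint_t *endpoint;
        uint16_t    send_sequence;  /* per-peer ordering counter           */
    } peer_t;

    /* Slow path: in the real PML this allocates a send request from the
     * free list, packs the datatype, and schedules the fragments.  Here
     * it is a stub so the sketch stays self-contained. */
    static int send_via_request(peer_t *peer, const void *buf, size_t len,
                                uint16_t seq)
    {
        (void) peer; (void) buf; (void) len; (void) seq;
        return 0;
    }

    int blocking_send(peer_t *peer, const void *buf, size_t len)
    {
        /* Compute, but do not yet publish, the next sequence number.
         * The counter is only advanced once some message actually
         * carries the value; otherwise the receiver waits forever for
         * the missing sequence number (the gap/deadlock George
         * describes).  In a multithreaded build this read-modify-write
         * would additionally have to be atomic. */
        uint16_t seq = (uint16_t)(peer->send_sequence + 1);

        if (NULL != peer->btl_sendi &&
            peer->btl_sendi(peer->endpoint, buf, len, seq)) {
            peer->send_sequence = seq;  /* fast path: no send request used */
            return 0;
        }

        /* sendi declined (no space, message too large, or no sendi at
         * all): fall back to the request path with the *same* sequence
         * number so no gap is created. */
        int rc = send_via_request(peer, buf, len, seq);
        if (0 == rc) {
            peer->send_sequence = seq;
        }
        return rc;
    }

The latency win Nathan reports comes from skipping the send-request allocation and initialization on the common path. George's objection maps onto the comment above: if the sequence counter is advanced for a message that never carries that value, the receiver stalls on the missing sequence number, so the fallback must reuse (or never consume) the same value, and in a threaded build the counter update itself needs atomic handling.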