These results are way worse than the ones you sent in your previous email. What is the reason?
George.

On Jan 8, 2014, at 17:33 , Nathan Hjelm <hje...@lanl.gov> wrote:

> Ah, good catch. A new version is attached that should eliminate the race
> window for the multi-threaded case. Performance numbers are still looking
> really good. We beat mvapich2 in the small-message ping-pong by a good
> margin. See the results below. The latency difference for large messages
> is probably due to a difference in the max send size for vader vs mvapich.
>
> To answer Pasha's question: I don't see a noticeable difference in
> performance for btls with no sendi function (this includes ugni). OpenIB
> should get a boost. I will test that once I get an allocation.
>
> CPU: Xeon E5-2670 @ 2.60 GHz
>
> Open MPI (-mca btl vader,self):
> # OSU MPI Latency Test v4.1
> # Size        Latency (us)
> 0                     0.17
> 1                     0.19
> 2                     0.19
> 4                     0.19
> 8                     0.19
> 16                    0.19
> 32                    0.19
> 64                    0.40
> 128                   0.40
> 256                   0.43
> 512                   0.52
> 1024                  0.67
> 2048                  0.94
> 4096                  1.44
> 8192                  2.04
> 16384                 3.47
> 32768                 6.10
> 65536                 9.38
> 131072               16.47
> 262144               29.63
> 524288               54.81
> 1048576             106.63
> 2097152             206.84
> 4194304             421.26
>
> mvapich2 1.9:
> # OSU MPI Latency Test
> # Size        Latency (us)
> 0                     0.23
> 1                     0.23
> 2                     0.23
> 4                     0.23
> 8                     0.23
> 16                    0.28
> 32                    0.28
> 64                    0.39
> 128                   0.40
> 256                   0.40
> 512                   0.42
> 1024                  0.51
> 2048                  0.71
> 4096                  1.02
> 8192                  1.60
> 16384                 3.47
> 32768                 5.05
> 65536                 8.06
> 131072               14.82
> 262144               28.15
> 524288               53.69
> 1048576             127.47
> 2097152             235.58
> 4194304             683.90
>
> -Nathan
>
> On Tue, Jan 07, 2014 at 06:23:13PM -0700, George Bosilca wrote:
>> The local request is not correctly released, leading to an assert in
>> debug mode. This is because you avoid calling
>> MCA_PML_BASE_RECV_REQUEST_FINI, which leaves the request in an ACTIVE
>> state, a condition carefully checked in the destructor.
>>
>> I attached a second patch that fixes the issue above and implements a
>> similar optimization for the blocking send.
>>
>> Unfortunately, this is not enough. The mca_pml_ob1_send_inline
>> optimization is horribly wrong in the multithreaded case, as it alters
>> the send_sequence without storing it. If you create a gap in the
>> send_sequence, a deadlock will __definitely__ occur. I strongly suggest
>> you turn off the mca_pml_ob1_send_inline optimization in the
>> multithreaded case. All the other optimizations should be safe in all
>> cases.
>>
>> George.
>>
>> On Jan 8, 2014, at 01:15 , Shamis, Pavel <sham...@ornl.gov> wrote:
>>
>>> Overall it looks good. It would be helpful to validate performance
>>> numbers for other interconnects as well.
>>>
>>> -Pasha
>>>
>>>> -----Original Message-----
>>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Nathan
>>>> Hjelm
>>>> Sent: Tuesday, January 07, 2014 6:45 PM
>>>> To: Open MPI Developers List
>>>> Subject: [OMPI devel] RFC: OB1 optimizations
>>>>
>>>> What: Push some ob1 optimizations to the trunk and 1.7.5.
>>>>
>>>> What: This patch contains two optimizations:
>>>>
>>>> - Introduce a fast send path for blocking send calls. This path uses
>>>>   the btl sendi function to put the data on the wire without the need
>>>>   for setting up a send request (see the sketch after this list). In
>>>>   the case of btl/vader this can also avoid allocating/initializing a
>>>>   new fragment. With btl/vader this optimization improves small-message
>>>>   latency by 50-200ns in ping-pong type benchmarks. Larger messages
>>>>   may take a small hit in the range of 10-20ns.
>>>>
>>>> - Use a stack-allocated receive request for blocking receives. This
>>>>   optimization saves the extra instructions associated with accessing
>>>>   the receive request free list. I was able to get another 50-200ns
>>>>   improvement in the small-message ping-pong with this optimization.
>>>>   I see no hit for larger messages.
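[Editor's note: the fast path in the first bullet can be pictured with a short, self-contained C sketch. All names here (btl_t, sendi, send_via_request, toy_sendi, the 64-byte eager limit) are hypothetical stand-ins chosen for illustration, not Open MPI's internal API or the attached patch. The point is only the shape of the optimization: try the BTL's "send immediate" hook first, and fall back to the full send-request path when it declines.]

    /* Minimal sketch of a sendi-style fast path for a blocking send.
     * All types and names are hypothetical, not Open MPI internals. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    typedef struct btl {
        /* Optional "send immediate" hook: returns true if the payload was
         * put on the wire without needing a full send request/fragment. */
        bool (*sendi)(struct btl *btl, const void *buf, size_t len);
    } btl_t;

    /* Slow path: allocate and drive a full send request (stubbed here). */
    static int send_via_request(btl_t *btl, const void *buf, size_t len) {
        (void)btl; (void)buf;
        printf("slow path: %zu bytes via send request\n", len);
        return 0;
    }

    /* Blocking send fast path: try sendi first, fall back to the request
     * path. Skipping request setup for small messages is what saves the
     * ~50-200ns reported in this thread. */
    static int blocking_send(btl_t *btl, const void *buf, size_t len) {
        if (btl->sendi != NULL && btl->sendi(btl, buf, len)) {
            return 0;                 /* data is already on the wire */
        }
        return send_via_request(btl, buf, len);
    }

    /* Toy sendi that accepts anything fitting a pretend eager buffer. */
    static bool toy_sendi(btl_t *btl, const void *buf, size_t len) {
        (void)btl; (void)buf;
        return len <= 64;             /* pretend 64 bytes is the eager limit */
    }

    int main(void) {
        btl_t btl = { .sendi = toy_sendi };
        char small[8] = "hi";
        char large[4096] = { 0 };
        blocking_send(&btl, small, sizeof small);  /* takes the fast path */
        blocking_send(&btl, large, sizeof large);  /* falls back */
        return 0;
    }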
>>>>
>>>> When: These changes touch the critical path in ob1 and are targeted
>>>> for 1.7.5. As such I will set a moderately long timeout. Timeout set
>>>> for next Friday (Jan 17).
>>>>
>>>> Some results from osu_latency on haswell:
>>>>
>>>> [hjelmn@cn143 pt2pt]$ mpirun -n 2 --bind-to core -mca btl vader,self
>>>> ./osu_latency
>>>> # OSU MPI Latency Test v4.0.1
>>>> # Size        Latency (us)
>>>> 0                     0.11
>>>> 1                     0.14
>>>> 2                     0.14
>>>> 4                     0.14
>>>> 8                     0.14
>>>> 16                    0.14
>>>> 32                    0.15
>>>> 64                    0.18
>>>> 128                   0.36
>>>> 256                   0.37
>>>> 512                   0.46
>>>> 1024                  0.56
>>>> 2048                  0.80
>>>> 4096                  1.12
>>>> 8192                  1.68
>>>> 16384                 2.98
>>>> 32768                 5.10
>>>> 65536                 8.12
>>>> 131072               14.07
>>>> 262144               25.30
>>>> 524288               47.40
>>>> 1048576              91.71
>>>> 2097152             195.56
>>>> 4194304             487.05
>>>>
>>>> Patch Attached.
>>>>
>>>> -Nathan
>
> <ob1_optimization_take3.patch>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
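[Editor's note: George's warning above, that mca_pml_ob1_send_inline alters the send_sequence without storing it, deserves a concrete illustration. Below is a minimal C11 sketch of the gap hazard, assuming an in-order delivery protocol; the names (send_sequence, try_inline_send_buggy, send_safe, btl_can_send_now) are illustrative stand-ins, not ob1's actual code.]

    /* Sketch of the sequence-gap hazard. In an in-order protocol the
     * receiver delivers message N only after it has seen N-1, so a
     * sequence number that is consumed but never sent stalls every
     * message behind it -- the deadlock George warns about. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static atomic_uint send_sequence;

    /* Buggy inline-send pattern: the counter is advanced before we know
     * the fast path can actually deliver the message. */
    static bool try_inline_send_buggy(bool btl_can_send_now) {
        unsigned seq = atomic_fetch_add(&send_sequence, 1); /* seq consumed */
        if (!btl_can_send_now) {
            /* BUG: we bail out without sending seq and without handing it
             * to the fallback path, leaving a permanent gap. */
            return false;
        }
        printf("inline send, sequence %u\n", seq);
        return true;
    }

    /* Safe pattern: every drawn sequence number is sent on some path, so
     * no gap can appear regardless of which path carries the message. */
    static void send_safe(bool btl_can_send_now) {
        unsigned seq = atomic_fetch_add(&send_sequence, 1);
        if (btl_can_send_now) {
            printf("inline send, sequence %u\n", seq);
        } else {
            printf("fallback send, sequence %u\n", seq); /* seq still used */
        }
    }

    int main(void) {
        try_inline_send_buggy(false); /* consumes sequence 0, never sends it */
        send_safe(true);              /* sends sequence 1 */
        /* An in-order receiver now waits forever for sequence 0. */
        return 0;
    }

[The design point: draw a sequence number only on a path that is guaranteed to consume it, so every number is eventually matched on the receive side.]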