Nathan,

When you get access to the machine, it might be interesting to show not only the after-patch performance but also what the trunk gets on the same architecture.
  George.

On Jan 8, 2014, at 18:09, Nathan Hjelm <hje...@lanl.gov> wrote:

> Yeah. It's hard to say what the results will look like on Haswell. I
> expect they should show some improvement from George's change, but we
> won't know until I can get to a Haswell node. Hopefully one becomes
> available today.
>
> -Nathan
>
> On Wed, Jan 08, 2014 at 08:59:34AM -0800, Paul Hargrove wrote:
>> Nevermind, since Nathan just clarified that the results are not
>> comparable.
>>
>> -Paul [Sent from my phone]
>>
>> On Jan 8, 2014 8:58 AM, "Paul Hargrove" <phhargr...@lbl.gov> wrote:
>>
>> Interestingly enough, the 4MB latency actually improved significantly
>> relative to the initial numbers.
>>
>> -Paul [Sent from my phone]
>>
>> On Jan 8, 2014 8:50 AM, "George Bosilca" <bosi...@icl.utk.edu> wrote:
>>
>> These results are much worse than the ones you sent in your previous
>> email. What is the reason?
>>
>> George.
>>
>> On Jan 8, 2014, at 17:33, Nathan Hjelm <hje...@lanl.gov> wrote:
>>
>>> Ah, good catch. A new version is attached that should eliminate the
>>> race window for the multi-threaded case. Performance numbers are still
>>> looking really good. We beat mvapich2 in the small-message ping-pong
>>> by a good margin. See the results below. The latency difference for
>>> large messages is probably due to a difference in the max send size
>>> for vader vs. mvapich.
>>>
>>> To answer Pasha's question: I don't see a noticeable difference in
>>> performance for BTLs with no sendi function (this includes ugni).
>>> OpenIB should get a boost. I will test that once I get an allocation.
>>>
>>> CPU: Xeon E5-2670 @ 2.60 GHz
>>>
>>> Open MPI (-mca btl vader,self):
>>> # OSU MPI Latency Test v4.1
>>> # Size        Latency (us)
>>> 0                     0.17
>>> 1                     0.19
>>> 2                     0.19
>>> 4                     0.19
>>> 8                     0.19
>>> 16                    0.19
>>> 32                    0.19
>>> 64                    0.40
>>> 128                   0.40
>>> 256                   0.43
>>> 512                   0.52
>>> 1024                  0.67
>>> 2048                  0.94
>>> 4096                  1.44
>>> 8192                  2.04
>>> 16384                 3.47
>>> 32768                 6.10
>>> 65536                 9.38
>>> 131072               16.47
>>> 262144               29.63
>>> 524288               54.81
>>> 1048576             106.63
>>> 2097152             206.84
>>> 4194304             421.26
>>>
>>> mvapich2 1.9:
>>> # OSU MPI Latency Test
>>> # Size        Latency (us)
>>> 0                     0.23
>>> 1                     0.23
>>> 2                     0.23
>>> 4                     0.23
>>> 8                     0.23
>>> 16                    0.28
>>> 32                    0.28
>>> 64                    0.39
>>> 128                   0.40
>>> 256                   0.40
>>> 512                   0.42
>>> 1024                  0.51
>>> 2048                  0.71
>>> 4096                  1.02
>>> 8192                  1.60
>>> 16384                 3.47
>>> 32768                 5.05
>>> 65536                 8.06
>>> 131072               14.82
>>> 262144               28.15
>>> 524288               53.69
>>> 1048576             127.47
>>> 2097152             235.58
>>> 4194304             683.90
>>>
>>> -Nathan
>>>
>>> On Tue, Jan 07, 2014 at 06:23:13PM -0700, George Bosilca wrote:
>>>> The local request is not correctly released, leading to an assert in
>>>> debug mode. This is because you avoid calling
>>>> MCA_PML_BASE_RECV_REQUEST_FINI, which leaves the request in an ACTIVE
>>>> state, a condition carefully checked during the call to the destructor.
>>>>
>>>> I attached a second patch that fixes the issue above and implements a
>>>> similar optimization for the blocking send.
>>>>
>>>> Unfortunately, this is not enough. The mca_pml_ob1_send_inline
>>>> optimization is horribly wrong in the multithreaded case, as it alters
>>>> the send_sequence without storing it. If you create a gap in the
>>>> send_sequence, a deadlock will __definitely__ occur. I strongly suggest
>>>> you turn off the mca_pml_ob1_send_inline optimization in the
>>>> multithreaded case. All the other optimizations should be safe in all
>>>> cases.
>>>>
>>>> George.
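To make the failure mode above concrete, here is a minimal standalone sketch in plain C11. The names (try_send_inline, send_sequence, btl_can_send_now) are invented for illustration and this is not the actual ob1 code path; it only models the in-order matching assumption: once a sequence number has been claimed, it must either go on the wire or be handed to the slow-path request, otherwise the receiver waits forever for the missing number.

    /* Simplified model of the fast send path's sequence handling (not the
     * actual ob1 code; names are illustrative). */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    static _Atomic uint16_t send_sequence = 0;   /* shared per-peer counter */

    /* Fast path: claim the next sequence number, then attempt an eager send. */
    static bool try_send_inline(bool btl_can_send_now)
    {
        uint16_t seq = atomic_fetch_add(&send_sequence, 1); /* number is consumed here */

        if (btl_can_send_now) {
            printf("sent fragment with seq %u\n", (unsigned) seq);
            return true;
        }

        /* The problem being described: bailing out here without either sending
         * a fragment that carries 'seq' or handing 'seq' to the slow-path send
         * request leaves a hole in the sequence.  The receiver matches in
         * order, so it waits forever for the missing number and every later
         * message from this sender stays unmatched: a deadlock. */
        return false;
    }

    int main(void)
    {
        try_send_inline(true);    /* seq 0 goes on the wire */
        try_send_inline(false);   /* seq 1 is claimed but never sent: the gap */
        try_send_inline(true);    /* seq 2 arrives; the receiver still waits for 1 */
        return 0;
    }

In the single-threaded case the sender can simply reuse or roll back the number before anyone else touches the counter; with multiple threads another sender may already have claimed the next value, which is why the gap cannot be repaired after the fact.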
>>>>
>>>> On Jan 8, 2014, at 01:15, Shamis, Pavel <sham...@ornl.gov> wrote:
>>>>
>>>>> Overall it looks good. It would be helpful to validate performance
>>>>> numbers for other interconnects as well.
>>>>>
>>>>> -Pasha
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Nathan Hjelm
>>>>>> Sent: Tuesday, January 07, 2014 6:45 PM
>>>>>> To: Open MPI Developers List
>>>>>> Subject: [OMPI devel] RFC: OB1 optimizations
>>>>>>
>>>>>> What: Push some ob1 optimizations to the trunk and 1.7.5.
>>>>>>
>>>>>> What: This patch contains two optimizations:
>>>>>>
>>>>>> - Introduce a fast send path for blocking send calls. This path uses
>>>>>>   the btl sendi function to put the data on the wire without the need
>>>>>>   for setting up a send request. In the case of btl/vader this can
>>>>>>   also avoid allocating/initializing a new fragment. With btl/vader
>>>>>>   this optimization improves small-message latency by 50-200 ns in
>>>>>>   ping-pong type benchmarks. Larger messages may take a small hit in
>>>>>>   the range of 10-20 ns.
>>>>>>
>>>>>> - Use a stack-allocated receive request for blocking receives. This
>>>>>>   optimization saves the extra instructions associated with accessing
>>>>>>   the receive request free list. I was able to get another 50-200 ns
>>>>>>   improvement in the small-message ping-pong with this optimization.
>>>>>>   I see no hit for larger messages.
>>>>>>
>>>>>> When: These changes touch the critical path in ob1 and are targeted
>>>>>> for 1.7.5. As such I will set a moderately long timeout. Timeout set
>>>>>> for next Friday (Jan 17).
>>>>>>
>>>>>> Some results from osu_latency on Haswell:
>>>>>>
>>>>>> [hjelmn@cn143 pt2pt]$ mpirun -n 2 --bind-to core -mca btl vader,self ./osu_latency
>>>>>> # OSU MPI Latency Test v4.0.1
>>>>>> # Size        Latency (us)
>>>>>> 0                     0.11
>>>>>> 1                     0.14
>>>>>> 2                     0.14
>>>>>> 4                     0.14
>>>>>> 8                     0.14
>>>>>> 16                    0.14
>>>>>> 32                    0.15
>>>>>> 64                    0.18
>>>>>> 128                   0.36
>>>>>> 256                   0.37
>>>>>> 512                   0.46
>>>>>> 1024                  0.56
>>>>>> 2048                  0.80
>>>>>> 4096                  1.12
>>>>>> 8192                  1.68
>>>>>> 16384                 2.98
>>>>>> 32768                 5.10
>>>>>> 65536                 8.12
>>>>>> 131072               14.07
>>>>>> 262144               25.30
>>>>>> 524288               47.40
>>>>>> 1048576              91.71
>>>>>> 2097152             195.56
>>>>>> 4194304             487.05
>>>>>>
>>>>>> Patch Attached.
>>>>>>
>>>>>> -Nathan
>>>
>>> <ob1_optimization_take3.patch>
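For readers skimming the RFC above, the following sketch shows the general shape of the sendi-style fast path it describes. It is a simplified, self-contained illustration with invented names (transport_t, send_immediate, blocking_send), not the ob1/BTL interface: a blocking send first offers the data to the transport's immediate-send hook and only falls back to building a tracked send request when that hook is missing or declines.

    /* Sketch of a sendi-style fast path (invented names, not the Open MPI API). */
    #include <stddef.h>
    #include <stdio.h>

    typedef enum { XFER_OK, XFER_WOULD_BLOCK } xfer_status_t;

    /* Stand-in for a BTL-like transport with an optional "send immediate" hook. */
    typedef struct transport {
        xfer_status_t (*send_immediate)(const void *buf, size_t len);  /* may be NULL */
        xfer_status_t (*send_with_request)(const void *buf, size_t len);
    } transport_t;

    static xfer_status_t blocking_send(transport_t *t, const void *buf, size_t len)
    {
        /* Fast path: no send request is set up and no free list is touched.
         * This is the part that buys the 50-200 ns on small messages. */
        if (NULL != t->send_immediate && XFER_OK == t->send_immediate(buf, len)) {
            return XFER_OK;
        }

        /* Slow path: build a tracked request as before.  Transports without a
         * send-immediate hook (the ugni case mentioned in the thread) always
         * land here, which is why no change is expected for them. */
        return t->send_with_request(buf, len);
    }

    /* Toy transport: only small messages can be sent immediately. */
    static xfer_status_t demo_sendi(const void *buf, size_t len)
    {
        (void) buf;
        return len <= 4096 ? XFER_OK : XFER_WOULD_BLOCK;
    }

    static xfer_status_t demo_send_with_request(const void *buf, size_t len)
    {
        (void) buf; (void) len;
        return XFER_OK;
    }

    int main(void)
    {
        char payload[8192] = {0};
        transport_t t = { demo_sendi, demo_send_with_request };

        blocking_send(&t, payload, 64);              /* takes the fast path */
        blocking_send(&t, payload, sizeof payload);  /* falls back to the request path */
        printf("done\n");
        return 0;
    }

The RFC's second optimization is the same trade in the other direction: for a blocking receive the request only lives for the duration of the call, so a local structure on the stack can stand in for one pulled from the receive request free list, skipping the free-list get and return entirely.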