Interestingly enough, the 4 MB latency actually improved significantly relative to the initial numbers.
-Paul

[Sent from my phone]

On Jan 8, 2014 8:50 AM, "George Bosilca" <bosi...@icl.utk.edu> wrote:
> These results are way worse than the ones you sent in your previous
> email. What is the reason?
>
>   George.
>
> On Jan 8, 2014, at 17:33 , Nathan Hjelm <hje...@lanl.gov> wrote:
>
> > Ah, good catch. A new version is attached that should eliminate the
> > race window for the multi-threaded case. Performance numbers are
> > still looking really good. We beat mvapich2 in the small-message
> > ping-pong by a good margin. See the results below. The latency
> > difference for large messages is probably due to a difference in the
> > max send size for vader vs. mvapich.
> >
> > To answer Pasha's question: I don't see a noticeable difference in
> > performance for btls with no sendi function (this includes ugni).
> > OpenIB should get a boost. I will test that once I get an allocation.
> >
> > CPU: Xeon E5-2670 @ 2.60 GHz
> >
> > Open MPI (-mca btl vader,self):
> > # OSU MPI Latency Test v4.1
> > # Size          Latency (us)
> > 0                       0.17
> > 1                       0.19
> > 2                       0.19
> > 4                       0.19
> > 8                       0.19
> > 16                      0.19
> > 32                      0.19
> > 64                      0.40
> > 128                     0.40
> > 256                     0.43
> > 512                     0.52
> > 1024                    0.67
> > 2048                    0.94
> > 4096                    1.44
> > 8192                    2.04
> > 16384                   3.47
> > 32768                   6.10
> > 65536                   9.38
> > 131072                 16.47
> > 262144                 29.63
> > 524288                 54.81
> > 1048576               106.63
> > 2097152               206.84
> > 4194304               421.26
> >
> > mvapich2 1.9:
> > # OSU MPI Latency Test
> > # Size          Latency (us)
> > 0                       0.23
> > 1                       0.23
> > 2                       0.23
> > 4                       0.23
> > 8                       0.23
> > 16                      0.28
> > 32                      0.28
> > 64                      0.39
> > 128                     0.40
> > 256                     0.40
> > 512                     0.42
> > 1024                    0.51
> > 2048                    0.71
> > 4096                    1.02
> > 8192                    1.60
> > 16384                   3.47
> > 32768                   5.05
> > 65536                   8.06
> > 131072                 14.82
> > 262144                 28.15
> > 524288                 53.69
> > 1048576               127.47
> > 2097152               235.58
> > 4194304               683.90
> >
> > -Nathan
> >
> > On Tue, Jan 07, 2014 at 06:23:13PM -0700, George Bosilca wrote:
> >> The local request is not correctly released, leading to an assert in
> >> debug mode. This is because you avoid calling
> >> MCA_PML_BASE_RECV_REQUEST_FINI, which leaves the request in an ACTIVE
> >> state, a condition carefully checked during the call to the
> >> destructor.
> >>
> >> I attached a second patch that fixes the issue above and implements
> >> a similar optimization for the blocking send.
> >>
> >> Unfortunately, this is not enough. The mca_pml_ob1_send_inline
> >> optimization is horribly wrong in the multithreaded case, as it
> >> alters the send_sequence without storing it. If you create a gap in
> >> the send_sequence, a deadlock will definitely occur. I strongly
> >> suggest you turn off the mca_pml_ob1_send_inline optimization in the
> >> multithreaded case. All the other optimizations should be safe in
> >> all cases.
> >>
> >>   George.
> >>
> >> On Jan 8, 2014, at 01:15 , Shamis, Pavel <sham...@ornl.gov> wrote:
> >>
> >>> Overall it looks good. It would be helpful to validate performance
> >>> numbers for other interconnects as well.
> >>>
> >>> -Pasha
> >>>
> >>>> -----Original Message-----
> >>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Nathan Hjelm
> >>>> Sent: Tuesday, January 07, 2014 6:45 PM
> >>>> To: Open MPI Developers List
> >>>> Subject: [OMPI devel] RFC: OB1 optimizations
> >>>>
> >>>> What: Push some ob1 optimizations to the trunk and 1.7.5.
> >>>>
> >>>> Why: This patch contains two optimizations:
> >>>>
> >>>> - Introduce a fast send path for blocking send calls. This path
> >>>>   uses the btl sendi function to put the data on the wire without
> >>>>   the need for setting up a send request. In the case of btl/vader
> >>>>   this can also avoid allocating/initializing a new fragment. With
> >>>>   btl/vader this optimization improves small-message latency by
> >>>>   50-200 ns in ping-pong type benchmarks. Larger messages may take
> >>>>   a small hit in the range of 10-20 ns.
> >>>>
> >>>> - Use a stack-allocated receive request for blocking receives. This
> >>>>   optimization saves the extra instructions associated with
> >>>>   accessing the receive-request free list. I was able to get
> >>>>   another 50-200 ns improvement in the small-message ping-pong with
> >>>>   this optimization. I see no hit for larger messages.
> >>>>
> >>>> When: These changes touch the critical path in ob1 and are targeted
> >>>> for 1.7.5. As such I will set a moderately long timeout: next
> >>>> Friday (Jan 17).
> >>>>
> >>>> Some results from osu_latency on Haswell:
> >>>>
> >>>> [hjelmn@cn143 pt2pt]$ mpirun -n 2 --bind-to core -mca btl vader,self ./osu_latency
> >>>> # OSU MPI Latency Test v4.0.1
> >>>> # Size          Latency (us)
> >>>> 0                       0.11
> >>>> 1                       0.14
> >>>> 2                       0.14
> >>>> 4                       0.14
> >>>> 8                       0.14
> >>>> 16                      0.14
> >>>> 32                      0.15
> >>>> 64                      0.18
> >>>> 128                     0.36
> >>>> 256                     0.37
> >>>> 512                     0.46
> >>>> 1024                    0.56
> >>>> 2048                    0.80
> >>>> 4096                    1.12
> >>>> 8192                    1.68
> >>>> 16384                   2.98
> >>>> 32768                   5.10
> >>>> 65536                   8.12
> >>>> 131072                 14.07
> >>>> 262144                 25.30
> >>>> 524288                 47.40
> >>>> 1048576                91.71
> >>>> 2097152               195.56
> >>>> 4194304               487.05
> >>>>
> >>>> Patch attached.
> >>>>
> >>>> -Nathan
> >
> > <ob1_optimization_take3.patch>
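
For context on the fast-path discussion above, a minimal sketch of the idea follows. This is not the attached ob1 patch: the types and names (peer_t, blocking_send, send_via_request) are invented for illustration, and the real code in ompi/mca/pml/ob1 also handles matching, datatype packing, BTL descriptor flags, and atomic sequence updates that are elided here.

    /*
     * Minimal sketch (not the attached patch) of a blocking-send fast
     * path that tries the btl's send-immediate (sendi) hook before
     * falling back to a full send request.  All names are hypothetical.
     */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct endpoint endpoint_t;

    /* sendi hook: returns true if the btl accepted the data immediately. */
    typedef bool (*sendi_fn_t)(endpoint_t *ep, const void *buf, size_t len,
                               uint16_t seq);

    typedef struct {
        sendi_fn_t  btl_sendi;      /* NULL if the btl provides no sendi   */
        endpoint_t *endpoint;
        uint16_t    send_sequence;  /* per-peer ordering counter           */
    } peer_t;

    /* Slow path: in the real PML this allocates a send request from the
     * free list, packs the datatype, and schedules the fragments.  Here
     * it is a stub so the sketch stays self-contained. */
    static int send_via_request(peer_t *peer, const void *buf, size_t len,
                                uint16_t seq)
    {
        (void) peer; (void) buf; (void) len; (void) seq;
        return 0;
    }

    int blocking_send(peer_t *peer, const void *buf, size_t len)
    {
        /* Compute, but do not yet publish, the next sequence number.
         * The counter is only advanced once some message actually
         * carries the value; otherwise the receiver waits forever for
         * the missing sequence number (the gap/deadlock George
         * describes).  In a multithreaded build this read-modify-write
         * would additionally have to be atomic. */
        uint16_t seq = (uint16_t)(peer->send_sequence + 1);

        if (NULL != peer->btl_sendi &&
            peer->btl_sendi(peer->endpoint, buf, len, seq)) {
            peer->send_sequence = seq;  /* fast path: no send request used */
            return 0;
        }

        /* sendi declined (no space, message too large, or no sendi at
         * all): fall back to the request path with the *same* sequence
         * number so no gap is created. */
        int rc = send_via_request(peer, buf, len, seq);
        if (0 == rc) {
            peer->send_sequence = seq;
        }
        return rc;
    }

The latency win Nathan reports comes from skipping the send-request allocation and initialization on the common path. George's objection maps onto the comment above: if the sequence counter is advanced for a message that never carries that value, the receiver stalls on the missing sequence number, so the fallback must reuse (or never consume) the same value, and in a threaded build the counter update itself needs atomic handling.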