Looks like it slowed down by about 20ns from the original patch. That is to be expected when latencies are this low. Results for the following are attached:
- Trunk r30215 sm and vader results for osu_latency.
- Trunk r30215 + patch take3 for both sm and vader.
- Trunk r30215 + patch + forced 16 byte match header for vader.

The last one is not completely surprising. The current match header is 14
bytes, which means the memcpy for the data is not aligned for a 64-bit
architecture. Might be worth looking at bumping the match header size up as
another optimization (see the alignment sketch at the end of this message).

-Nathan

On Fri, Jan 10, 2014 at 02:24:19PM +0100, George Bosilca wrote:
> Nathan,
>
> When you get access to the machine it might be interesting to show not only
> the after-patch performance but also what the trunk is getting on the same
> architecture.
>
> George.
>
> On Jan 8, 2014, at 18:09 , Nathan Hjelm <hje...@lanl.gov> wrote:
>
> > Yeah. It's hard to say what the results will look like on Haswell. I
> > expect they should show some improvement from George's change, but we
> > won't know until I can get to a Haswell node. Hopefully one becomes
> > available today.
> >
> > -Nathan
> >
> > On Wed, Jan 08, 2014 at 08:59:34AM -0800, Paul Hargrove wrote:
> >> Nevermind, since Nathan just clarified that the results are not
> >> comparable.
> >>
> >> -Paul [Sent from my phone]
> >>
> >> On Jan 8, 2014 8:58 AM, "Paul Hargrove" <phhargr...@lbl.gov> wrote:
> >>
> >> Interestingly enough, the 4MB latency actually improved significantly
> >> relative to the initial numbers.
> >>
> >> -Paul [Sent from my phone]
> >>
> >> On Jan 8, 2014 8:50 AM, "George Bosilca" <bosi...@icl.utk.edu> wrote:
> >>
> >> These results are way worse than the ones you sent in your previous
> >> email? What is the reason?
> >>
> >> George.
> >>
> >> On Jan 8, 2014, at 17:33 , Nathan Hjelm <hje...@lanl.gov> wrote:
> >>
> >>> Ah, good catch. A new version is attached that should eliminate the
> >>> race window for the multi-threaded case. Performance numbers are still
> >>> looking really good. We beat mvapich2 in the small-message ping-pong
> >>> by a good margin. See the results below. The latency difference for
> >>> large messages is probably due to a difference in the max send size
> >>> for vader vs. mvapich.
> >>>
> >>> To answer Pasha's question: I don't see a noticeable difference in
> >>> performance for btls with no sendi function (this includes ugni).
> >>> OpenIB should get a boost. I will test that once I get an allocation.
> >>>
> >>> CPU: Xeon E5-2670 @ 2.60 GHz
> >>>
> >>> Open MPI (-mca btl vader,self):
> >>> # OSU MPI Latency Test v4.1
> >>> # Size          Latency (us)
> >>> 0                       0.17
> >>> 1                       0.19
> >>> 2                       0.19
> >>> 4                       0.19
> >>> 8                       0.19
> >>> 16                      0.19
> >>> 32                      0.19
> >>> 64                      0.40
> >>> 128                     0.40
> >>> 256                     0.43
> >>> 512                     0.52
> >>> 1024                    0.67
> >>> 2048                    0.94
> >>> 4096                    1.44
> >>> 8192                    2.04
> >>> 16384                   3.47
> >>> 32768                   6.10
> >>> 65536                   9.38
> >>> 131072                 16.47
> >>> 262144                 29.63
> >>> 524288                 54.81
> >>> 1048576               106.63
> >>> 2097152               206.84
> >>> 4194304               421.26
> >>>
> >>> mvapich2 1.9:
> >>> # OSU MPI Latency Test
> >>> # Size          Latency (us)
> >>> 0                       0.23
> >>> 1                       0.23
> >>> 2                       0.23
> >>> 4                       0.23
> >>> 8                       0.23
> >>> 16                      0.28
> >>> 32                      0.28
> >>> 64                      0.39
> >>> 128                     0.40
> >>> 256                     0.40
> >>> 512                     0.42
> >>> 1024                    0.51
> >>> 2048                    0.71
> >>> 4096                    1.02
> >>> 8192                    1.60
> >>> 16384                   3.47
> >>> 32768                   5.05
> >>> 65536                   8.06
> >>> 131072                 14.82
> >>> 262144                 28.15
> >>> 524288                 53.69
> >>> 1048576               127.47
> >>> 2097152               235.58
> >>> 4194304               683.90
> >>>
> >>> -Nathan
> >>>
> >>> On Tue, Jan 07, 2014 at 06:23:13PM -0700, George Bosilca wrote:
> >>>> The local request is not correctly released, leading to an assert in
> >>>> debug mode. This is because you avoid calling
> >>>> MCA_PML_BASE_RECV_REQUEST_FINI, a fact that leaves the request in an
> >>>> ACTIVE state, a condition carefully checked during the call to the
> >>>> destructor.
> >>>>
> >>>> I attached a second patch that fixes the issue above and implements a
> >>>> similar optimization for the blocking send.
> >>>>
> >>>> Unfortunately, this is not enough. The mca_pml_ob1_send_inline
> >>>> optimization is horribly wrong in the multithreaded case, as it alters
> >>>> the send_sequence without storing it. If you create a gap in the
> >>>> send_sequence a deadlock will __definitively__ occur. I strongly
> >>>> suggest you turn off the mca_pml_ob1_send_inline optimization in the
> >>>> multithreaded case. All the other optimizations should be safe in all
> >>>> cases.
> >>>>
> >>>> George.
> >>>>
> >>>> On Jan 8, 2014, at 01:15 , Shamis, Pavel <sham...@ornl.gov> wrote:
> >>>>
> >>>>> Overall it looks good. It would be helpful to validate performance
> >>>>> numbers for other interconnects as well.
> >>>>>
> >>>>> -Pasha
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Nathan
> >>>>>> Hjelm
> >>>>>> Sent: Tuesday, January 07, 2014 6:45 PM
> >>>>>> To: Open MPI Developers List
> >>>>>> Subject: [OMPI devel] RFC: OB1 optimizations
> >>>>>>
> >>>>>> What: Push some ob1 optimizations to the trunk and 1.7.5.
> >>>>>>
> >>>>>> What: This patch contains two optimizations:
> >>>>>>
> >>>>>> - Introduce a fast send path for blocking send calls. This path uses
> >>>>>>   the btl sendi function to put the data on the wire without the
> >>>>>>   need for setting up a send request. In the case of btl/vader this
> >>>>>>   can also avoid allocating/initializing a new fragment. With
> >>>>>>   btl/vader this optimization improves small-message latency by
> >>>>>>   50-200ns in ping-pong type benchmarks. Larger messages may take a
> >>>>>>   small hit in the range of 10-20ns.
> >>>>>>
> >>>>>> - Use a stack-allocated receive request for blocking receives. This
> >>>>>>   optimization saves the extra instructions associated with
> >>>>>>   accessing the receive request free list. I was able to get another
> >>>>>>   50-200ns improvement in the small-message ping-pong with this
> >>>>>>   optimization.
> >>>>>>   I see no hit for larger messages.
> >>>>>>
> >>>>>> When: These changes touch the critical path in ob1 and are targeted
> >>>>>> for 1.7.5. As such I will set a moderately long timeout. Timeout set
> >>>>>> for next Friday (Jan 17).
> >>>>>>
> >>>>>> Some results from osu_latency on haswell:
> >>>>>>
> >>>>>> [hjelmn@cn143 pt2pt]$ mpirun -n 2 --bind-to core -mca btl vader,self
> >>>>>> ./osu_latency
> >>>>>> # OSU MPI Latency Test v4.0.1
> >>>>>> # Size          Latency (us)
> >>>>>> 0                       0.11
> >>>>>> 1                       0.14
> >>>>>> 2                       0.14
> >>>>>> 4                       0.14
> >>>>>> 8                       0.14
> >>>>>> 16                      0.14
> >>>>>> 32                      0.15
> >>>>>> 64                      0.18
> >>>>>> 128                     0.36
> >>>>>> 256                     0.37
> >>>>>> 512                     0.46
> >>>>>> 1024                    0.56
> >>>>>> 2048                    0.80
> >>>>>> 4096                    1.12
> >>>>>> 8192                    1.68
> >>>>>> 16384                   2.98
> >>>>>> 32768                   5.10
> >>>>>> 65536                   8.12
> >>>>>> 131072                 14.07
> >>>>>> 262144                 25.30
> >>>>>> 524288                 47.40
> >>>>>> 1048576                91.71
> >>>>>> 2097152               195.56
> >>>>>> 4194304               487.05
> >>>>>>
> >>>>>> Patch Attached.
> >>>>>>
> >>>>>> -Nathan
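To make the sendi fast path and George's send_sequence warning above concrete, here is a minimal sketch in C. Every name in it (peer_t, btl_sendi_try, send_with_request) is a hypothetical placeholder rather than an actual mca_pml_ob1 symbol; the only point is to show where a sequence-number gap can arise when the inline path has to fall back.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-peer state: ob1 matches a peer's messages on the
 * receive side strictly in send_sequence order. */
typedef struct {
    int32_t send_sequence;
} peer_t;

/* Hypothetical stand-ins for the btl sendi fast path and the regular
 * request-based send path. */
bool btl_sendi_try(peer_t *peer, const void *buf, size_t len, uint16_t seq);
int  send_with_request(peer_t *peer, const void *buf, size_t len,
                       uint16_t seq);

int blocking_send(peer_t *peer, const void *buf, size_t len)
{
    /* Claim the next sequence number for this peer.  In a multithreaded
     * build this counter is shared with other senders to the same peer. */
    uint16_t seq = (uint16_t) __atomic_add_fetch(&peer->send_sequence, 1,
                                                 __ATOMIC_RELAXED);

    /* Fast path: hand the data to the btl's sendi function, skipping the
     * allocation and setup of a full send request. */
    if (btl_sendi_try(peer, buf, len, seq)) {
        return 0;
    }

    /* Fallback: the claimed sequence number must still go on the wire.
     * If the fallback allocated a fresh number instead, `seq` would never
     * be sent, the receiver would wait for it forever, and matching for
     * this peer would deadlock; that is the gap described above. */
    return send_with_request(peer, buf, len, seq);
}

Either reusing the claimed number in the fallback, as sketched, or deferring the increment until the inline send is known to succeed avoids the gap; which approach the actual patch takes is not shown here.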
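In the same spirit, a rough sketch of the stack-allocated receive request optimization described in the RFC above. The request type and the init/start/wait/fini helpers are illustrative only; the real code goes through the ob1 receive-request machinery, including the MCA_PML_BASE_RECV_REQUEST_FINI step George points out.

#include <stddef.h>

/* Illustrative stand-in for an ob1 receive request. */
typedef struct {
    void  *buffer;
    size_t length;
    int    source;
    int    tag;
    int    status;
} recv_request_t;

/* Hypothetical helpers mirroring the init/start/wait/fini steps of a
 * receive request. */
void recv_request_init(recv_request_t *req, void *buf, size_t len,
                       int src, int tag);
void recv_request_start(recv_request_t *req);
void recv_request_wait(recv_request_t *req);
void recv_request_fini(recv_request_t *req);

int blocking_recv(void *buf, size_t len, int src, int tag)
{
    /* A blocking receive never outlives the call, so the request can live
     * on the stack: no free-list dequeue/enqueue and none of the atomics
     * that go with it on the critical path. */
    recv_request_t req;

    recv_request_init(&req, buf, len, src, tag);
    recv_request_start(&req);
    recv_request_wait(&req);   /* returns once the message has arrived */

    /* The request must still be moved out of its ACTIVE state before it
     * disappears (cf. MCA_PML_BASE_RECV_REQUEST_FINI above). */
    recv_request_fini(&req);

    return req.status;
}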
Attached osu_latency results (configurations listed at the top of the message):

# OSU MPI Latency Test v4.0.1
# Size          Latency (us)
0                       0.25
1                       0.27
2                       0.28
4                       0.27
8                       0.28
16                      0.27
32                      0.29
64                      0.29
128                     0.33
256                     0.34
512                     0.39
1024                    0.45
2048                    0.60
4096                    1.00
8192                    1.43
16384                   2.68
32768                   4.63
65536                   7.23
131072                 13.33
262144                 23.96
524288                 44.36
1048576                85.01
2097152               180.63
4194304               456.07
# OSU MPI Latency Test v4.0.1
# Size          Latency (us)
0                       0.18
1                       0.23
2                       0.23
4                       0.24
8                       0.24
16                      0.24
32                      0.24
64                      0.26
128                     0.40
256                     0.42
512                     0.49
1024                    0.58
2048                    0.82
4096                    1.13
8192                    1.68
16384                   2.97
32768                   5.08
65536                   7.98
131072                 14.05
262144                 25.16
524288                 47.46
1048576                91.99
2097152               192.97
4194304               493.23
# OSU MPI Latency Test v4.0.1
# Size          Latency (us)
0                       0.14
1                       0.15
2                       0.15
4                       0.15
8                       0.15
16                      0.16
32                      0.16
64                      0.19
128                     0.33
256                     0.35
512                     0.48
1024                    0.60
2048                    0.84
4096                    1.13
8192                    1.71
16384                   2.95
32768                   5.08
65536                   8.07
131072                 14.06
262144                 25.37
524288                 47.58
1048576                91.82
2097152               201.21
4194304               549.99
# OSU MPI Latency Test v4.0.1
# Size          Latency (us)
0                       0.21
1                       0.23
2                       0.23
4                       0.23
8                       0.23
16                      0.23
32                      0.24
64                      0.24
128                     0.28
256                     0.30
512                     0.38
1024                    0.43
2048                    0.60
4096                    0.99
8192                    1.42
16384                   2.68
32768                   4.60
65536                   7.43
131072                 13.27
262144                 24.07
524288                 44.47
1048576                85.09
2097152               180.43
4194304               457.18
# OSU MPI Latency Test v4.0.1
# Size          Latency (us)
0                       0.13
1                       0.16
2                       0.16
4                       0.16
8                       0.16
16                      0.16
32                      0.17
64                      0.20
128                     0.36
256                     0.35
512                     0.43
1024                    0.47
2048                    0.64
4096                    1.18
8192                    1.72
16384                   3.04
32768                   5.20
65536                   8.13
131072                 14.17
262144                 25.34
524288                 47.57
1048576                91.64
2097152               201.91
4194304               547.44
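Back to the match-header alignment point at the top of this message: a minimal sketch, not the actual ob1/vader code, of why a 14-byte header leaves the payload copy misaligned while a 16-byte header keeps it 8-byte aligned. The constants and the function are illustrative.

#include <stddef.h>
#include <string.h>

#define HDR_LEN_CURRENT 14   /* current match header length            */
#define HDR_LEN_PADDED  16   /* header padded to a multiple of 8 bytes */

/* Copy user data into a fragment buffer directly behind the header.
 * The fragment itself is assumed to be at least 8-byte aligned. */
static void pack_payload(unsigned char *frag, size_t hdr_len,
                         const void *payload, size_t len)
{
    /* With hdr_len == 14 the destination frag + 14 is never 8-byte
     * aligned; with hdr_len == 16 the payload copy starts on an 8-byte
     * boundary, which memcpy can exploit on 64-bit architectures. */
    memcpy(frag + hdr_len, payload, len);
}

The cost of the padding is two extra bytes per message; the benefit is an aligned payload copy, which is presumably what the forced-16-byte-header results above are showing.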