These results are way worse than the ones you sent in your previous email. What is the reason?
George.

On Jan 8, 2014, at 17:33 , Nathan Hjelm <hje...@lanl.gov> wrote:

> Ah, good catch. A new version is attached that should eliminate the race
> window for the multi-threaded case. Performance numbers are still looking
> really good. We beat mvapich2 in the small-message ping-pong by a good
> margin. See the results below. The latency difference for large messages
> is probably due to a difference in the max send size for vader vs mvapich.
>
> To answer Pasha's question: I don't see a noticeable difference in
> performance for btls with no sendi function (this includes ugni). OpenIB
> should get a boost. I will test that once I get an allocation.
>
> CPU: Xeon E5-2670 @ 2.60 GHz
>
> Open MPI (-mca btl vader,self):
> # OSU MPI Latency Test v4.1
> # Size        Latency (us)
> 0                     0.17
> 1                     0.19
> 2                     0.19
> 4                     0.19
> 8                     0.19
> 16                    0.19
> 32                    0.19
> 64                    0.40
> 128                   0.40
> 256                   0.43
> 512                   0.52
> 1024                  0.67
> 2048                  0.94
> 4096                  1.44
> 8192                  2.04
> 16384                 3.47
> 32768                 6.10
> 65536                 9.38
> 131072               16.47
> 262144               29.63
> 524288               54.81
> 1048576             106.63
> 2097152             206.84
> 4194304             421.26
>
> mvapich2 1.9:
> # OSU MPI Latency Test
> # Size        Latency (us)
> 0                     0.23
> 1                     0.23
> 2                     0.23
> 4                     0.23
> 8                     0.23
> 16                    0.28
> 32                    0.28
> 64                    0.39
> 128                   0.40
> 256                   0.40
> 512                   0.42
> 1024                  0.51
> 2048                  0.71
> 4096                  1.02
> 8192                  1.60
> 16384                 3.47
> 32768                 5.05
> 65536                 8.06
> 131072               14.82
> 262144               28.15
> 524288               53.69
> 1048576             127.47
> 2097152             235.58
> 4194304             683.90
>
> -Nathan
>
> On Tue, Jan 07, 2014 at 06:23:13PM -0700, George Bosilca wrote:
>> The local request is not correctly released, leading to an assert in
>> debug mode. This is because you avoid calling
>> MCA_PML_BASE_RECV_REQUEST_FINI, which leaves the request in an ACTIVE
>> state, a condition carefully checked in the destructor.
>>
>> I attached a second patch that fixes the issue above and implements a
>> similar optimization for the blocking send.
>>
>> Unfortunately, this is not enough. The mca_pml_ob1_send_inline
>> optimization is horribly wrong in the multithreaded case, as it alters
>> the send_sequence without storing it. If you create a gap in the
>> send_sequence, a deadlock will __definitely__ occur. I strongly suggest
>> you turn off the mca_pml_ob1_send_inline optimization in the
>> multithreaded case. All the other optimizations should be safe in all
>> cases.
>>
>> George.
>>
>> On Jan 8, 2014, at 01:15 , Shamis, Pavel <sham...@ornl.gov> wrote:
>>
>>> Overall it looks good. It would be helpful to validate performance
>>> numbers for other interconnects as well.
>>>
>>> -Pasha
>>>
>>>> -----Original Message-----
>>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Nathan
>>>> Hjelm
>>>> Sent: Tuesday, January 07, 2014 6:45 PM
>>>> To: Open MPI Developers List
>>>> Subject: [OMPI devel] RFC: OB1 optimizations
>>>>
>>>> What: Push some ob1 optimizations to the trunk and 1.7.5.
>>>>
>>>> What: This patch contains two optimizations:
>>>>
>>>> - Introduce a fast send path for blocking send calls. This path uses
>>>>   the btl sendi function to put the data on the wire without the need
>>>>   for setting up a send request (see the sketch after this list). In
>>>>   the case of btl/vader this can also avoid allocating/initializing a
>>>>   new fragment. With btl/vader this optimization improves small-message
>>>>   latency by 50-200ns in ping-pong type benchmarks. Larger messages
>>>>   may take a small hit in the range of 10-20ns.
>>>>
>>>> - Use a stack-allocated receive request for blocking receives. This
>>>>   optimization saves the extra instructions associated with accessing
>>>>   the receive request free list. I was able to get another 50-200ns
>>>>   improvement in the small-message ping-pong with this optimization.
>>>>   I see no hit for larger messages.
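[Editor's note: the fast path in the first bullet can be pictured with a short, self-contained C sketch. All names here (btl_t, sendi, send_via_request, toy_sendi, the 64-byte eager limit) are hypothetical stand-ins chosen for illustration, not Open MPI's internal API or the attached patch. The point is only the shape of the optimization: try the BTL's "send immediate" hook first, and fall back to the full send-request path when it declines.]

    /* Minimal sketch of a sendi-style fast path for a blocking send.
     * All types and names are hypothetical, not Open MPI internals. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    typedef struct btl {
        /* Optional "send immediate" hook: returns true if the payload was
         * put on the wire without needing a full send request/fragment. */
        bool (*sendi)(struct btl *btl, const void *buf, size_t len);
    } btl_t;

    /* Slow path: allocate and drive a full send request (stubbed here). */
    static int send_via_request(btl_t *btl, const void *buf, size_t len) {
        (void)btl; (void)buf;
        printf("slow path: %zu bytes via send request\n", len);
        return 0;
    }

    /* Blocking send fast path: try sendi first, fall back to the request
     * path. Skipping request setup for small messages is what saves the
     * ~50-200ns reported in this thread. */
    static int blocking_send(btl_t *btl, const void *buf, size_t len) {
        if (btl->sendi != NULL && btl->sendi(btl, buf, len)) {
            return 0;                 /* data is already on the wire */
        }
        return send_via_request(btl, buf, len);
    }

    /* Toy sendi that accepts anything fitting a pretend eager buffer. */
    static bool toy_sendi(btl_t *btl, const void *buf, size_t len) {
        (void)btl; (void)buf;
        return len <= 64;             /* pretend 64 bytes is the eager limit */
    }

    int main(void) {
        btl_t btl = { .sendi = toy_sendi };
        char small[8] = "hi";
        char large[4096] = { 0 };
        blocking_send(&btl, small, sizeof small);  /* takes the fast path */
        blocking_send(&btl, large, sizeof large);  /* falls back */
        return 0;
    }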
>>>>
>>>> When: These changes touch the critical path in ob1 and are targeted
>>>> for 1.7.5. As such I will set a moderately long timeout. Timeout set
>>>> for next Friday (Jan 17).
>>>>
>>>> Some results from osu_latency on haswell:
>>>>
>>>> [hjelmn@cn143 pt2pt]$ mpirun -n 2 --bind-to core -mca btl vader,self
>>>> ./osu_latency
>>>> # OSU MPI Latency Test v4.0.1
>>>> # Size        Latency (us)
>>>> 0                     0.11
>>>> 1                     0.14
>>>> 2                     0.14
>>>> 4                     0.14
>>>> 8                     0.14
>>>> 16                    0.14
>>>> 32                    0.15
>>>> 64                    0.18
>>>> 128                   0.36
>>>> 256                   0.37
>>>> 512                   0.46
>>>> 1024                  0.56
>>>> 2048                  0.80
>>>> 4096                  1.12
>>>> 8192                  1.68
>>>> 16384                 2.98
>>>> 32768                 5.10
>>>> 65536                 8.12
>>>> 131072               14.07
>>>> 262144               25.30
>>>> 524288               47.40
>>>> 1048576              91.71
>>>> 2097152             195.56
>>>> 4194304             487.05
>>>>
>>>> Patch Attached.
>>>>
>>>> -Nathan
>
> <ob1_optimization_take3.patch>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
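[Editor's note: George's warning above, that mca_pml_ob1_send_inline alters the send_sequence without storing it, deserves a concrete illustration. Below is a minimal C11 sketch of the gap hazard, assuming an in-order delivery protocol; the names (send_sequence, try_inline_send_buggy, send_safe, btl_can_send_now) are illustrative stand-ins, not ob1's actual code.]

    /* Sketch of the sequence-gap hazard. In an in-order protocol the
     * receiver delivers message N only after it has seen N-1, so a
     * sequence number that is consumed but never sent stalls every
     * message behind it -- the deadlock George warns about. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static atomic_uint send_sequence;

    /* Buggy inline-send pattern: the counter is advanced before we know
     * the fast path can actually deliver the message. */
    static bool try_inline_send_buggy(bool btl_can_send_now) {
        unsigned seq = atomic_fetch_add(&send_sequence, 1); /* seq consumed */
        if (!btl_can_send_now) {
            /* BUG: we bail out without sending seq and without handing it
             * to the fallback path, leaving a permanent gap. */
            return false;
        }
        printf("inline send, sequence %u\n", seq);
        return true;
    }

    /* Safe pattern: every drawn sequence number is sent on some path, so
     * no gap can appear regardless of which path carries the message. */
    static void send_safe(bool btl_can_send_now) {
        unsigned seq = atomic_fetch_add(&send_sequence, 1);
        if (btl_can_send_now) {
            printf("inline send, sequence %u\n", seq);
        } else {
            printf("fallback send, sequence %u\n", seq); /* seq still used */
        }
    }

    int main(void) {
        try_inline_send_buggy(false); /* consumes sequence 0, never sends it */
        send_safe(true);              /* sends sequence 1 */
        /* An in-order receiver now waits forever for sequence 0. */
        return 0;
    }

[The design point: draw a sequence number only on a path that is guaranteed to consume it, so every number is eventually matched on the receive side.]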