Nathan,

When you get access to the machine, it might be interesting to show not only the after-patch performance but also what the trunk gets on the same architecture.
  George.

On Jan 8, 2014, at 18:09, Nathan Hjelm <hje...@lanl.gov> wrote:

> Yeah. It's hard to say what the results will look like on Haswell. I
> expect they should show some improvement from George's change, but we
> won't know until I can get to a Haswell node. Hopefully one becomes
> available today.
>
> -Nathan
>
> On Wed, Jan 08, 2014 at 08:59:34AM -0800, Paul Hargrove wrote:
>> Nevermind, since Nathan just clarified that the results are not
>> comparable.
>>
>> -Paul [Sent from my phone]
>>
>> On Jan 8, 2014 8:58 AM, "Paul Hargrove" <phhargr...@lbl.gov> wrote:
>>
>> Interestingly enough, the 4MB latency actually improved significantly
>> relative to the initial numbers.
>>
>> -Paul [Sent from my phone]
>>
>> On Jan 8, 2014 8:50 AM, "George Bosilca" <bosi...@icl.utk.edu> wrote:
>>
>> These results are much worse than the ones you sent in your previous
>> email. What is the reason?
>>
>> George.
>>
>> On Jan 8, 2014, at 17:33, Nathan Hjelm <hje...@lanl.gov> wrote:
>>
>>> Ah, good catch. A new version is attached that should eliminate the
>>> race window for the multi-threaded case. Performance numbers are still
>>> looking really good. We beat mvapich2 in the small-message ping-pong
>>> by a good margin. See the results below. The latency difference for
>>> large messages is probably due to a difference in the max send size
>>> for vader vs. mvapich.
>>>
>>> To answer Pasha's question: I don't see a noticeable difference in
>>> performance for BTLs with no sendi function (this includes ugni).
>>> OpenIB should get a boost. I will test that once I get an allocation.
>>>
>>> CPU: Xeon E5-2670 @ 2.60 GHz
>>>
>>> Open MPI (-mca btl vader,self):
>>> # OSU MPI Latency Test v4.1
>>> # Size        Latency (us)
>>> 0                     0.17
>>> 1                     0.19
>>> 2                     0.19
>>> 4                     0.19
>>> 8                     0.19
>>> 16                    0.19
>>> 32                    0.19
>>> 64                    0.40
>>> 128                   0.40
>>> 256                   0.43
>>> 512                   0.52
>>> 1024                  0.67
>>> 2048                  0.94
>>> 4096                  1.44
>>> 8192                  2.04
>>> 16384                 3.47
>>> 32768                 6.10
>>> 65536                 9.38
>>> 131072               16.47
>>> 262144               29.63
>>> 524288               54.81
>>> 1048576             106.63
>>> 2097152             206.84
>>> 4194304             421.26
>>>
>>> mvapich2 1.9:
>>> # OSU MPI Latency Test
>>> # Size        Latency (us)
>>> 0                     0.23
>>> 1                     0.23
>>> 2                     0.23
>>> 4                     0.23
>>> 8                     0.23
>>> 16                    0.28
>>> 32                    0.28
>>> 64                    0.39
>>> 128                   0.40
>>> 256                   0.40
>>> 512                   0.42
>>> 1024                  0.51
>>> 2048                  0.71
>>> 4096                  1.02
>>> 8192                  1.60
>>> 16384                 3.47
>>> 32768                 5.05
>>> 65536                 8.06
>>> 131072               14.82
>>> 262144               28.15
>>> 524288               53.69
>>> 1048576             127.47
>>> 2097152             235.58
>>> 4194304             683.90
>>>
>>> -Nathan
>>>
>>> On Tue, Jan 07, 2014 at 06:23:13PM -0700, George Bosilca wrote:
>>>> The local request is not correctly released, leading to an assert in
>>>> debug mode. This is because you avoid calling
>>>> MCA_PML_BASE_RECV_REQUEST_FINI, which leaves the request in an ACTIVE
>>>> state, a condition carefully checked during the call to the destructor.
>>>>
>>>> I attached a second patch that fixes the issue above and implements a
>>>> similar optimization for the blocking send.
>>>>
>>>> Unfortunately, this is not enough. The mca_pml_ob1_send_inline
>>>> optimization is horribly wrong in the multithreaded case, as it alters
>>>> the send_sequence without storing it. If you create a gap in the
>>>> send_sequence, a deadlock will __definitely__ occur. I strongly suggest
>>>> you turn off the mca_pml_ob1_send_inline optimization in the
>>>> multithreaded case. All the other optimizations should be safe in all
>>>> cases.
>>>>
>>>> George.
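To make the failure mode above concrete, here is a minimal standalone sketch in plain C11. The names (try_send_inline, send_sequence, btl_can_send_now) are invented for illustration and this is not the actual ob1 code path; it only models the in-order matching assumption: once a sequence number has been claimed, it must either go on the wire or be handed to the slow-path request, otherwise the receiver waits forever for the missing number.

    /* Simplified model of the fast send path's sequence handling (not the
     * actual ob1 code; names are illustrative). */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    static _Atomic uint16_t send_sequence = 0;   /* shared per-peer counter */

    /* Fast path: claim the next sequence number, then attempt an eager send. */
    static bool try_send_inline(bool btl_can_send_now)
    {
        uint16_t seq = atomic_fetch_add(&send_sequence, 1); /* number is consumed here */

        if (btl_can_send_now) {
            printf("sent fragment with seq %u\n", (unsigned) seq);
            return true;
        }

        /* The problem being described: bailing out here without either sending
         * a fragment that carries 'seq' or handing 'seq' to the slow-path send
         * request leaves a hole in the sequence.  The receiver matches in
         * order, so it waits forever for the missing number and every later
         * message from this sender stays unmatched: a deadlock. */
        return false;
    }

    int main(void)
    {
        try_send_inline(true);    /* seq 0 goes on the wire */
        try_send_inline(false);   /* seq 1 is claimed but never sent: the gap */
        try_send_inline(true);    /* seq 2 arrives; the receiver still waits for 1 */
        return 0;
    }

In the single-threaded case the sender can simply reuse or roll back the number before anyone else touches the counter; with multiple threads another sender may already have claimed the next value, which is why the gap cannot be repaired after the fact.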
>>>>
>>>> On Jan 8, 2014, at 01:15, Shamis, Pavel <sham...@ornl.gov> wrote:
>>>>
>>>>> Overall it looks good. It would be helpful to validate performance
>>>>> numbers for other interconnects as well.
>>>>>
>>>>> -Pasha
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Nathan Hjelm
>>>>>> Sent: Tuesday, January 07, 2014 6:45 PM
>>>>>> To: Open MPI Developers List
>>>>>> Subject: [OMPI devel] RFC: OB1 optimizations
>>>>>>
>>>>>> What: Push some ob1 optimizations to the trunk and 1.7.5.
>>>>>>
>>>>>> What: This patch contains two optimizations:
>>>>>>
>>>>>> - Introduce a fast send path for blocking send calls. This path uses
>>>>>>   the btl sendi function to put the data on the wire without the need
>>>>>>   for setting up a send request. In the case of btl/vader this can
>>>>>>   also avoid allocating/initializing a new fragment. With btl/vader
>>>>>>   this optimization improves small-message latency by 50-200 ns in
>>>>>>   ping-pong type benchmarks. Larger messages may take a small hit in
>>>>>>   the range of 10-20 ns.
>>>>>>
>>>>>> - Use a stack-allocated receive request for blocking receives. This
>>>>>>   optimization saves the extra instructions associated with accessing
>>>>>>   the receive request free list. I was able to get another 50-200 ns
>>>>>>   improvement in the small-message ping-pong with this optimization.
>>>>>>   I see no hit for larger messages.
>>>>>>
>>>>>> When: These changes touch the critical path in ob1 and are targeted
>>>>>> for 1.7.5. As such I will set a moderately long timeout. Timeout set
>>>>>> for next Friday (Jan 17).
>>>>>>
>>>>>> Some results from osu_latency on Haswell:
>>>>>>
>>>>>> [hjelmn@cn143 pt2pt]$ mpirun -n 2 --bind-to core -mca btl vader,self ./osu_latency
>>>>>> # OSU MPI Latency Test v4.0.1
>>>>>> # Size        Latency (us)
>>>>>> 0                     0.11
>>>>>> 1                     0.14
>>>>>> 2                     0.14
>>>>>> 4                     0.14
>>>>>> 8                     0.14
>>>>>> 16                    0.14
>>>>>> 32                    0.15
>>>>>> 64                    0.18
>>>>>> 128                   0.36
>>>>>> 256                   0.37
>>>>>> 512                   0.46
>>>>>> 1024                  0.56
>>>>>> 2048                  0.80
>>>>>> 4096                  1.12
>>>>>> 8192                  1.68
>>>>>> 16384                 2.98
>>>>>> 32768                 5.10
>>>>>> 65536                 8.12
>>>>>> 131072               14.07
>>>>>> 262144               25.30
>>>>>> 524288               47.40
>>>>>> 1048576              91.71
>>>>>> 2097152             195.56
>>>>>> 4194304             487.05
>>>>>>
>>>>>> Patch Attached.
>>>>>>
>>>>>> -Nathan
>>>
>>> <ob1_optimization_take3.patch>
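For readers skimming the RFC above, the following sketch shows the general shape of the sendi-style fast path it describes. It is a simplified, self-contained illustration with invented names (transport_t, send_immediate, blocking_send), not the ob1/BTL interface: a blocking send first offers the data to the transport's immediate-send hook and only falls back to building a tracked send request when that hook is missing or declines.

    /* Sketch of a sendi-style fast path (invented names, not the Open MPI API). */
    #include <stddef.h>
    #include <stdio.h>

    typedef enum { XFER_OK, XFER_WOULD_BLOCK } xfer_status_t;

    /* Stand-in for a BTL-like transport with an optional "send immediate" hook. */
    typedef struct transport {
        xfer_status_t (*send_immediate)(const void *buf, size_t len);  /* may be NULL */
        xfer_status_t (*send_with_request)(const void *buf, size_t len);
    } transport_t;

    static xfer_status_t blocking_send(transport_t *t, const void *buf, size_t len)
    {
        /* Fast path: no send request is set up and no free list is touched.
         * This is the part that buys the 50-200 ns on small messages. */
        if (NULL != t->send_immediate && XFER_OK == t->send_immediate(buf, len)) {
            return XFER_OK;
        }

        /* Slow path: build a tracked request as before.  Transports without a
         * send-immediate hook (the ugni case mentioned in the thread) always
         * land here, which is why no change is expected for them. */
        return t->send_with_request(buf, len);
    }

    /* Toy transport: only small messages can be sent immediately. */
    static xfer_status_t demo_sendi(const void *buf, size_t len)
    {
        (void) buf;
        return len <= 4096 ? XFER_OK : XFER_WOULD_BLOCK;
    }

    static xfer_status_t demo_send_with_request(const void *buf, size_t len)
    {
        (void) buf; (void) len;
        return XFER_OK;
    }

    int main(void)
    {
        char payload[8192] = {0};
        transport_t t = { demo_sendi, demo_send_with_request };

        blocking_send(&t, payload, 64);              /* takes the fast path */
        blocking_send(&t, payload, sizeof payload);  /* falls back to the request path */
        printf("done\n");
        return 0;
    }

The RFC's second optimization is the same trade in the other direction: for a blocking receive the request only lives for the duration of the call, so a local structure on the stack can stand in for one pulled from the receive request free list, skipping the free-list get and return entirely.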