The local request is not correctly released, leading to an assert in debug mode. 
This happens because MCA_PML_BASE_RECV_REQUEST_FINI is never called, which 
leaves the request in an ACTIVE state, a condition that is explicitly checked 
in the destructor.

I attached a second patch that fixes the issue above and implements a similar 
optimization for the blocking send.

Unfortunately, this is not enough. The mca_pml_ob1_send_inline optimization is 
horribly wrong in the multithreaded case, as it alters the send_sequence 
without storing it. If you create a gap in the send_sequence, a deadlock will 
__definitely__ occur. I strongly suggest you turn off the 
mca_pml_ob1_send_inline optimization in the multithreaded case. All the other 
optimizations should be safe in all cases.

  George.

Attachment: ob1_optimization_take2.patch
Description: Binary data


On Jan 8, 2014, at 01:15 , Shamis, Pavel <sham...@ornl.gov> wrote:

> Overall it looks good. It would be helpful to validate performance numbers 
> for other interconnects as well.
> -Pasha
> 
>> -----Original Message-----
>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Nathan
>> Hjelm
>> Sent: Tuesday, January 07, 2014 6:45 PM
>> To: Open MPI Developers List
>> Subject: [OMPI devel] RFC: OB1 optimizations
>> 
>> What: Push some ob1 optimizations to the trunk and 1.7.5.
>> 
>> What: This patch contains two optimizations:
>> 
>>  - Introduce a fast send path for blocking send calls. This path uses
>>    the btl sendi function to put the data on the wire without the need
>>    for setting up a send request. In the case of btl/vader this can
>>    also avoid allocating/initializing a new fragment. With btl/vader
>>    this optimization improves small message latency by 50-200ns in
>>    ping-pong type benchmarks. Larger messages may take a small hit in
>>    the range of 10-20ns.
>> 
>>  - Use a stack-allocated receive request for blocking receives. This
>>    optimization saves the extra instructions associated with accessing
>>    the receive request free list. I was able to get another 50-200ns
>>    improvement in the small-message ping-pong with this optimization. I
>>    see no hit for larger messages.
>> 
>> When: These changes touch the critical path in ob1 and are targeted for
>> 1.7.5. As such I will set a moderately long timeout. Timeout set for
>> next Friday (Jan 17).
>> 
>> Some results from osu_latency on haswell:
>> 
>> hjelmn@cn143 pt2pt]$ mpirun -n 2 --bind-to core -mca btl vader,self
>> ./osu_latency
>> # OSU MPI Latency Test v4.0.1
>> # Size          Latency (us)
>> 0                       0.11
>> 1                       0.14
>> 2                       0.14
>> 4                       0.14
>> 8                       0.14
>> 16                      0.14
>> 32                      0.15
>> 64                      0.18
>> 128                     0.36
>> 256                     0.37
>> 512                     0.46
>> 1024                    0.56
>> 2048                    0.80
>> 4096                    1.12
>> 8192                    1.68
>> 16384                   2.98
>> 32768                   5.10
>> 65536                   8.12
>> 131072                 14.07
>> 262144                 25.30
>> 524288                 47.40
>> 1048576                91.71
>> 2097152               195.56
>> 4194304               487.05
>> 
>> 
>> Patch Attached.
>> 
>> -Nathan
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
