The local request is not correctly released, which leads to an assert in debug mode. This is because you avoid calling MCA_PML_BASE_RECV_REQUEST_FINI, which leaves the request in an ACTIVE state, a condition that is carefully checked during the call to the destructor.
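To make the lifecycle constraint concrete, here is a minimal, self-contained sketch (not Open MPI code; all names except MCA_PML_BASE_RECV_REQUEST_FINI are hypothetical) of a stack-allocated request whose debug destructor asserts it is no longer ACTIVE, so the FINI step cannot be skipped:

#include <assert.h>

/* Hypothetical model of the PML request lifecycle -- not the real interfaces. */
typedef enum { REQUEST_INVALID, REQUEST_INACTIVE, REQUEST_ACTIVE } request_state_t;

typedef struct { request_state_t state; } request_t;

static void request_construct (request_t *req) { req->state = REQUEST_INACTIVE; }
static void request_start     (request_t *req) { req->state = REQUEST_ACTIVE; }

/* Analogue of MCA_PML_BASE_RECV_REQUEST_FINI: return the request to a
 * releasable state once the receive has completed. */
static void request_fini      (request_t *req) { req->state = REQUEST_INACTIVE; }

/* Analogue of the destructor: in debug builds it checks that the request is
 * no longer active.  Skipping request_fini() trips this assert. */
static void request_destruct  (request_t *req) {
    assert (REQUEST_ACTIVE != req->state);
    req->state = REQUEST_INVALID;
}

int main (void) {
    request_t req;             /* stack-allocated, as in the proposed fast path */
    request_construct (&req);
    request_start (&req);
    /* ... wait for completion ... */
    request_fini (&req);       /* required: without this call the assert fires */
    request_destruct (&req);
    return 0;
}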
I attached a second patch that fixes the issue above and implements a similar optimization for the blocking send. Unfortunately, this is not enough. The mca_pml_ob1_send_inline optimization is horribly wrong in the multithreaded case, as it alters the send_sequence without storing it. If you create a gap in the send_sequence, a deadlock will __definitively__ occur. I strongly suggest you turn off the mca_pml_ob1_send_inline optimization in the multithreaded case. All the other optimizations should be safe in all cases.

  George.
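As an illustration of the send_sequence hazard described above (a simplified model, not Open MPI code; the function names and the use of C11 atomics are mine), a sender that bumps the shared sequence counter and then abandons the number creates a gap that an in-order receiver can never fill:

#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical model of the per-peer send_sequence counter. */
static atomic_uint send_sequence;

/* Broken pattern: the inline path reserves a sequence number, then gives up
 * without ever sending it.  The receiver delivers strictly in sequence order,
 * so every later message stalls behind the missing number. */
static int send_inline_broken (int btl_can_send_inline) {
    unsigned seq = atomic_fetch_add (&send_sequence, 1);
    if (!btl_can_send_inline) {
        (void) seq;            /* the reserved number is simply dropped: gap */
        return -1;
    }
    printf ("inline send of seq %u\n", seq);
    return 0;
}

/* Safe pattern: a reserved sequence number is always put on the wire, either
 * by the inline path or by handing it to the fallback request-based path. */
static int send_inline_safe (int btl_can_send_inline) {
    unsigned seq = atomic_fetch_add (&send_sequence, 1);
    if (!btl_can_send_inline) {
        printf ("fallback send of seq %u\n", seq);   /* reuse seq, no gap */
        return 0;
    }
    printf ("inline send of seq %u\n", seq);
    return 0;
}

int main (void) {
    send_inline_broken (0);    /* sequence 0 is lost */
    send_inline_safe (1);      /* sends sequence 1; receiver still waits for 0 */
    return 0;
}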
ob1_optimization_take2.patch
On Jan 8, 2014, at 01:15 , Shamis, Pavel <sham...@ornl.gov> wrote:

> Overall it looks good. It would be helpful to validate performance numbers
> for other interconnects as well.
>
> -Pasha
>
>> -----Original Message-----
>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Nathan Hjelm
>> Sent: Tuesday, January 07, 2014 6:45 PM
>> To: Open MPI Developers List
>> Subject: [OMPI devel] RFC: OB1 optimizations
>>
>> What: Push some ob1 optimizations to the trunk and 1.7.5.
>>
>> What: This patch contains two optimizations:
>>
>>  - Introduce a fast send path for blocking send calls. This path uses
>>    the btl sendi function to put the data on the wire without the need
>>    for setting up a send request. In the case of btl/vader this can
>>    also avoid allocating/initializing a new fragment. With btl/vader
>>    this optimization improves small message latency by 50-200ns in
>>    ping-pong type benchmarks. Larger messages may take a small hit in
>>    the range of 10-20ns.
>>
>>  - Use a stack-allocated receive request for blocking receives. This
>>    optimization saves the extra instructions associated with accessing
>>    the receive request free list. I was able to get another 50-200ns
>>    improvement in the small-message ping-pong with this optimization. I
>>    see no hit for larger messages.
>>
>> When: These changes touch the critical path in ob1 and are targeted for
>> 1.7.5. As such I will set a moderately long timeout. Timeout set for
>> next Friday (Jan 17).
>>
>> Some results from osu_latency on haswell:
>>
>> [hjelmn@cn143 pt2pt]$ mpirun -n 2 --bind-to core -mca btl vader,self ./osu_latency
>> # OSU MPI Latency Test v4.0.1
>> # Size          Latency (us)
>> 0                       0.11
>> 1                       0.14
>> 2                       0.14
>> 4                       0.14
>> 8                       0.14
>> 16                      0.14
>> 32                      0.15
>> 64                      0.18
>> 128                     0.36
>> 256                     0.37
>> 512                     0.46
>> 1024                    0.56
>> 2048                    0.80
>> 4096                    1.12
>> 8192                    1.68
>> 16384                   2.98
>> 32768                   5.10
>> 65536                   8.12
>> 131072                 14.07
>> 262144                 25.30
>> 524288                 47.40
>> 1048576                91.71
>> 2097152               195.56
>> 4194304               487.05
>>
>> Patch Attached.
>>
>> -Nathan
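For readers unfamiliar with the fast send path the RFC describes, the structure is roughly the following (an illustrative sketch only, with hypothetical names and signatures; the real code goes through the BTL sendi entry point and the ob1 send request machinery):

#include <stddef.h>
#include <stdio.h>

/* Hypothetical shape of an "immediate send" entry point. */
typedef int (*send_immediate_fn) (const void *buf, size_t len);

/* Slow path: build and start a full send request (free-list lookup,
 * request initialization, progress loop, ...). */
static int send_via_request (const void *buf, size_t len) {
    printf ("fallback: %zu bytes via a full send request\n", len);
    return 0;
}

/* Fast path for blocking sends, in the spirit of the RFC: let the BTL put
 * small messages on the wire directly; only fall back to the request path
 * when the immediate send cannot be performed. */
static int blocking_send (send_immediate_fn sendi, const void *buf, size_t len) {
    if (NULL != sendi && 0 == sendi (buf, len)) {
        return 0;                        /* no request was ever allocated */
    }
    return send_via_request (buf, len);  /* general path */
}

/* Toy "sendi" that only accepts small messages. */
static int demo_sendi (const void *buf, size_t len) {
    if (len > 256) return -1;            /* too large for an inline send */
    printf ("fast path: %zu bytes sent inline\n", len);
    return 0;
}

int main (void) {
    char small[64] = {0}, large[4096] = {0};
    blocking_send (demo_sendi, small, sizeof (small));  /* takes the fast path */
    blocking_send (demo_sendi, large, sizeof (large));  /* falls back */
    return 0;
}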