On Feb 4, 2009, at 6:05 PM, Eugene Loh wrote:
- Remove a function call from the critical performance path;
possibly save a little latency
The only "benefit" is "possibly a little"? This is not at all
compelling. Is the hoped-for benefit measurable? I assume a
pingpong latency test over shared memory is the only hope you have
of observing any benefit. Have you attempted to measure this, or is
this benefit only conjecture?
When I wrote it, it was pure conjecture. But your mail shamed me into
actually testing. It turns out that removing the function call buys
about 10ns of improvement. Specifically, I tested
with the following:
- http://www.open-mpi.org/hg/hgwebdir.cgi/jsquyres/fortran/ has a
WANT_C #define in both ompi/mpi/f77/send_f.c and recv_f.c. If you
hand-edit these files and set it to 1 or 0, the Fortran bindings either
layer on the C bindings or not (see the first sketch after this list).
The non-layering code is hacked up; it's just a proof of concept. For
the non-layered versions, I literally copied the guts of
ompi/mpi/c/send.c into ompi/mpi/f77/send_f.c and added the
integer-to-pointer conversion macros. Ditto for recv.c/recv_f.c. So
the C and F77 versions *should* be doing [almost] exactly the same thing.
- There's also a new top-level directory in that hg tree called "f90"
that has a 0-byte latency test program. I copied the osu-latency.c
program and turned the inner send/recv loop into Fortran. Because
there's so much latency jitter at these message sizes, I changed the
program to a) only run the 0-byte size and b) run that test 10,000
times (see the second sketch after this list).
- I configured OMPI with: ./configure --prefix=/home/jsquyres/bogus --
enable-mpirun-prefix-by-default --with-platform=optimized
- I ran on a single, otherwise-idle 4-core Wolfdale-class machine
- I ran with --mca mpi_paffinity_alone 1 --mca mpi_param_check 0 --mca
btl sm,self and saved stdout to a .csv file
- I then changed the WANT_C #define in both files, recompiled/
reinstalled just the mpi_f77 library, and re-ran the same f90 test
program (I did *NOT* recompile/relink the f90 test program -- the
*only* thing that changed was the mpi_f77 library). I ran with the
same MPI params and saved the stdout to a different .csv file.
- I took both sets of output numbers and graphed the first ~500 of
them in Excel; see attached (trying to graph all 10,000 made Excel
choke -- sigh)
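
Roughly, the two paths in send_f.c look something like the following.
This is a simplified sketch, not the actual code in the tree: the
standard MPI_*_f2c functions stand in for OMPI's internal conversion
macros, and the copied guts of the non-layered path are omitted.

/*
 * Simplified sketch of the two paths in send_f.c -- not the actual code
 * in the hg tree.  MPI_Type_f2c/MPI_Comm_f2c stand in for OMPI's
 * internal handle-conversion macros.
 */
#include "mpi.h"

#define WANT_C 1   /* hand-edit: 1 = layer on the C binding, 0 = don't */

void mpi_send_f(char *buf, MPI_Fint *count, MPI_Fint *datatype,
                MPI_Fint *dest, MPI_Fint *tag, MPI_Fint *comm,
                MPI_Fint *ierr)
{
    MPI_Datatype c_type = MPI_Type_f2c(*datatype);
    MPI_Comm c_comm = MPI_Comm_f2c(*comm);

#if WANT_C
    /* layered: one extra function call into the C binding */
    *ierr = (MPI_Fint) MPI_Send(buf, (int) *count, c_type,
                                (int) *dest, (int) *tag, c_comm);
#else
    /* non-layered: the guts of ompi/mpi/c/send.c would be copied here,
     * calling straight into the PML and skipping the MPI_Send call
     * (omitted in this sketch) */
    *ierr = (MPI_Fint) MPI_SUCCESS;
#endif
}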
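
The shape of the timing loop is roughly this simplified C skeleton; in
the actual test the inner send/recv loop is driven from Fortran, and
everything here other than the 0-byte size and the 10,000 iterations is
illustrative. It prints one half-round-trip time per iteration so stdout
can be saved straight to a .csv file.

#include <stdio.h>
#include "mpi.h"

#define ITERATIONS 10000

int main(int argc, char *argv[])
{
    char buf[1];                /* 0-byte messages; contents never used */
    int rank, size, i;
    double t0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    for (i = 0; i < ITERATIONS; ++i) {
        t0 = MPI_Wtime();
        if (0 == rank) {
            MPI_Send(buf, 0, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
            MPI_Recv(buf, 0, MPI_CHAR, 1, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            /* half the round trip, in microseconds */
            printf("%d,%f\n", i, (MPI_Wtime() - t0) * 1.0e6 / 2.0);
        } else if (1 == rank) {
            MPI_Recv(buf, 0, MPI_CHAR, 0, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, 0, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}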
End result: I guess I'm a little surprised that the difference is that
clear -- does a function call really take 10ns? I'm also surprised
that the layered C version has significantly more jitter than the non-
layered version; I can't really explain that. I'd welcome anyone else
replicating the experiment and/or eyeballing my code to make sure I didn't
bork something up.
Drawback
- Duplicate some code (but this code rarely/never changes)
It's still code bloat.
- May break MPI profiling libraries that assume that the Fortran
MPI API functions call the C MPI API functions
I'm not really familiar with the issues here, but this strikes me as a
serious drawback.
I think it would be pretty easy to have a compile-time and/or run-time
switch to decide which to use; a rough sketch of that idea is below.
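
Something like the following, purely as a sketch: the macro and
environment-variable names are made up, and in OMPI proper the run-time
knob would presumably be an MCA parameter rather than getenv(). The two
branches selected by this flag would be the layered and non-layered
paths shown earlier.

#include <stdlib.h>

#ifndef OMPI_F77_LAYER_ON_C
#define OMPI_F77_LAYER_ON_C 1          /* compile-time default */
#endif

/* Decide once, on first use, whether the F77 bindings layer on C */
static int f77_layer_on_c(void)
{
    static int decided = -1;
    if (-1 == decided) {
        const char *e = getenv("OMPI_F77_LAYER_ON_C");
        decided = (NULL != e) ? atoi(e) : OMPI_F77_LAYER_ON_C;
    }
    return decided;
}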
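
For context on the profiling-library drawback above: many PMPI-based
tools intercept only the C entry points, along these lines (an
illustrative wrapper, not taken from any particular tool):

#include <stdio.h>
#include "mpi.h"

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    /* tool-side bookkeeping would go here */
    printf("intercepted MPI_Send: count=%d dest=%d tag=%d\n",
           count, dest, tag);

    /* forward to the real implementation via the name-shifted entry */
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

A non-layered mpi_send_f that copies the guts of MPI_Send instead of
calling it would skip a wrapper like this entirely unless the tool also
intercepts the Fortran symbols -- which is exactly the concern above.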
--
Jeff Squyres
Cisco Systems