On Feb 4, 2009, at 6:05 PM, Eugene Loh wrote:

>> - Remove a function call from the critical performance path; possibly save a little latency

The only "benefit" is "possibly a little"? This is not at all compelling. Is the hoped-for benefit measurable? I assume a pingpong latency test over shared memory is the only hope you have of observing any benefit. Have you attempted to measure this, or is this benefit only conjecture?

When I wrote it, it was pure conjecture, but your mail shamed me into actually testing. It turns out that removing the function call saves about 10 ns. Specifically, I tested with the following:

- http://www.open-mpi.org/hg/hgwebdir.cgi/jsquyres/fortran/ has a WANT_C #define in both ompi/mpi/f77/send_f.c and recv_f.c. If you hand-edit these files and set it to 1 or 0, the Fortran binding either layers on the C binding or it doesn't. The non-layered code is hacked up; it's just a proof of concept. For the non-layered versions, I literally copied the guts of ompi/mpi/c/send.c into ompi/mpi/f77/send_f.c and added the integer-to-pointer conversion macros. Ditto for recv.c/recv_f.c. So the C and F77 versions *should* be doing [almost] exactly the same thing. (A simplified sketch of the two variants is after this list.)

- There's also a new top-level directory in that hg tree called "f90" that has a 0-byte latency test program. I copied the osu-latency.c program and turned the inner send/recv loop into Fortran. Because there's so much latency jitter at these sizes, I changed the program to a) only run the 0-byte size and b) run that 0-byte test 10,000 times. (A sketch of the timing loop is also after this list.)

- I configured OMPI with: ./configure --prefix=/home/jsquyres/bogus --enable-mpirun-prefix-by-default --with-platform=optimized

- I ran on a single 4-core Wolfdale-class machine that was otherwise idle

- I ran with --mca mpi_paffinity_alone 1 --mca mpi_param_check 0 --mca btl sm,self and saved stdout to a .csv file

- I then changed the WANT_C #define in both files, recompiled/reinstalled just the mpi_f77 library, and re-ran the same f90 test program (I did *NOT* recompile/relink the f90 test program -- the *only* thing that changed was the mpi_f77 library). I ran with the same MPI params and saved the stdout to a different .csv file.

- I took both sets of output numbers and graphed the first ~500 of them in Excel; see attached (trying to graph all 10,000 made Excel choke -- sigh)
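For anyone who doesn't want to dig through the hg tree, here's roughly what the WANT_C toggle in send_f.c amounts to. This is a simplified illustration, not the actual source: the wrapper symbol name, the casts, and the handle conversions are stand-ins for OMPI's real generated bindings and macros.

#include <mpi.h>

#define WANT_C 1   /* hand-edit to 0 or 1, as described above */

/* Fortran passes everything by reference as MPI_Fint integers; the
 * standard f2c routines turn the handles back into C handles. */
void mpi_send_f(char *buf, MPI_Fint *count, MPI_Fint *datatype,
                MPI_Fint *dest, MPI_Fint *tag,
                MPI_Fint *comm, MPI_Fint *ierr)
{
#if WANT_C
    /* Layered: convert the Fortran handles and make one extra function
     * call through the public C API. */
    *ierr = (MPI_Fint) MPI_Send(buf, (int) *count,
                                MPI_Type_f2c(*datatype),
                                (int) *dest, (int) *tag,
                                MPI_Comm_f2c(*comm));
#else
    /* Non-layered: the guts of ompi/mpi/c/send.c get pasted in here
     * instead, so there is no extra call through the C API.
     * (Placeholder only -- the real hacked-up code is in the hg tree.) */
    *ierr = (MPI_Fint) MPI_ERR_INTERN;
#endif
}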
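And this is the rough shape of the timing loop, shown in C for brevity -- remember that the actual test drives MPI_SEND/MPI_RECV from Fortran so the mpi_f77 wrappers stay on the critical path, and the per-iteration CSV print below is just my shorthand for how the 10,000 individual numbers ended up in the .csv file.

#include <mpi.h>
#include <stdio.h>

#define ITERS 10000   /* only the 0-byte size, 10,000 times */

int main(int argc, char **argv)
{
    int rank, size, i;
    char buf[4];   /* the buffer exists, but 0 bytes are sent */
    double t0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (2 != size) {
        if (0 == rank) {
            fprintf(stderr, "run with exactly 2 processes\n");
        }
        MPI_Finalize();
        return 1;
    }

    for (i = 0; i < ITERS; ++i) {
        t0 = MPI_Wtime();
        if (0 == rank) {
            MPI_Send(buf, 0, MPI_BYTE, 1, 1, MPI_COMM_WORLD);
            MPI_Recv(buf, 0, MPI_BYTE, 1, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            /* one CSV line per iteration: iteration, half round trip (usec) */
            printf("%d,%.3f\n", i, (MPI_Wtime() - t0) * 1.0e6 / 2.0);
        } else {
            MPI_Recv(buf, 0, MPI_BYTE, 0, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, 0, MPI_BYTE, 0, 1, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}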

End result: I guess I'm a little surprised that the difference is that clear -- does a function call really take 10 ns? I'm also surprised that the layered C version has significantly more jitter than the non-layered version; I can't really explain that. I'd welcome anyone else replicating the experiment and/or eyeballing my code to make sure I didn't bork something up.

>> Drawbacks:
>> - Duplicate some code (but this code rarely/never changes)

> It's still code bloat.

>> - May break MPI profiling libraries that assume that the Fortran MPI API functions call the C MPI API functions

> I'm not real familiar with the issues here, but this strikes me as a serious drawback.

I think it would be pretty easy to have a compile-time and/or run-time switch to decide which to use.
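Something like the following would do it (names are placeholders, not existing OMPI code; a real version would register a proper MCA parameter instead of reading an environment variable):

#include <stdlib.h>

/* Compile-time default: 1 = the Fortran wrappers layer on the C API. */
#ifndef OMPI_F77_CALL_C_DEFAULT
#define OMPI_F77_CALL_C_DEFAULT 1
#endif

/* Run-time override; a real implementation would register an MCA
 * parameter, but getenv() keeps this sketch self-contained. */
int ompi_f77_calls_c(void)
{
    const char *e = getenv("OMPI_F77_CALL_C");
    return (NULL != e) ? atoi(e) : OMPI_F77_CALL_C_DEFAULT;
}

Each Fortran wrapper would then branch on ompi_f77_calls_c(): the layered branch keeps the C-API interposition point that profiling libraries expect, and the other branch runs the copied-in code.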

--
Jeff Squyres
Cisco Systems
