On Feb 4, 2009, at 6:05 PM, Eugene Loh wrote:
- Remove a function call from the critical performance path;
possibly save a little latency
The only "benefit" is "possibly a little"? This is not at all
compelling. Is the hoped-for benefit measurable? I assume a
pingpong latency test over shared memory is the only hope you have
of observing any benefit. Have you attempted to measure this, or is
this benefit only conjecture?
When I wrote it, it was pure conjecture. But your mail shamed me into
actually testing. It turns out that removing the function call buys
about 10ns of improvement. Specifically, I tested
with the following:
- http://www.open-mpi.org/hg/hgwebdir.cgi/jsquyres/fortran/ has a
WANT_C #define in both ompi/mpi/f77/send_f.c and recv_f.c. If you
hand-edit these files and set it to 1 or 0, the Fortran bindings either
layer on the C bindings or not (see the first sketch after this list).
The non-layering code is hacked up; it's just a proof of concept. For
the non-layered versions, I literally copied the guts of
ompi/mpi/c/send.c into ompi/mpi/f77/send_f.c and added the
integer-to-pointer conversion macros. Ditto for recv.c/recv_f.c. So
the C and F77 versions *should* be doing [almost] exactly the same thing.
- There's also a new top-level directory in that hg tree called "f90"
that has a 0-byte latency test program. I copied the osu-latency.c
program and turned the inner send/recv loop into Fortran. Because
there's so much latency jitter at these message sizes, I changed the
program to a) only run the 0-byte size and b) run that test 10,000
times (see the second sketch after this list).
- I configured OMPI with: ./configure --prefix=/home/jsquyres/bogus --
enable-mpirun-prefix-by-default --with-platform=optimized
- I ran on a single, otherwise-idle 4-core Wolfdale-class machine
- I ran with --mca mpi_paffinity_alone 1 --mca mpi_param_check 0 --mca
btl sm,self and saved stdout to a .csv file
- I then changed the WANT_C #define in both files, recompiled/
reinstalled just the mpi_f77 library, and re-ran the same f90 test
program (I did *NOT* recompile/relink the f90 test program -- the
*only* thing that changed was the mpi_f77 library). I ran with the
same MPI params and saved the stdout to a different .csv file.
- I took both sets of output numbers and graphed the first ~500 of
them in Excel; see attached (trying to graph all 10,000 made Excel
choke -- sigh)
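
Roughly, the two paths in send_f.c look something like the following.
This is a simplified sketch, not the actual code in the tree: the
standard MPI_*_f2c functions stand in for OMPI's internal conversion
macros, and the copied guts of the non-layered path are omitted.

/*
 * Simplified sketch of the two paths in send_f.c -- not the actual code
 * in the hg tree.  MPI_Type_f2c/MPI_Comm_f2c stand in for OMPI's
 * internal handle-conversion macros.
 */
#include "mpi.h"

#define WANT_C 1   /* hand-edit: 1 = layer on the C binding, 0 = don't */

void mpi_send_f(char *buf, MPI_Fint *count, MPI_Fint *datatype,
                MPI_Fint *dest, MPI_Fint *tag, MPI_Fint *comm,
                MPI_Fint *ierr)
{
    MPI_Datatype c_type = MPI_Type_f2c(*datatype);
    MPI_Comm c_comm = MPI_Comm_f2c(*comm);

#if WANT_C
    /* layered: one extra function call into the C binding */
    *ierr = (MPI_Fint) MPI_Send(buf, (int) *count, c_type,
                                (int) *dest, (int) *tag, c_comm);
#else
    /* non-layered: the guts of ompi/mpi/c/send.c would be copied here,
     * calling straight into the PML and skipping the MPI_Send call
     * (omitted in this sketch) */
    *ierr = (MPI_Fint) MPI_SUCCESS;
#endif
}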
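
The shape of the timing loop is roughly this simplified C skeleton; in
the actual test the inner send/recv loop is driven from Fortran, and
everything here other than the 0-byte size and the 10,000 iterations is
illustrative. It prints one half-round-trip time per iteration so stdout
can be saved straight to a .csv file.

#include <stdio.h>
#include "mpi.h"

#define ITERATIONS 10000

int main(int argc, char *argv[])
{
    char buf[1];                /* 0-byte messages; contents never used */
    int rank, size, i;
    double t0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    for (i = 0; i < ITERATIONS; ++i) {
        t0 = MPI_Wtime();
        if (0 == rank) {
            MPI_Send(buf, 0, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
            MPI_Recv(buf, 0, MPI_CHAR, 1, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            /* half the round trip, in microseconds */
            printf("%d,%f\n", i, (MPI_Wtime() - t0) * 1.0e6 / 2.0);
        } else if (1 == rank) {
            MPI_Recv(buf, 0, MPI_CHAR, 0, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, 0, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}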
End result: I guess I'm a little surprised that the difference is that
clear -- does a function call really take 10ns? I'm also surprised
that the layered C version has significantly more jitter than the non-
layered version; I can't really explain that. I'd welcome anyone else
replicating the experiment and/or eyeballing my code to make sure I didn't
bork something up.
Drawback
- Duplicate some code (but this code rarely/never changes)
It's still code bloat.
- May break MPI profiling libraries that assume that the Fortran
MPI API functions call the C MPI API functions
I'm not really familiar with the issues here, but this strikes me as a
serious drawback.
I think it would be pretty easy to have a compile-time and/or run-time
switch to decide which to use; a rough sketch of that idea is below.
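
Something like the following, purely as a sketch: the macro and
environment-variable names are made up, and in OMPI proper the run-time
knob would presumably be an MCA parameter rather than getenv(). The two
branches selected by this flag would be the layered and non-layered
paths shown earlier.

#include <stdlib.h>

#ifndef OMPI_F77_LAYER_ON_C
#define OMPI_F77_LAYER_ON_C 1          /* compile-time default */
#endif

/* Decide once, on first use, whether the F77 bindings layer on C */
static int f77_layer_on_c(void)
{
    static int decided = -1;
    if (-1 == decided) {
        const char *e = getenv("OMPI_F77_LAYER_ON_C");
        decided = (NULL != e) ? atoi(e) : OMPI_F77_LAYER_ON_C;
    }
    return decided;
}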
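
For context on the profiling-library drawback above: many PMPI-based
tools intercept only the C entry points, along these lines (an
illustrative wrapper, not taken from any particular tool):

#include <stdio.h>
#include "mpi.h"

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    /* tool-side bookkeeping would go here */
    printf("intercepted MPI_Send: count=%d dest=%d tag=%d\n",
           count, dest, tag);

    /* forward to the real implementation via the name-shifted entry */
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

A non-layered mpi_send_f that copies the guts of MPI_Send instead of
calling it would skip a wrapper like this entirely unless the tool also
intercepts the Fortran symbols -- which is exactly the concern above.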
--
Jeff Squyres
Cisco Systems