Many thanks for the informative explanation, Jeff. I appreciate this
issue has been the cause of some deliberation!
This was the changeset where we did the ABI fixes -- ensuring that if
you compile/link against Open MPI vA.B.C, you will be able to just
change your LD_LIBRARY_PATH or replace libmpi.so when upgrading to Open
MPI vA.B.x.
But -- to make a long
story short -- the linker will embed the length of the
ompi_communicator_t in the hello executable. Among other reasons, it's
so that pointer math can be computed properly.
The goal is admirable and a stalwart of good open source practice which
avoids "DLL-Hell". However, I simply don't understand how my compiler
can *ever* know the size of your opaque ompi_communicator_t?
mpi.h declares MPI_Comm as a pointer to as-yet-undefined "struct
ompi_communicator_t", ompi_mpi_comm_world is an external global of
as-yet-undefined "struct ompi_predefined_communicator_t". Then the
compiler must try and work out whether the "comm_predefined_t *" aliases
the "comm_t *". This isn't possible without full type information. In
general, only (void *) and (char *) can alias arbitrary types.
See
http://www.cellperformance.com/mike_acton/2006/06/understanding_strict_aliasing.html
for an excellent discussion on aliasing.
However, let's assume
the user upgrades to OMPI vA.B.(C+1) in place, meaning that libmpi.so is
replaced with a new version. If we've changed the back-end
ompi_communicator_t struct (e.g., added a member), then the size that is
included in the hello executable no longer matches the size that is in
libmpi.so -- and Bad Things can (and do) happen.
I don't get how the linker can affect the actual object code generated
for my hello.o, except via relocation of symbols. If none of my code
makes use of, or even *knows* about your internal struct definitions,
and only holds pointers to them, then I can't do any pointer arithmetic
or anything that requires internal information; I get: "invalid use of
undefined type ‘struct XXXX'" or similar.
I can hold a simple pointer to your private object, but this should
maintain ABI compatibility; isn't that what the PIMPL C++ idiom is all
about and widely used (e.g. Qt Toolkit). The pointer-to-global gets
relocated at shared library load time.
Hence, the solution of r20627: make a "dummy" type that is guaranteed to
be larger than ompi_communicator_t -- ompi_predefined_communicator_t.
It's actually a struct that *contains* ompi_communicator_t and then a
fixed amount of padding at the end. Since MPI_COMM_WORLD will always be
this ompi_predefined_communicator_t, we can ensure that its size stays
constant, even if the size of ompi_communicator_t changes. Hence, the
size in the hello executable and libmpi.so will be the same. Happiness.
With respect, this feels a little bit like a hack. Who's to say your
future communicator won't need to be even bigger than the current
padding allows? - and then the assumptions made when linking against the
old predefined_t would no longer apply leading to corruption.
The compiler complains because ompi_mpi_comm_world is declared as an
"extern struct ompi_predefined_communicator_t" but the type is
incomplete, so it can't tell whether the cast is a permissible
almost-the-same type pun (e.g. an "int" can alias an "unsigned").
I think this is potentially a serious performance issue for anyone using
OpenMPI in a C++ environment, and the profuse warnings preclude it's use
in our build system.
I agree that the warnings are Bad.
The question is -- why do they only show up in gcc 4.1? More
specifically -- why do they *not* show up in other versions of gcc? Is
it a gcc 4.1 compiler bug?
Older GCC weren't as strict about aliasing issues. Perhaps the latter
ones recognise that in this context (a function call with pointer to
non-local parameter), no optimisation would be possible anyway.
The bad news is that the only work around I have is to insert (void *)
casts between (MPI_TYPENAME) and the address operator, e.g.:
#define MPI_COMM_WORLD (((MPI_Comm)(void *)&(ompi_mpi_comm_world)))
Hmm. I guess that's plausible, but ugly.
Actually, I think it's closer to what you've been arguing above: you're
insisting that the compiler blindly interpret &ompi_mpi_comm_world as a
pointer to some memory that really is equivalent to the other unknown
type communicator_t. Without the intermediate (void *), you're
suggesting the compiler could possibly do better optimization.
Ultimately, for internal use, the (void *) is bad, but from client code
with no knowledge of your types, it should be mandatory and tells the
compiler to make no assumptions about aliasing.
An alternative might be to make the full type definition available by
#including some of the internal developer headers such as
ompi/communicator/communicator.h
I'm not sure what the right solution is, but that is not attractive. :-)
Completely agree (on both points!).
In summary, I still don't see why holding pointers to opaque types is
not ABI-compatible, and would suggest the (void *) compiler hint in the
meantime.
Regards,
Simon