On Apr 24, 2009, at 8:08 AM, Number Cruncher wrote:
Compiler (g++ 4.1.2) output when using optimization (-O2):
/opt/cfs/include/openmpi/ompi/mpi/cxx/datatype.h: In constructor
‘MPI::Datatype::Datatype()’:
/opt/cfs/include/openmpi/ompi/mpi/cxx/datatype.h:68: warning:
type-punning to incomplete type might break strict-aliasing rules
These lines are mostly of the form:
inline Datatype() : mpi_datatype(MPI_DATATYPE_NULL) { }
Assumedly the problem is with MPI_DATATYPE_NULL (i.e., MPI_*_NULL for
the other types). More below.
Without the C++ bindings, I still get:
helloMPI.cpp: In function ‘int main(int, char**)’:
helloMPI.cpp:8: warning: type-punning to incomplete type might break
strict-aliasing rules
The odd thing is that these warnings only seem to appear in gcc 4.1
output. They do not appear in gcc 3.4.6 and 4.4.0 (I don't know if
they appear in intermediate versions or not).
I've followed the discussion at ompi/communicator/communicator.h and
looked at the major changeset at
https://svn.open-mpi.org/trac/ompi/changeset/20627 .
This was the changeset where we did the ABI fixes -- ensuring that if
you compile/link against Open MPI vA.B.C, you will be able to just
change your LD_LIBRARY_PATH or replace libmpi.so when upgrading to
Open MPI vA.B.x.
The issue here is surprisingly complex; it took us a long time to come
up with the current solution.
For the purpose of this discussion, assume OMPI vA.B.C is built with
shared libraries (libmpi.so and friends), and a user application
(hello.c) compiles and links against them.
As you noted, OMPI's handles are all pointers -- e.g., MPI_Comm is
(ompi_communicator_t*). We purposefully do not include the struct
definition of ompi_communicator_t in mpi.h because it's private -- we
don't want users to go poking around in there. But -- to make a long
story short -- the linker will embed the length of the
ompi_communicator_t in the hello executable. Among other reasons,
it's so that pointer math can be computed properly. However, let's
assume the user upgrades to OMPI vA.B.(C+1) in place, meaning that
libmpi.so is replaced with a new version. If we've changed the back-
end ompi_communicator_t struct (e.g., added a member), then the size
that is included in the hello executable no longer matches the size
that is in libmpi.so -- and Bad Things can (and do) happen.
Hence, the solution of r20627: make a "dummy" type that is guaranteed
to be larger than ompi_communicator_t --
ompi_predefined_communicator_t. It's actually a struct that
*contains* ompi_communicator_t and then a fixed amount of padding at
the end. Since MPI_COMM_WORLD will always be this
ompi_predefined_communicator_t, we can ensure that its size stays
constant, even if the size of ompi_communicator_t changes. Hence, the
size in the hello executable and libmpi.so will be the same. Happiness.
Additionally, casing the MPI_COMM_WORLD instance (in the back end: a
variable instance named ompi_mpi_comm_world) from
ompi_predefined_communicator_t to ompi_communicator_t is valid,
because the first bytes in the struct *are* the same type -- a poor
man's C++ inheritance system, per se (or simply using the correct
_predefined struct member).
Note that as you can probably guess from the struct names, this *only*
affects the MPI pre-defined handles (e.g., MPI_COMM_WORLD). It does
not affect user-defined communicators. Hence, pre-defined handles
take up some extra padding space at the end, but user-defined
communicators do not.
That's the rationale for why we did what we did.
The problem is that, with optimization enabled, the compiler can
assume
that "an object of one type is assumed never to reside at the same
address as an object of a different type, unless the types are almost
the same" (gcc info page). In this case ompi_predefined_*_t are not
fully defined by the time the C++ compiler expands the macro
MPI_COMM_WORLD to:
((ompi_communicator_t *)&(ompi_mpi_comm_world))
Per above, this was intentional. Hope my explanations made sense.
The compiler complains because ompi_mpi_comm_world is declared as an
"extern struct ompi_predefined_communicator_t" but the type is
incomplete, so it can't tell whether the cast is a permissible
almost-the-same type pun (e.g. an "int" can alias an "unsigned").
I think this is potentially a serious performance issue for anyone
using
OpenMPI in a C++ environment, and the profuse warnings preclude it's
use
in our build system.
I agree that the warnings are Bad.
The question is -- why do they only show up in gcc 4.1? More
specifically -- why do they *not* show up in other versions of gcc?
Is it a gcc 4.1 compiler bug?
The bad news is that the only work around I have is to insert (void *)
casts between (MPI_TYPENAME) and the address operator, e.g.:
#define MPI_COMM_WORLD (((MPI_Comm)(void *)&(ompi_mpi_comm_world)))
Hmm. I guess that's plausible, but ugly.
An alternative might be to make the full type definition available by
#including some of the internal developer headers such as
ompi/communicator/communicator.h
I'm not sure what the right solution is, but that is not
attractive. :-)
--
Jeff Squyres
Cisco Systems