Re: [OMPI devel] predefined ompi_t types break strict-aliasing rules

Terry Dontje Mon, 27 Apr 2009 07:26:25 -0400

Jeff Squyres wrote:

On Apr 24, 2009, at 1:24 PM, Number Cruncher wrote:
The goal is admirable and a stalwart of good open source practice which
avoids "DLL-Hell". However, I simply don't understand how my compiler
can *ever* know the size of your opaque ompi_communicator_t?
I'm not enough of a linker expert to explain why, but before thatcommit, the linker did embed the size of ompi_communicator_t into yourexecutable.

The problem is when you compile your application the handles likeMPI_COMM_WORLD result in undefined symbols in the .o. So when you linkthe application it needs to resolve all those symbols. In this case itresolves them to what libmpi.so has them defined as which includeslocation and size. Since these are accessible by the application thestructures are located in bss which means it overrides any other librarylocation of such a variable. So now we have a handle that size andlocation is predefined after link time preventing updating either duringruntime by linking in a different library.



--td

A simple way to verify this is to do the following:
- compile and link an MPI "hello world" application against Open MPI1.2.x
- run the app, verify that it runs properly
- change your LD_LIBRARY_PATH and have your "hello world" app findOpen MPI 1.3 / 1.3.1's libmpi.so
- run your app
- notice a bunch of messages something like the following (off the topof my head; I don't remember the exact message):
ompi_mpi_comm_world size difference between hello and libmpi.so;consider re-linking
So it's not the *compiler* that creates the problem -- who, I agree,does not know the definition of ompi_comnunicator_t. It's the*linker*, that knows the *size* of these objects (not necessarily thedefinition).
You can simulate this problem with a much smaller non-MPI test code,too. I'll admit that this problem kind of blind-sided us -- we neverexpected it because we initially had the same thoughts that you did:they're just pointer values that are resolved at run-time; why doesthe size of the back-end struct matter? Unfortunately, it does. :-(
mpi.h declares MPI_Comm as a pointer to as-yet-undefined "struct
ompi_communicator_t", ompi_mpi_comm_world is an external global of
as-yet-undefined "struct ompi_predefined_communicator_t". Then the
compiler must try and work out whether the "comm_predefined_t *" aliases
the "comm_t *". This isn't possible without full type information. In
general, only (void *) and (char *) can alias arbitrary types.

See
http://www.cellperformance.com/mike_acton/2006/06/understanding_strict_aliasing.html
for an excellent discussion on aliasing.
Understood.
> Hence, the solution of r20627: make a "dummy" type that isguaranteed to
> be larger than ompi_communicator_t -- ompi_predefined_communicator_t.
> It's actually a struct that *contains* ompi_communicator_t and then a
> fixed amount of padding at the end. Since MPI_COMM_WORLD willalways be
> this ompi_predefined_communicator_t, we can ensure that its size stays
> constant, even if the size of ompi_communicator_t changes.  Hence, the
> size in the hello executable and libmpi.so will be the same.Happiness.
With respect, this feels a little bit like a hack. Who's to say your
future communicator won't need to be even bigger than the current
padding allows? - and then the assumptions made when linking against the
old predefined_t would no longer apply leading to corruption.
Definitely true.  Here's our rationale:

- As Terry mentioned, MPI requires that you can do stuff like:

    MPI_Comm foo = MPI_COMM_WORLD;
before calling MPI_INIT. This is where much of the evil comes from;MPI_COMM_WORLD has to be a compile-time constant; it can't be a globalvariable that is assigned a meaningful value during MPI_INIT (forexample).
- The padding is gross, and we could certainly run out of spacesomeday. However, this concern is mitigated by three points:
1. We tried to give ourselves enough padding that we shouldn't runout of space (admittedly, we can't predict the future, though, so thismay still fail someday)
2. We probably didn't spell out the conditions well enough in ourcurrent documentation, but I believe that the ABI guarantees reallyonly apply to the "stable" series -- e.g., 1.4.x. However, we knowthe features that are going into 1.3.x (which will then morph into the1.4.x series), and we don't anticipate changing the sizes of ourstructures in 1.3.x and 1.4.x. 1.5.x is a different issue; we don'tknow yet if the structs will change size in that series -- and the ABIis ok to change at 1.5 (it would be nice if it does not, but I don'tknow if we've taken a hard-line guarantee that an app compiled for1.3.x will work with 1.5.x, for example).
3. As Terry mentioned, if we do grow beyond the current padding,there is the gross hack of adding a pointer in the current padding topoint to the rest of the struct. Gross, yes. Workable, yes. It'sour failsafe in case we *have* to increase the size of a struct in the1.3/1.4 series, but can't violate the ABI.
> The question is -- why do they only show up in gcc 4.1?  More
> specifically -- why do they *not* show up in other versions ofgcc? Is
> it a gcc 4.1 compiler bug?

Older GCC weren't as strict about aliasing issues. Perhaps the latter
ones recognise that in this context (a function call with pointer to
non-local parameter), no optimisation would be possible anyway.
That sounds plausible.
>> The bad news is that the only work around I have is to insert(void *)
>> casts between (MPI_TYPENAME) and the address operator, e.g.:
>> #define MPI_COMM_WORLD (((MPI_Comm)(void *)&(ompi_mpi_comm_world)))
>
> Hmm.  I guess that's plausible, but ugly.

Actually, I think it's closer to what you've been arguing above: you're
insisting that the compiler blindly interpret &ompi_mpi_comm_world as a
pointer to some memory that really is equivalent to the other unknown
type communicator_t. Without the intermediate (void *), you're
suggesting the compiler could possibly do better optimization.

Ultimately, for internal use, the (void *) is bad, but from client code
with no knowledge of your types, it should be mandatory and tells the
compiler to make no assumptions about aliasing.
I think you're convincing me, but I need to think over this a littlemore.
I wonder if we should have a "smart" #define for MPI_COMM_WORLD thatputs the (void*) in when compiling user codes, and doesn't include itwhen building OMPI itself (i.e., so that we can get the debuggingbenefit of proper type checking when building OMPI). We do have theOMPI_BUILDING macro in opal_config_bottom.h that could be used toswitch the macro definition. Hrm...

Re: [OMPI devel] predefined ompi_t types break strict-aliasing rules

Reply via email to