Jeff Squyres wrote:
On Apr 24, 2009, at 1:24 PM, Number Cruncher wrote:

The goal is admirable and a stalwart of good open source practice which
avoids "DLL-Hell". However, I simply don't understand how my compiler
can *ever* know the size of your opaque ompi_communicator_t?


I'm not enough of a linker expert to explain why, but before that commit, the linker did embed the size of ompi_communicator_t into your executable.

The problem is when you compile your application the handles like MPI_COMM_WORLD result in undefined symbols in the .o. So when you link the application it needs to resolve all those symbols. In this case it resolves them to what libmpi.so has them defined as which includes location and size. Since these are accessible by the application the structures are located in bss which means it overrides any other library location of such a variable. So now we have a handle that size and location is predefined after link time preventing updating either during runtime by linking in a different library.


--td
A simple way to verify this is to do the following:

- compile and link an MPI "hello world" application against Open MPI 1.2.x
- run the app, verify that it runs properly
- change your LD_LIBRARY_PATH and have your "hello world" app find Open MPI 1.3 / 1.3.1's libmpi.so
- run your app
- notice a bunch of messages something like the following (off the top of my head; I don't remember the exact message):

ompi_mpi_comm_world size difference between hello and libmpi.so; consider re-linking

So it's not the *compiler* that creates the problem -- who, I agree, does not know the definition of ompi_comnunicator_t. It's the *linker*, that knows the *size* of these objects (not necessarily the definition).

You can simulate this problem with a much smaller non-MPI test code, too. I'll admit that this problem kind of blind-sided us -- we never expected it because we initially had the same thoughts that you did: they're just pointer values that are resolved at run-time; why does the size of the back-end struct matter? Unfortunately, it does. :-(

mpi.h declares MPI_Comm as a pointer to as-yet-undefined "struct
ompi_communicator_t", ompi_mpi_comm_world is an external global of
as-yet-undefined "struct ompi_predefined_communicator_t". Then the
compiler must try and work out whether the "comm_predefined_t *" aliases
the "comm_t *". This isn't possible without full type information. In
general, only (void *) and (char *) can alias arbitrary types.

See
http://www.cellperformance.com/mike_acton/2006/06/understanding_strict_aliasing.html
for an excellent discussion on aliasing.


Understood.

> Hence, the solution of r20627: make a "dummy" type that is guaranteed to
> be larger than ompi_communicator_t -- ompi_predefined_communicator_t.
> It's actually a struct that *contains* ompi_communicator_t and then a
> fixed amount of padding at the end. Since MPI_COMM_WORLD will always be
> this ompi_predefined_communicator_t, we can ensure that its size stays
> constant, even if the size of ompi_communicator_t changes.  Hence, the
> size in the hello executable and libmpi.so will be the same. Happiness.

With respect, this feels a little bit like a hack. Who's to say your
future communicator won't need to be even bigger than the current
padding allows? - and then the assumptions made when linking against the
old predefined_t would no longer apply leading to corruption.


Definitely true.  Here's our rationale:

- As Terry mentioned, MPI requires that you can do stuff like:

    MPI_Comm foo = MPI_COMM_WORLD;

before calling MPI_INIT. This is where much of the evil comes from; MPI_COMM_WORLD has to be a compile-time constant; it can't be a global variable that is assigned a meaningful value during MPI_INIT (for example).

- The padding is gross, and we could certainly run out of space someday. However, this concern is mitigated by three points:

1. We tried to give ourselves enough padding that we shouldn't run out of space (admittedly, we can't predict the future, though, so this may still fail someday)

2. We probably didn't spell out the conditions well enough in our current documentation, but I believe that the ABI guarantees really only apply to the "stable" series -- e.g., 1.4.x. However, we know the features that are going into 1.3.x (which will then morph into the 1.4.x series), and we don't anticipate changing the sizes of our structures in 1.3.x and 1.4.x. 1.5.x is a different issue; we don't know yet if the structs will change size in that series -- and the ABI is ok to change at 1.5 (it would be nice if it does not, but I don't know if we've taken a hard-line guarantee that an app compiled for 1.3.x will work with 1.5.x, for example).

3. As Terry mentioned, if we do grow beyond the current padding, there is the gross hack of adding a pointer in the current padding to point to the rest of the struct. Gross, yes. Workable, yes. It's our failsafe in case we *have* to increase the size of a struct in the 1.3/1.4 series, but can't violate the ABI.

> The question is -- why do they only show up in gcc 4.1?  More
> specifically -- why do they *not* show up in other versions of gcc? Is
> it a gcc 4.1 compiler bug?

Older GCC weren't as strict about aliasing issues. Perhaps the latter
ones recognise that in this context (a function call with pointer to
non-local parameter), no optimisation would be possible anyway.


That sounds plausible.

>> The bad news is that the only work around I have is to insert (void *)
>> casts between (MPI_TYPENAME) and the address operator, e.g.:
>> #define MPI_COMM_WORLD (((MPI_Comm)(void *)&(ompi_mpi_comm_world)))
>
> Hmm.  I guess that's plausible, but ugly.

Actually, I think it's closer to what you've been arguing above: you're
insisting that the compiler blindly interpret &ompi_mpi_comm_world as a
pointer to some memory that really is equivalent to the other unknown
type communicator_t. Without the intermediate (void *), you're
suggesting the compiler could possibly do better optimization.

Ultimately, for internal use, the (void *) is bad, but from client code
with no knowledge of your types, it should be mandatory and tells the
compiler to make no assumptions about aliasing.


I think you're convincing me, but I need to think over this a little more.

I wonder if we should have a "smart" #define for MPI_COMM_WORLD that puts the (void*) in when compiling user codes, and doesn't include it when building OMPI itself (i.e., so that we can get the debugging benefit of proper type checking when building OMPI). We do have the OMPI_BUILDING macro in opal_config_bottom.h that could be used to switch the macro definition. Hrm...


Reply via email to