Followup for the list... a bit of explanation of Nathan's problem about shared libraries and unresolved symbols.

Short version:
--------------

It's an OMPI bug when built as a shared library (not an issue for static libraries). The fix is straightforward, but involves grunt work. I'll try to get a student to do it RSN.

Long version:
-------------

What's happening is that we are not linking OMPI components against the opal/orte/ompi libraries. As such, we are exploiting the fact that when they are dlopened by a standalone application (e.g., a.out), the Libtool portable version of dlopen() exports all the symbols from the parent process such that the child can find and use them at run-time to resolve any unknown symbols. Here's an example (I'm leaving out some fine-grained details, and it's slightly different on different OS's, but this is "true enough" for the purposes of this thread):

- a.out, which was linked against libopal.so (and friends), launches
- the linker runs into an unresolved symbol
- the linker sees that the symbol is supposed to be in "libopal.so", and starts searching LD_LIBRARY_PATH for it
- the linker finds libopal.so, loads it, and is able to resolve the symbol

It gets interesting at this part:

- within MPI_Init()/orte_init()/opal_init() (i.e., however you initialized yourself to OMPI/ORTE/OPAL), we use the Libtool portable dlopen() to open our components
- the components will have unresolved symbols as well (i.e., the symbols in libopal, liborte, and libmpi)
- when the linker hits these, it tries to resolve them.
- first, the linker looks in the public namespace of the process, and if it finds the symbols there, it's done
- in this case, libopal (and friends) have already been loaded in the process, so the linker can find the symbols right away -- without loading any additional libraries
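
In plain dlopen() terms, the "public namespace" lookup in the steps above can be sketched roughly like this. This is only an illustration, not what OMPI does internally (OMPI goes through Libtool's portable wrapper, not raw dlopen()). The helper names symbol_is_public() and open_component() are made up for the example, and I'm assuming a dynamically linked program on a system with POSIX dlopen() (older glibc needs -ldl at link time):

```c
#include <dlfcn.h>
#include <stddef.h>

/* Returns 1 if `name` can be found in the process-wide (public) symbol
 * namespace, 0 otherwise.  dlopen(NULL, ...) yields a handle for the main
 * program; dlsym() on that handle searches the executable, the libraries
 * loaded at startup, and anything later opened with RTLD_GLOBAL. */
int symbol_is_public(const char *name)
{
    void *self = dlopen(NULL, RTLD_LAZY);
    int found;

    if (self == NULL) {
        return 0;
    }
    found = (dlsym(self, name) != NULL);
    dlclose(self);
    return found;
}

/* Hypothetical helper: open a component either into the public namespace
 * (RTLD_GLOBAL, so later-loaded objects can resolve against it) or into a
 * private one (RTLD_LOCAL).  Returns NULL if the file cannot be loaded. */
void *open_component(const char *path, int make_public)
{
    int flags = RTLD_LAZY | (make_public ? RTLD_GLOBAL : RTLD_LOCAL);
    return dlopen(path, flags);
}
```

In a standalone a.out linked against libopal.so, a check like symbol_is_public("mca_base_param_reg_int") would be expected to succeed, which is why the components can get away with leaving that symbol unresolved.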

This is the scheme that we were relying on for libopal/orte/ompi symbols to be resolved in our components. And for standalone executables, it works fine.

But for an environment like Eclipse, it doesn't.

I don't know anything about Eclipse, but I'm assuming that it does something similar to our component system -- it dlopen's its plugins. However -- here's where my guess comes in -- it doesn't put the symbols of an opened plugin into the public namespace of the process (this is different than what OMPI does, for various reasons). Now, if you build an Eclipse component against OMPI, the Eclipse component will be dynamically linked against libopal (etc.). So when Eclipse loads your component, similar to the standalone executable example above, the linker will realize that it has unresolved symbols and will use the normal mechanism to resolve them (e.g., look for libopal.so in LD_LIBRARY_PATH).

The problem comes in when we dlopen OMPI/ORTE/OPAL components.

Our scheme assumed that we'd be able to find the opal/orte/ompi symbols in the public namespace of the parent process. But they're not -- Eclipse loaded the component in a private namespace, and therefore all the opal/orte/ompi symbols are in that private namespace. And therefore the OMPI/ORTE/OPAL components can't find the symbols, and the linker barfs.
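
To see this kind of failure outside of OMPI, something like the following sketch can be used. It tries to load one shared object with every symbol resolved up front (RTLD_NOW), so a missing symbol is reported at load time instead of at first use. The helper name try_load_component() is made up for illustration, and this uses raw POSIX dlopen(), not the Libtool wrapper that OMPI itself calls:

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: try to load one component into a private namespace,
 * resolving all of its symbols immediately (RTLD_NOW).
 * Returns 0 on success.  On failure, returns -1 and copies the dynamic
 * linker's error message (from dlerror()) into errbuf. */
int try_load_component(const char *path, char *errbuf, size_t errlen)
{
    void *handle;

    dlerror();  /* clear any stale error state */
    handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        const char *msg = dlerror();
        snprintf(errbuf, errlen, "%s",
                 msg != NULL ? msg : "unknown dlopen error");
        return -1;
    }

    dlclose(handle);
    if (errlen > 0) {
        errbuf[0] = '\0';
    }
    return 0;
}
```

Called on a component that has unresolved symbols and no library dependency recorded to satisfy them, I would expect the message in errbuf to look much like the "undefined symbol: mca_base_param_reg_int" error quoted below, although the exact wording depends on the dynamic linker.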

The solution is to change our scheme in OMPI a bit. We just need to add a few lines to all the component Makefile.am's to, in the dynamic case, link the components against their relevant libraries (opal components linked against libopal, orte components linked against liborte and libopal, etc.). This does not make the components significantly larger -- it just adds an entry into the dynamic linker section of the component's resulting .so file indicating "if you have unresolved symbols, go look in libopal.so" (etc.).

This allows the components themselves to pull in shared libraries when they are dlopened -- if they need to. If the symbols can be resolved in the parent process' public symbol namespace, they still will be (as in the standalone executable example, above). But if they can't be resolved that way, this gives the ability to explicitly find and pull in a shared library and resolve the symbols that way (as in the Eclipse plugin example, above).

Aren't computers fun?  :-)


On Sep 14, 2005, at 12:47 PM, Nathan DeBardeleben wrote:

Let me explain what I'm doing real quickly.

I have a piece of Java code which is calling OMPI calls. It's doing this through JNI (java native interface). Don't worry, you don't have to understand Java to try and help me here. The JNI code is C with some funky macros in it provided by Java.

I have to compile the JNI C code into a shared library and then the Java code will load it dynamically when that class is instantiated.

So - here's my compile line:

[sparkplug]~/<2>ompi > mpicc -I /usr/java/jdk1.5.0_04/include -I /usr/java/jdk1.5.0_04/include/linux -c ptp_ompi_jni.c -fPIC
[sparkplug]~/<2>ompi > mpicc -I /usr/java/jdk1.5.0_04/include -I /usr/java/jdk1.5.0_04/include/linux -shared -o libptp_ompi_jni.so ptp_ompi_jni.o

I then have libptp_ompi_jni.so. I then load that from within Java. If I setup my LD_LIBRARY_PATH and some args to the Java VM correctly, then it finds the above library and loads it up. OK - all fine so far.

However, when I call 'orte_init()' it craps out with the following error:

/usr/java/jdk1.5.0_04/bin/java: error while loading shared libraries: /home/ndebard/local/ompi/lib/openmpi/mca_paffinity_linux.so: undefined symbol: mca_base_param_reg_int

So I went digging in mca_paffinity_linux.so looking for that symbol.

[sparkplug]~/<3>openmpi > nm mca_paffinity_linux.so | grep mca_base_param_reg
                 U mca_base_param_reg_int
[sparkplug]~/<3>openmpi >

OK.  So it's undefined in that .so.
I'm really not a library guy (can't you tell from my myriad of mails?). What does this mean? I went back digging in the parent directory, /home/ndebard/local/ompi/lib, to find the symbol.

[sparkplug]~/<2>lib > nm libopal.so | grep mca_base_param_reg_int
000000000001ce00 T mca_base_param_reg_int
000000000001cea3 T mca_base_param_reg_int_name
[sparkplug]~/<2>lib >

OK so I read this as it's defined in libopal.so.
Do you have any idea why my JNI library is trying to load mca_paffinity_linux.so? I went back to my compile line and added -lopal -lmpi -lorte just in case, but that didn't help.

Again, Jeff, I know this isn't really your concern (unless you want a wicked OMPI graphical demo at SC!) :) but I wanted to drop it out there in case you had any insight. I'm kinda stumped on this one.

Does it mean my ompi compile is bad?

-- Nathan
Correspondence
---------------------------------------------------------------------
Nathan DeBardeleben, Ph.D.
Los Alamos National Laboratory
Parallel Tools Team
High Performance Computing Environments
phone: 505-667-3428
email: ndeb...@lanl.gov
---------------------------------------------------------------------



Jeff Squyres wrote:

Maybe I'm dense -- I thought you couldn't use --shared when linking to a static library...?

If you want to build OMPI as a shared library, then ditch the --enable-static --disable-shared from your configure line (building OMPI as shared is the default, which is how I build 95% of the time).



On Sep 12, 2005, at 5:47 PM, Nathan DeBardeleben wrote:


I've been having this problem for a week or so and I've been asking
other people to weigh in if they know what I'm doing wrong. I've gotten
nowhere on this so I figure I'll finally drop it out on the list.
First, here's the important info:
The machine:


[sparkplug]~ > cat /etc/issue

Welcome to SuSE Linux 9.1 (x86-64) - Kernel \r (\l).


[sparkplug]~ > uname -a
Linux sparkplug 2.6.10 #4 SMP Wed Jan 26 11:50:00 MST 2005 x86_64
x86_64 x86_64 GNU/Linux

My versions of libtool, autoconf, automake:


[sparkplug]~ > libtool --version
ltmain.sh (GNU libtool) 1.5.20 (1.1220.2.287 2005/08/31 18:54:15)

Copyright (C) 2005  Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE.
[sparkplug]~ > autoconf --version
autoconf (GNU Autoconf) 2.59
Written by David J. MacKenzie and Akim Demaille.

Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE.
[sparkplug]~ > automake --version
automake (GNU automake) 1.8.5
Written by Tom Tromey <tro...@redhat.com>.

Copyright 2004 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE.
[sparkplug]~ >

My ompi version: 7322 - but this has been going on for a few days like I
said and I've been updating a lot, with no progress.

Configured using:


$ ./configure --enable-static --disable-shared --without-threads \
    --prefix=/home/ndebard/local/ompi --with-devel-headers \
    --enable-mca-no-build=ptl-gm

Simple C file which I will compile into a shared library:


int test_compile(int x) {
   int rc;

   rc = orte_init(true);
   printf("rc = %d\n", rc);

   return x + 1;
}

Above file is named 'testlib.c'

OK, so let's build this:


[sparkplug]~/ompi-test > mpicc -c testlib.c
[sparkplug]~/ompi-test > mpicc -shared -o libtestlib.so testlib.o
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../x86_64-suse-linux/bin/ld: testlib.o: relocation R_X86_64_32 can not be used when making a shared object; recompile with -fPIC
testlib.o: could not read symbols: Bad value
collect2: ld returned 1 exit status

OK so relocation problems. Maybe I'll follow the directions and -fPIC
my file myself:


[sparkplug]~/ompi-test > mpicc -c testlib.c -fPIC
[sparkplug]~/ompi-test > mpicc -shared -o libtestlib.so testlib.o
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../x86_64-suse-linux/bin/ld: /home/ndebard/local/ompi/lib/liborte.a(orte_init.o): relocation R_X86_64_32 can not be used when making a shared object; recompile with -fPIC
/home/ndebard/local/ompi/lib/liborte.a: could not read symbols: Bad value
collect2: ld returned 1 exit status

OK so I read this as there's a relocation problem in 'liborte.a'.  I
un-arred liborte.a and checked some of the files with 'file' and it says
64bit. I haven't yet written a script to check every file in here, but
here's orte_init.o:


[sparkplug]~/<1>tmp > file orte_init.o
orte_init.o: ELF 64-bit LSB relocatable, AMD x86-64, version 1 (SYSV),
not stripped

So that at least says it's 64bit.
And to confirm, my mpicc's 64bit too:


[sparkplug]~/<1>tmp > which mpicc
/home/ndebard/local/ompi/bin/mpicc
[sparkplug]~/<1>tmp > file /home/ndebard/local/ompi/bin/mpicc
/home/ndebard/local/ompi/bin/mpicc: ELF 64-bit LSB executable, AMD
x86-64, version 1 (SYSV), for GNU/Linux 2.4.1, dynamically linked
(uses shared libs), not stripped

Someone suggested I take out the '--disable-shared' from the configure
line, so I did.  The result was the same.

So the result is that I cannot build a shared library that uses orte
calls on a 64bit linux machine.
So then I tried taking out the orte calls and instead use MPI calls.
Sure, this function makes no sense but here it is now:


#include "orte_config.h"
#include <mpi.h>

int test_compile(int x) {
   MPI_Comm_rank(MPI_COMM_WORLD, &x);

   return x + 1;
}

And now, when I try and make a shared object I get relocation errors:


/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../x86_64-suse-linux/bin/ld: /home/ndebard/local/ompi/lib/libmpi.a(comm_init.o): relocation R_X86_64_32 can not be used when making a shared object; recompile with -fPIC
/home/ndebard/local/ompi/lib/libmpi.a: could not read symbols: Bad value

So... could perhaps the build be messed up and not be really using 64bit
code?
Am I the only one seeing this?  It's a trivial test for those of you
with access to a 64bit machine if you wouldn't mind testing for me.

Help would be greatly appreciated.

--
-- Nathan
Correspondence
---------------------------------------------------------------------
Nathan DeBardeleben, Ph.D.
Los Alamos National Laboratory
Parallel Tools Team
High Performance Computing Environments
phone: 505-667-3428
email: ndeb...@lanl.gov
---------------------------------------------------------------------

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel






--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
