On Mar 5, 2010, at 6:02 PM, Jeff Squyres (jsquyres) wrote:
> I wondered aloud on IM to Terry after your earlier emails if we should just
> custom-patch ltdl in OMPI to fix this issue. The problem is that libltdl is
> effectively reporting the "wrong" error back to OMPI, so the error string
> that we get to print out ends up not being very useful (e.g., not showing
> which symbol was missing, or what the problem was with the dlopen). Fixing
> this properly in libltdl is actually somewhat tricky -- which is why it
> hasn't been fixed yet. But given that OMPI's use of libltdl is pretty
> specific, we might be able to get away with a simple fix that works just for
> OMPI (but wouldn't necessarily be suitable for all other libltdl users).
I made a patch for exactly what I described: it comments out the preopen
module's setting of FILE_NOT_FOUND. But now I'm getting foiled by the use of
RTLD_LAZY. For example, if I add a bogus symbol that can't be resolved into
the TCP BTL, I get this when I run ompi_info:
-----
...lots of ompi_info config output...
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128
dyld: lazy symbol binding failed: Symbol not found:
_jeffs_symbol_that_does_not_exist
Referenced from: /Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp.so
Expected in: flat namespace
[ ompi_info aborts ]
-----
This is happening because libltdl's dlopen() is being invoked with RTLD_LAZY so
the fact that a symbol can't be resolved at dlopen() time is not a problem. It
becomes a fatal problem later when the component's open function is invoked and
my unresolved symbol is exposed in all of its glory.
If I manually change LT_LAZY_OR_NOW to RTLD_NOW in
libltdl/loaders/dlopen.c, then I get the behavior I was expecting:
-----
...lots of ompi_info config output...
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128
[rtp-jsquyres-8717.cisco.com:89384] mca: base: component_find: unable to open
/Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp:
dlopen(/Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp.so, 10): Symbol not found:
_jeffs_symbol_that_does_not_exist
Referenced from: /Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp.so
Expected in: flat namespace
in /Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp.so (ignored)
MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.7)
MCA paffinity: darwin (MCA v2.0, API v2.0, Component v1.7)
...lots of ompi_info config output...
-----
I.e., the dlopen() fails and my patch causes us to actually get a reasonable
error message from libltdl.
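
For anyone who wants to poke at the lazy-vs-now difference outside of libltdl,
here is a minimal sketch that calls dlopen() directly with each binding mode
and reports what dlerror() says. The helper names try_open() and report() are
made up for illustration; they are not part of libltdl or OMPI, and this is not
the patch described above. On Linux you may need to link with -ldl.

```c
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from libltdl or OMPI): try to dlopen() `path`
 * with the given binding mode (RTLD_LAZY or RTLD_NOW).
 * Returns 0 on success, -1 on failure. On failure, the dlerror() text is
 * copied into errbuf so the caller can print something useful. */
static int try_open(const char *path, int mode, char *errbuf, size_t errlen)
{
    void *handle;
    const char *msg;

    if (errlen > 0) {
        errbuf[0] = '\0';
    }
    dlerror();                      /* clear any stale error string */
    handle = dlopen(path, mode);
    if (handle == NULL) {
        msg = dlerror();
        snprintf(errbuf, errlen, "%s", msg ? msg : "unknown dlopen error");
        return -1;
    }
    dlclose(handle);
    return 0;
}

/* Hypothetical helper: open the same file with both modes and print the
 * outcome of each, so the two behaviors can be compared side by side. */
static void report(const char *path)
{
    char err[512];

    if (try_open(path, RTLD_LAZY, err, sizeof(err)) != 0) {
        printf("RTLD_LAZY: failed: %s\n", err);
    } else {
        /* Unresolved function symbols may still blow up on first call. */
        printf("RTLD_LAZY: opened (function symbols not checked yet)\n");
    }

    if (try_open(path, RTLD_NOW, err, sizeof(err)) != 0) {
        printf("RTLD_NOW:  failed: %s\n", err);
    } else {
        printf("RTLD_NOW:  opened (all symbols resolved)\n");
    }
}
```

Calling report() on a component with an unresolvable function symbol should
show the RTLD_LAZY open succeeding and the RTLD_NOW open failing with the
missing symbol named in the error string. Note that lazy binding only defers
function references; an unresolved data symbol fails at dlopen() time in either
mode.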
So:
1. Given that I'm seeing this on both Linux (RHEL4) and OSX, LT_LAZY_OR_NOW
must be resolving to RTLD_LAZY on both Linux and OSX -- so how are you getting
the error message that you're getting? Is your system somehow using RTLD_NOW?
2. If OSX and Linux both use RTLD_LAZY, is my patch useful? I'm hesitant to
add it if it's only partially (or not at all) useful...
--
Jeff Squyres
[email protected]