On Mar 5, 2010, at 6:02 PM, Jeff Squyres (jsquyres) wrote:

> I wondered aloud on IM to Terry after your earlier emails if we should just 
> custom-patch ltdl in OMPI to fix this issue.  The problem is that libltdl is 
> effectively reporting the "wrong" error back to OMPI, so the error string 
> that we get to print out ends up not being very useful (e.g., not showing 
> which symbol was missing, or what the problem was with the dlopen).  Fixing 
> this properly in libltdl is actually somewhat tricky -- which is why it 
> hasn't been fixed yet.  But given that OMPI's use of libltdl is pretty 
> specific, we might be able to get away with a simple fix that works just for 
> OMPI (but wouldn't necessarily be suitable for all other libltdl users).

I made a patch for exactly what I described: it comments out the preopen 
module's setting of FILE_NOT_FOUND.  But now I'm getting foiled by the use of 
RTLD_LAZY.  For example, if I add a reference to a bogus, unresolvable symbol 
to the TCP BTL, I get this when I run ompi_info:

-----
...lots of ompi_info config output...
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
dyld: lazy symbol binding failed: Symbol not found: 
_jeffs_symbol_that_does_not_exist
  Referenced from: /Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp.so
  Expected in: flat namespace
[ ompi_info aborts ]
-----

This is happening because libltdl invokes dlopen() with RTLD_LAZY, so a symbol 
that can't be resolved is not an error at dlopen() time.  It becomes a fatal 
problem later, when the component's open function is invoked and my unresolved 
symbol is exposed in all of its glory.
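
To make the difference concrete, here is a minimal sketch of the two binding 
modes using plain dlopen() rather than libltdl.  The helper name try_open() 
is made up for illustration; it is not part of OMPI or libltdl.  With 
RTLD_NOW, all undefined symbols are resolved before dlopen() returns, so a 
missing one shows up in dlerror(); with RTLD_LAZY, function references may 
not be resolved until they are first called.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/*
 * Hypothetical helper (not from OMPI or libltdl): try to dlopen() a
 * plugin with the given binding mode and print dlerror() on failure.
 *
 *   flags = RTLD_LAZY : function symbols may be bound on first call, so
 *                       an unresolvable one can go unnoticed here
 *   flags = RTLD_NOW  : all undefined symbols are resolved before
 *                       dlopen() returns, so an unresolvable one makes
 *                       dlopen() fail
 *
 * Returns 0 if the file could be opened, -1 otherwise.
 */
int try_open(const char *path, int flags)
{
    /* dlopen(NULL, ...) means "the main program"; this sketch only
       handles named files. */
    assert(path != NULL);

    void *handle = dlopen(path, flags);
    if (handle == NULL) {
        const char *msg = dlerror();
        fprintf(stderr, "unable to open %s: %s\n",
                path, msg != NULL ? msg : "(no error string)");
        return -1;
    }

    dlclose(handle);
    return 0;
}
```

Calling try_open() on the component's .so file once with RTLD_LAZY and once 
with RTLD_NOW should show whether the missing symbol is caught at load time. 
On some Linux systems you may need to link with -ldl.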

If I manually change LT_LAZY_OR_NOW to RTLD_NOW in 
libltdl/loaders/dlopen.c, then I get the behavior I was expecting:

-----
...lots of ompi_info config output...
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
[rtp-jsquyres-8717.cisco.com:89384] mca: base: component_find: unable to open 
/Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp: 
dlopen(/Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp.so, 10): Symbol not found: 
_jeffs_symbol_that_does_not_exist
  Referenced from: /Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp.so
  Expected in: flat namespace
 in /Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp.so (ignored)
           MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.7)
           MCA paffinity: darwin (MCA v2.0, API v2.0, Component v1.7)
...lots of ompi_info config output...
-----

I.e., the dlopen() fails and my patch causes us to actually get a reasonable 
error message from libltdl.

So:

1. Given that I'm seeing this on both Linux (RHEL4) and OSX, LT_LAZY_OR_NOW 
must be resolving to RTLD_LAZY on both Linux and OSX -- so how are you getting 
the error message that you're getting?  Is your system somehow using RTLD_NOW?

2. If OSX and Linux both use RTLD_LAZY, is my patch useful?  I'm hesitant to 
add it if it's only partially (or not at all) useful...

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

