On Mar 5, 2010, at 6:02 PM, Jeff Squyres (jsquyres) wrote:

> I wondered aloud on IM to Terry after your earlier emails if we should just
> custom-patch ltdl in OMPI to fix this issue. The problem is that libltdl is
> effectively reporting the "wrong" error back to OMPI, so the error string
> that we get to print out ends up not being very useful (e.g., not showing
> which symbol was missing, or what the problem was with the dlopen). Fixing
> this properly in libltdl is actually somewhat tricky -- which is why it
> hasn't been fixed yet. But given that OMPI's use of libltdl is pretty
> specific, we might be able to get away with a simple fix that works just for
> OMPI (but wouldn't necessarily be suitable for all other libltdl users).

I made a patch for exactly what I described: it comments out the preopen module's setting of FILE_NOT_FOUND.

But now I'm getting foiled by the use of RTLD_LAZY. For example, if I add a bogus symbol that can't be resolved into the TCP BTL, I get this when I run ompi_info:

-----
...lots of ompi_info config output...
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
dyld: lazy symbol binding failed: Symbol not found: _jeffs_symbol_that_does_not_exist
  Referenced from: /Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp.so
  Expected in: flat namespace

[ ompi_info aborts ]
-----

This is happening because libltdl's dlopen() is being invoked with RTLD_LAZY, so the fact that a symbol can't be resolved at dlopen() time is not a problem. It becomes a fatal problem later, when the component's open function is invoked and my unresolved symbol is exposed in all of its glory.

If I manually change LT_LAZY_OR_NOW to RTLD_NOW in libltdl/loaders/dlopen.c, then I get the behavior I was expecting:

-----
...lots of ompi_info config output...
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
[rtp-jsquyres-8717.cisco.com:89384] mca: base: component_find: unable to open /Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp: dlopen(/Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp.so, 10): Symbol not found: _jeffs_symbol_that_does_not_exist
  Referenced from: /Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp.so
  Expected in: flat namespace
 in /Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp.so (ignored)
           MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.7)
           MCA paffinity: darwin (MCA v2.0, API v2.0, Component v1.7)
...lots of ompi_info config output...
-----

I.e., the dlopen() fails and my patch causes us to actually get a reasonable error message from libltdl.

So:

1. Given that I'm seeing this on both Linux (RHEL4) and OSX, LT_LAZY_OR_NOW must be resolving to RTLD_LAZY on both Linux and OSX -- so how are you getting the error message that you're getting? Is your system somehow using RTLD_NOW?

2. If OSX and Linux both use RTLD_LAZY, is my patch useful? I'm hesitant to add it if it's only partially (or not at all) useful...

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
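
For anyone who wants to reproduce the RTLD_LAZY versus RTLD_NOW difference described above without going through libltdl, here is a minimal sketch that calls dlopen() directly. The helper name open_plugin_now() and its error-buffer convention are hypothetical and exist only for illustration; they are not part of libltdl or Open MPI. The sketch assumes a POSIX system that provides <dlfcn.h>.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not libltdl or Open MPI code): open a plugin with
 * immediate binding. With RTLD_NOW, symbols that cannot be resolved make
 * dlopen() itself fail, so the problem is reported here instead of at the
 * first call into the plugin.
 *
 * On failure, returns NULL and copies the dlerror() text into errbuf.
 * On success, returns the handle and sets errbuf to the empty string.
 */
void *open_plugin_now(const char *path, char *errbuf, size_t errlen)
{
    void *handle;

    dlerror(); /* discard any stale error from an earlier dl* call */

    handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        const char *msg = dlerror();
        if (errlen > 0) {
            snprintf(errbuf, errlen, "%s",
                     msg != NULL ? msg : "unknown dlopen error");
        }
    } else if (errlen > 0) {
        errbuf[0] = '\0';
    }
    return handle;
}
```

Calling open_plugin_now() on a component with an unresolved symbol should return NULL and leave a loader message in errbuf; the exact wording of that message is platform-specific. Swapping RTLD_NOW for RTLD_LAZY in the sketch would let the dlopen() succeed on platforms that bind function symbols lazily, deferring the failure to the first call through the missing symbol. On older glibc systems the program may need to be linked with -ldl.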