Hi there,

This is an issue that I raised a while ago on the R HPC SIG mailing list and which then moved into an off-list conversation with Jeff Squyres, but on which no progress has been made.

I believe that the issue is less with Rmpi than with something that Rmpi is exposing in OpenMPI specifically on NetBSD, hence posting here. (FWIW, I have since had an Rmpi/R/SGE/OpenMPI stack running on RHEL/VMware, once I realised that I had to exclude the virbr0 interfaces that OpenMPI seemed to take quite a liking to!)

I appreciate that few on the list are running OpenMPI on NetBSD but, as detailed below, I found the OpenMPI thread "[OMPI devel] Missing Symbol" that seems to tie in with the problem I am seeing and, more importantly, originated away from a NetBSD implementation.

I thus thought I'd stick the guts of the off-list conversation onto the OpenMPI list and see if anyone else who may have been involved with the "Missing Symbol" thread has any ideas.

There would seem to have been four emails of relevance from that off-list conversation, so eyes down, looking for a full house:

=== Part 1 ===

Basically, when I come to load the Rmpi library

  > library(Rmpi, lib.loc="/local/scratch/kevin/Pkgs/R/")

I get a swathe of OpenMPI errors (attached below)

  [europa.ecs.vuw.ac.nz:09687] mca: base: component_find: unable to open /usr/pkg/lib/openmpi/mca_carto_auto_detect: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
  [europa.ecs.vuw.ac.nz:09687] mca: base: component_find: unable to open /usr/pkg/lib/openmpi/mca_carto_file: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)

=== Part 2 ===

> An off the wall question -- do you have multiple versions of Open MPI
> installed on the system, perchance? I wonder if you compiled Rmpi.so with
> one version of OMPI and it's picking up libmpi.so from the other version
> (or something along those lines). Mismatches between the versions might
> well cause issues like this...?

That's not it. Everything is from 1.4.1.

I have once again delved deeper into the innards of the OpenMPI source than I would have expected and seen that the error message is coming from just after

  File:    opal/mca/base/mca_base_component_find.c
  Routine: static int open_component(component_file_item_t *target_file,
                                     opal_list_t *found_components)
  Code:
    #if OPAL_HAVE_LTDL_ADVISE
        component_handle = lt_dlopenadvise(target_file->filename, opal_mca_dladvise);
    #else
        component_handle = lt_dlopenext(target_file->filename);
    #endif

where there's a bit of ferkling going on so as to check for a given file existing, hence the "slightly better error message".

We have

  ./opal/include/opal_config.h:#define OPAL_HAVE_LTDL_ADVISE 0

so we are invoking the lt_dlopenext clause.

mca_base_component_find.c is a file that gets patched in the NetBSD build as follows

  $ diff opal/mca/base/mca_base_component_find.c{.orig,}
  44,46d43
  < #ifndef __WINDOWS__
  < #include "opal/libltdl/ltdl.h"
  < #else
  48d44
  < #endif

i.e. we have taken out the inclusion of opal/libltdl/ltdl.h to force the use of the NetBSD "ltdl.h" one, which I guess might point to something underlying the issue, but as to what ...

OK, from what I can see, I have

  $ ls -l /usr/pkg/lib/openmpi/mca_carto_auto_detect*
  -rw-r--r--  1 root  wheel  3892 Mar 22 16:21 /usr/pkg/lib/openmpi/mca_carto_auto_detect.a
  -rwxr-xr-x  1 root  wheel  1105 Mar 22 16:21 /usr/pkg/lib/openmpi/mca_carto_auto_detect.la*
  -rwxr-xr-x  1 root  wheel  7078 Mar 22 16:21 /usr/pkg/lib/openmpi/mca_carto_auto_detect.so*

However, there are no "versioned" links for the .so file (.so.0, .so.0.0.0 etc.), but would that be an issue? Probably not.

Furthermore, the Autobook (yes, I read some of that too!) says:

  Function: lt_dlhandle lt_dlopenext (const char *filename)

    This function is used in precisely the same way as lt_dlopen.
    However, if the search for the named module by exact match against
    filename fails, it will try again with a `.la' extension, and then
    the native shared library extension (`.sl' on HP-UX, for example).

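As a cross-check on the Autobook description, here is a minimal sketch in C of the lookup order alone: the exact name first, then the name with `.la` appended, then the name with the native shared library extension appended. The helper name first_existing_candidate and the idea of passing the native extension as an argument are assumptions made for illustration and are not part of libltdl. The real lt_dlopenext does considerably more than test for existence (it reads the .la file and hands the module on to the system loader), so this only answers the question "which candidate file would it have found?".

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper, not a libltdl function.
 * Walks the three candidates in the order the Autobook gives for
 * lt_dlopenext: exact name, name + ".la", name + native extension.
 * Returns 0, 1 or 2 for the first candidate that is readable and
 * copies its path into `found`; returns -1 if none is readable.
 */
static int first_existing_candidate(const char *base, const char *native_ext,
                                    char *found, size_t found_len)
{
    const char *suffixes[3];
    char path[4096];
    int i;

    suffixes[0] = "";          /* exact match against filename */
    suffixes[1] = ".la";       /* libtool library file */
    suffixes[2] = native_ext;  /* e.g. ".so" (an assumption for NetBSD) */

    for (i = 0; i < 3; i++) {
        int n = snprintf(path, sizeof path, "%s%s", base, suffixes[i]);
        if (n < 0 || (size_t)n >= sizeof path)
            continue;          /* candidate path too long, skip it */
        if (access(path, R_OK) == 0) {
            if (found != NULL && found_len > 0)
                snprintf(found, found_len, "%s", path);
            return i;
        }
    }
    return -1;
}
```

Called with the name from the error message, e.g. first_existing_candidate("/usr/pkg/lib/openmpi/mca_carto_auto_detect", ".so", buf, sizeof buf), a non-negative result would mean that a candidate file is present and readable, which in turn would point away from a plain "file is missing" explanation.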
So the file that would end up being referenced obviously exists, so why would lt_dlopenext not be able to open the library there?

It would seem (from the error message) that what's being passed to the routine as target_file->filename is

  /usr/pkg/lib/openmpi/mca_carto_file

and so lt_dlopenext should at least find the .la and the .so rather than punt, no?

I am at a loss as to how to debug this further, as my experience of adding flags to openmpi invocations is zero.

In case you speak libtool (?) I enclose the .la file, but nothing looks "wrong" to me.

Kevin

  # mca_carto_auto_detect.la - a libtool library file
  # Generated by ltmain.sh (GNU libtool) 2.2.6b
  #
  # Please DO NOT delete this file!
  # It is necessary for linking the library.

  # The name that we can dlopen(3).
  dlname='mca_carto_auto_detect.so'

  # Names of this library.
  library_names='mca_carto_auto_detect.so mca_carto_auto_detect.so mca_carto_auto_detect.so'

  # The name of the static archive.
  old_library='mca_carto_auto_detect.a'

  # Linker flags that can not go in dependency_libs.
  inherited_linker_flags=' -pthread'

  # Libraries that this one depends upon.
  dependency_libs='-L/usr/pkg/lib -lutil -lm -lpthread'

  # Names of additional weak libraries provided by this library
  weak_library_names=''

  # Version information for mca_carto_auto_detect.
  current=0
  age=0
  revision=0

  # Is this an already installed library?
  installed=yes

  # Should we warn about portability when linking against -modules?
  shouldnotlink=yes

  # Files to dlopen/dlpreopen
  dlopen=''
  dlpreopen=''

  # Directory that this library needs to be installed in:
  libdir='/usr/pkg/lib/openmpi'
  # This file has been modified by buildlink3.

=== Part 3 ===

> Furthermore, the Autobook (yes, I read some of that too!)

Indeed, I read so much I thought I'd compile the example it has about accessing a dynamic library, viz.:

  http://sourceware.org/autobook/autobook/autobook_169.html

A slightly modified and compiled version of that code shows:

  $ ./test_lt_dlopnext /usr/pkg/lib/openmpi/mca_carto_auto_detect.so thing
  ./test_lt_dlopnext: file not found

so for some reason, when used in this context, lt_dlopenext() is failing.

However, as we know, OpenMPI does run on our systems here, so why would we be tickling

  opal/mca/base/mca_base_component_find.c

here and not in other invocations of OpenMPI?

=== Part 4 ===

Is this, in some way, tied into the OpenMPI devel thread I just found containing this posting?

  http://www.open-mpi.org/community/lists/devel/2010/03/7556.php

  Subject: Re: [OMPI devel] Missing Symbol
  From: Jeff Squyres (jsquyres_at_[hidden])
  Date: 2010-03-05 18:26:13

In case it is of any benefit (the details of what OpenMPI is doing here may be escaping me, and there's a suggestion in that posting that dlopenext fails "silently" if there's a missing symbol): what would be the missing symbol in this?

  $ ldd /usr/pkg/lib/openmpi/mca_carto_auto_detect.so
  /usr/pkg/lib/openmpi/mca_carto_auto_detect.so:
          -lc.12 => /usr/lib/libc.so.12
          -lutil.7 => /usr/lib/libutil.so.7
          -lm.0 => /usr/lib/libm.so.0
          -lpthread.0 => /usr/lib/libpthread.so.0

  $ nm /usr/pkg/lib/openmpi/mca_carto_auto_detect.so
  00001b54 A _DYNAMIC
  00001c3c a _GLOBAL_OFFSET_TABLE_
           w _Jv_RegisterClasses
  00001a48 d __CTOR_END__
  00001a44 d __CTOR_LIST__
  00001a50 d __DTOR_END__
  00001a4c d __DTOR_LIST__
  00000a14 r __EH_FRAME_BEGIN__
  00000a14 r __FRAME_END__
  00001a54 d __JCR_END__
  00001a54 d __JCR_LIST__
  00001c74 A __bss_start
           w __cxa_finalize
           w __deregister_frame_info
  00000954 t __do_global_ctors_aux
  00000750 t __do_global_dtors_aux
  00001c68 d __dso_handle
  00000887 t __i686.get_pc_thunk.bx
           w __register_frame_info
  00001c74 A _edata
  00001c90 A _end
  00000990 T _fini
  000006b0 T _init
  00000830 t auto_detect_open
  00001c74 b completed.3420
  000007cc t frame_dummy
  00001b38 d loc_module
           U mca_base_param_find
           U mca_base_param_lookup_int
           U mca_base_param_reg_int
  00001a60 D mca_carto_auto_detect_component
  00001c78 b object.3478
  00000890 t opal_carto_auto_detect_component_query
  00001c70 d opal_carto_auto_detect_component_version_string
  000008f0 t opal_carto_auto_detect_finalize
  00000920 t opal_carto_auto_detect_init
           U opal_carto_base_common_host_graph
           U opal_carto_base_free_graph_fn
           U opal_carto_base_get_nodes_distance_fn
           U opal_carto_base_graph_create_fn
           U opal_carto_base_graph_find_node_fn
           U opal_carto_base_graph_get_host_graph_fn
           U opal_carto_base_graph_spf_fn
  00001c6c d p.3418

Hoping against hope that someone might have an idea,
Kevin

--
Kevin M. Buckley                        Room:  CO327
School of Engineering and               Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand
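
One way to find out which symbol, if any, is missing would be to bypass libltdl and ask the system loader directly. The following is a minimal sketch in C, assuming only the POSIX dlopen(3)/dlerror(3) interface; the helper name try_load is hypothetical and nothing here comes from the OpenMPI source. With RTLD_NOW the loader has to resolve every undefined ("U") symbol in the nm listing at load time, so a failure message from dlerror() would be expected to be more specific than the generic "file not found" that the libltdl test program printed.

```c
#include <dlfcn.h>
#include <stdio.h>

/*
 * Hypothetical helper, not part of OpenMPI or libltdl.
 * Tries to dlopen `path` with the given flags.
 * Returns 0 on success and leaves `err` empty; returns -1 on failure
 * and copies the loader's own message into `err`.
 */
static int try_load(const char *path, int flags, char *err, size_t err_len)
{
    void *handle;

    if (err_len > 0)
        err[0] = '\0';

    dlerror();                        /* clear any stale error state */
    handle = dlopen(path, flags);
    if (handle == NULL) {
        const char *msg = dlerror();  /* loader's explanation, if any */
        snprintf(err, err_len, "%s",
                 msg != NULL ? msg : "unknown dlopen failure");
        return -1;
    }

    dlclose(handle);                  /* we only wanted to know if it loads */
    return 0;
}
```

A call such as try_load("/usr/pkg/lib/openmpi/mca_carto_auto_detect.so", RTLD_NOW | RTLD_GLOBAL, buf, sizeof buf) from a small standalone program would then show what the loader itself objects to. Since the undefined mca_base_* and opal_carto_base_* symbols are not supplied by any of the libraries in the ldd output, it seems plausible that they are only resolvable when the OpenMPI libraries have already been loaded with their symbols globally visible, but that is an assumption to be checked rather than a finding.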