Hi there,

this is an issue that I started a while ago on the R HPC SIG mailing
list and which then moved into an off-list conversation with Jeff
Squyres but on which no progress has been made.

I believe that the issue is less with Rmpi than with something
that Rmpi is exposing in OpenMPI specifically on NetBSD, hence
posting here.

(FWIW, I have since had an Rmpi/R/SGE/OpenMPI stack running on
 RHEL/Vmware, once I realised that I had to exclude the virbr0
 interfaces that OpenMPI seemed to take quite a liking to!)

I appreciate that few on the list are running OpenMPI on NetBSD
but, as detailed below, I found the OpenMPI thread

"[OMPI devel] Missing Symbol"

that seems to tie in with the problem I am seeing and, more
importantly, originated away from a NetBSD implementation.

I thus thought I'd stick the guts of the off-list conversation
onto the OpenMPI list and see if anyone else who may have been
involved with the "Missing Symbol" thread has any ideas.

There would seem to have been four emails of relevance from that
off-list conversation, so eyes down, looking for a full house:



=== Part 1 ===

Basically, when I come to load the Rmpi library

> library(Rmpi, lib.loc="/local/scratch/kevin/Pkgs/R/")

I get a swathe of OpenMPI errors (attached below)


[europa.ecs.vuw.ac.nz:09687] mca: base: component_find: unable to open
/usr/pkg/lib/openmpi/mca_carto_auto_detect: perhaps a missing symbol, or
compiled for a different version of Open MPI? (ignored)
[europa.ecs.vuw.ac.nz:09687] mca: base: component_find: unable to open
/usr/pkg/lib/openmpi/mca_carto_file: perhaps a missing symbol, or compiled
for a different version of Open MPI? (ignored)


=== Part 2 ===

> An off the wall question -- do you have multiple versions of Open MPI
> installed on the system, perchance?  I wonder if you compiled Rmpi.so with
> one version of OMPI and it's picking up libmpi.so from the other version
> (or something along those lines).  Mismatches between the versions might
> well cause issues like this...?

That's not it. Everything is from 1.4.1.

I have once again delved deeper into the innards of the OpenMPI
source than I would have expected and seen that the error message
is coming from just after

File:
opal/mca/base/mca_base_component_find.c

Routine:
static int open_component(component_file_item_t *target_file,
                       opal_list_t *found_components)

Code:
#if OPAL_HAVE_LTDL_ADVISE
  component_handle = lt_dlopenadvise(target_file->filename,
                                     opal_mca_dladvise);
#else
  component_handle = lt_dlopenext(target_file->filename);
#endif


where there's a bit of ferkling going on so as to check for
a given file existing, hence the "slightly better error message".

We have

./opal/include/opal_config.h:#define OPAL_HAVE_LTDL_ADVISE 0

so we are invoking the lt_dlopenext clause.

That is a file that gets patched in the NetBSD build as follows

$diff opal/mca/base/mca_base_component_find.c{.orig,}
44,46d43
<   #ifndef __WINDOWS__
<     #include "opal/libltdl/ltdl.h"
<   #else
48d44
<   #endif

i.e. we have taken out the inclusion of

opal/libltdl/ltdl.h

to force the use of the NetBSD "ltdl.h" one, which I guess might point
to something underlying the issue but as to what ...

OK, from what I can see, I have

$ls -l /usr/pkg/lib/openmpi/mca_carto_auto_detect*
-rw-r--r-- 1 root wheel 3892 Mar 22 16:21
/usr/pkg/lib/openmpi/mca_carto_auto_detect.a
-rwxr-xr-x 1 root wheel 1105 Mar 22 16:21
/usr/pkg/lib/openmpi/mca_carto_auto_detect.la*
-rwxr-xr-x 1 root wheel 7078 Mar 22 16:21
/usr/pkg/lib/openmpi/mca_carto_auto_detect.so*


however there are no "versioned" links for the .so file
(.so.0, .so.0.0.0 etc) but would that be an issue - probably not.


Furthermore, the Autobook (yes, I read some of that too!) says:

Function: lt_dlhandle lt_dlopenext (const char *filename)
    This function is used in precisely the same way as lt_dlopen. However,
if the search for the named module by exact match against filename
fails, it will try again with a `.la' extension, and then the native
shared library extension (`.sl' on HP-UX, for example).

so the file that will end up being referenced obviously exists,
so why would

lt_dlopenext

not be able to open the library there?

It would seem (from the error message) that what's being passed
to the routine as

target_file->filename

is

/usr/pkg/lib/openmpi/mca_carto_file

and so lt_dlopenext should at least find the .la and the .so
rather than punt, no ?
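
To make the search order quoted from the Autobook concrete, here is a
minimal Python sketch of it. It is not libltdl's code: the helper name
dlopenext_candidates and the hard-coded ".so" native extension are my
assumptions for illustration, and it only checks which candidate files
exist, not whether they can actually be loaded.

```python
import os


def dlopenext_candidates(filename, native_ext=".so"):
    """Return (candidate_path, exists) pairs in the documented search order.

    Hypothetical helper, not part of libltdl: the exact name first, then the
    name with '.la' appended, then the name with the native extension.
    """
    candidates = [filename, filename + ".la", filename + native_ext]
    return [(path, os.path.isfile(path)) for path in candidates]


if __name__ == "__main__":
    # Path taken from the error message above.
    base = "/usr/pkg/lib/openmpi/mca_carto_file"
    for path, exists in dlopenext_candidates(base):
        print(("found   " if exists else "missing ") + path)
```

If both the .la and the .so show up as found, the failure is unlikely
to be a plain "no such file" problem and more likely something that
happens once the loader tries to map the object.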

I am at a loss as to how to debug this further, as my experience of
adding flags to openmpi invocations is zero.


In case you speak libtool (?) I enclose the .la file but nothing
looks "wrong" to me.

Kevin


# mca_carto_auto_detect.la - a libtool library file
# Generated by ltmain.sh (GNU libtool) 2.2.6b
#
# Please DO NOT delete this file!
# It is necessary for linking the library.

# The name that we can dlopen(3).
dlname='mca_carto_auto_detect.so'

# Names of this library.
library_names='mca_carto_auto_detect.so mca_carto_auto_detect.so mca_carto_auto_detect.so'

# The name of the static archive.
old_library='mca_carto_auto_detect.a'

# Linker flags that can not go in dependency_libs.
inherited_linker_flags=' -pthread'

# Libraries that this one depends upon.
dependency_libs='-L/usr/pkg/lib -lutil -lm -lpthread'

# Names of additional weak libraries provided by this library
weak_library_names=''

# Version information for mca_carto_auto_detect.
current=0
age=0
revision=0

# Is this an already installed library?
installed=yes

# Should we warn about portability when linking against -modules?
shouldnotlink=yes

# Files to dlopen/dlpreopen
dlopen=''
dlpreopen=''

# Directory that this library needs to be installed in:
libdir='/usr/pkg/lib/openmpi'

# This file has been modified by buildlink3.


=== Part 3 ===

> Furthermore, the Autobook (yes, I read some of that too!)

Indeed, I read so much I thought I'd compile the example
it has about accessing a dynamic library, viz:

http://sourceware.org/autobook/autobook/autobook_169.html

A slightly modified and compiled version of that code shows:

$./test_lt_dlopnext /usr/pkg/lib/openmpi/mca_carto_auto_detect.so thing
./test_lt_dlopnext: file not found

so for some reason, when used in this context, lt_dlopenext()
is failing.

However, as we know, OpenMPI does run on our systems here, so why
would we be tickling:

opal/mca/base/mca_base_component_find.c

here and not in other invocations of OpenMPI ?
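
The "file not found" text above comes from libltdl, which may be hiding
a more specific complaint from the system loader. One way to see the
loader's own message is to call dlopen(3) directly; the sketch below
does that through Python's ctypes, which wraps dlopen on Unix systems.
Assumptions: the helper name try_dlopen is hypothetical, Python with
ctypes is available on the host, and this bypasses libltdl entirely,
so it only shows what the underlying loader reports.

```python
import ctypes
import os


def try_dlopen(path):
    """Try to load a shared object; return None on success, else the error text.

    Hypothetical helper. RTLD_NOW asks the loader to resolve every symbol at
    load time, so an unresolved symbol is reported immediately, not lazily.
    """
    try:
        ctypes.CDLL(path, mode=os.RTLD_NOW)
    except OSError as exc:
        # On Unix the exception text is the dlerror() string.
        return str(exc)
    return None


if __name__ == "__main__":
    err = try_dlopen("/usr/pkg/lib/openmpi/mca_carto_auto_detect.so")
    print("loaded OK" if err is None else "dlopen failed: " + err)
```

If the reported text names an undefined symbol rather than a missing
file, that would point at symbol resolution, not at the file lookup.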


=== Part 4 ===

Is this, in some way, tied into the OpenMPI devel thread I
just found containing this posting?

http://www.open-mpi.org/community/lists/devel/2010/03/7556.php

  Subject: Re: [OMPI devel] Missing Symbol
  From: Jeff Squyres (jsquyres_at_[hidden])
  Date: 2010-03-05 18:26:13

In case it is of any benefit, as the actualities of what OpenMPI
is doing here may be escaping me, and given there's a suggestion in
that thread that lt_dlopenext fails "silently" if there's a missing
symbol: what would be the missing symbol in this case?

$ldd /usr/pkg/lib/openmpi/mca_carto_auto_detect.so
/usr/pkg/lib/openmpi/mca_carto_auto_detect.so:
        -lc.12 => /usr/lib/libc.so.12
        -lutil.7 => /usr/lib/libutil.so.7
        -lm.0 => /usr/lib/libm.so.0
        -lpthread.0 => /usr/lib/libpthread.so.0

$nm /usr/pkg/lib/openmpi/mca_carto_auto_detect.so
00001b54 A _DYNAMIC
00001c3c a _GLOBAL_OFFSET_TABLE_
         w _Jv_RegisterClasses
00001a48 d __CTOR_END__
00001a44 d __CTOR_LIST__
00001a50 d __DTOR_END__
00001a4c d __DTOR_LIST__
00000a14 r __EH_FRAME_BEGIN__
00000a14 r __FRAME_END__
00001a54 d __JCR_END__
00001a54 d __JCR_LIST__
00001c74 A __bss_start
         w __cxa_finalize
         w __deregister_frame_info
00000954 t __do_global_ctors_aux
00000750 t __do_global_dtors_aux
00001c68 d __dso_handle
00000887 t __i686.get_pc_thunk.bx
         w __register_frame_info
00001c74 A _edata
00001c90 A _end
00000990 T _fini
000006b0 T _init
00000830 t auto_detect_open
00001c74 b completed.3420
000007cc t frame_dummy
00001b38 d loc_module
         U mca_base_param_find
         U mca_base_param_lookup_int
         U mca_base_param_reg_int
00001a60 D mca_carto_auto_detect_component
00001c78 b object.3478
00000890 t opal_carto_auto_detect_component_query
00001c70 d opal_carto_auto_detect_component_version_string
000008f0 t opal_carto_auto_detect_finalize
00000920 t opal_carto_auto_detect_init
         U opal_carto_base_common_host_graph
         U opal_carto_base_free_graph_fn
         U opal_carto_base_get_nodes_distance_fn
         U opal_carto_base_graph_create_fn
         U opal_carto_base_graph_find_node_fn
         U opal_carto_base_graph_get_host_graph_fn
         U opal_carto_base_graph_spf_fn
00001c6c d p.3418
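
To narrow down which symbols could be "missing", it may help to pull out
just the entries nm marks as "U" (undefined), since those are the ones
the loader has to find somewhere else in the process. The Python sketch
below does that for nm text in the format shown above; the function
name undefined_symbols is hypothetical and the sample is a few lines
copied from the listing. Note that the ldd output lists only libc,
libutil, libm and libpthread, so names such as mca_base_param_find
would presumably have to come from libraries already loaded by the
calling program.

```python
def undefined_symbols(nm_output):
    """Return the names that nm marks 'U' (undefined, must come from elsewhere).

    Weak references ('w') are skipped: the loader tolerates their absence.
    """
    names = []
    for line in nm_output.splitlines():
        fields = line.split()
        # Undefined entries have no address column, only "U name".
        if len(fields) == 2 and fields[0] == "U":
            names.append(fields[1])
    return names


if __name__ == "__main__":
    sample = """\
00000830 t auto_detect_open
         w __cxa_finalize
         U mca_base_param_find
         U opal_carto_base_graph_create_fn
00001a60 D mca_carto_auto_detect_component
"""
    for name in undefined_symbols(sample):
        print(name)
```

Each name it prints could then be looked for in the nm output of the
libraries that the calling process actually has loaded.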


Hoping against hope that someone might have an idea,
Kevin

-- 
Kevin M. Buckley                                  Room:  CO327
School of Engineering and                         Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand
