Tyson,

thanks for taking the time to do some more tests.


This is really a bug in Open MPI, and unlike what I thought earlier, there are still
some abstraction violations here and there related to ompio.


I filed https://github.com/open-mpi/ompi/pull/5263 in order to address them.


Meanwhile, you can configure Open MPI with --disable-dlopen and hopefully, that will be
enough to hide the issue.


Cheers,


Gilles


On 6/12/2018 5:58 AM, Tyson Whitehead wrote:
I have now also tried release 3.1.0.  Same thing (where I have replaced
/nix/store/glx60yay0hmmizhlxhqhnx9w3k4j9g1z-openmpi-3.1.0 with ...)

[orc-login2:107400] mca_base_component_repository_open: unable to open
mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:
undefined symbol: mca_common_ompio_file_write (ignored)
[orc-login2:107400] mca_base_component_repository_open: unable to open
mca_fcoll_dynamic_gen2: .../lib/openmpi/mca_fcoll_dynamic_gen2.so:
undefined symbol: mca_common_ompio_register_print_entry (ignored)
[orc-login2:107400] mca_base_component_repository_open: unable to open
mca_fcoll_dynamic: .../lib/openmpi/mca_fcoll_dynamic.so: undefined
symbol: mca_common_ompio_register_print_entry (ignored)
[orc-login2:107400] mca_base_component_repository_open: unable to open
mca_fcoll_two_phase: .../lib/openmpi/mca_fcoll_two_phase.so: undefined
symbol: mca_common_ompio_register_print_entry (ignored)
[orc-login2:107400] mca_base_component_repository_open: unable to open
mca_fcoll_static: .../lib/openmpi/mca_fcoll_static.so: undefined
symbol: mca_common_ompio_register_print_entry (ignored)
                  Package: Open MPI nixbld@localhost Distribution
                 Open MPI: 3.1.0
   Open MPI repo revision: v3.1.0
    Open MPI release date: May 07, 2018
                 Open RTE: 3.1.0
   Open RTE repo revision: v3.1.0
    Open RTE release date: May 07, 2018
                     OPAL: 3.1.0
       OPAL repo revision: v3.1.0
        OPAL release date: May 07, 2018
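
For longer runs of warnings like the ones above, a small script can collect the missing symbols per component. This is only a sketch: the function name `missing_symbols` is made up for illustration, and it assumes the warnings keep the `<path>.so: undefined symbol: <name> (ignored)` shape seen in this thread, possibly wrapped across lines.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches "<path>.so: undefined symbol: <name> (ignored)" once line wraps
# have been collapsed into single spaces.
WARNING_RE = re.compile(r"(\S+)\.so: undefined symbol: (\S+) \(ignored\)")


def missing_symbols(log_text: str) -> Dict[str, List[str]]:
    """Map each component name to the undefined symbols reported for it."""
    # The mail client wrapped the warnings, so normalise all whitespace first.
    flat = " ".join(log_text.split())
    result: Dict[str, List[str]] = defaultdict(list)
    for so_path, symbol in WARNING_RE.findall(flat):
        # Keep only the file name, e.g. "mca_fcoll_individual".
        component = so_path.rsplit("/", 1)[-1]
        result[component].append(symbol)
    return dict(result)
```

Feeding it the captured stderr of ompi_info gives one entry per component that failed to open, which makes it easier to compare releases.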

I straced the process, and, as far as I could tell, it was just mostly
opening the shared objects in alphabetical order.  Would appreciate
any insight, such as whether this is normal behaviour I can ignore or
not?

Thanks!  -Tyson
On Fri, 8 Jun 2018 at 17:37, Tyson Whitehead <twhiteh...@gmail.com> wrote:
This email starts out talking about version 1.10.7 to give a complete
picture.  I tested 2.1.3 as well; it also exhibits this issue,
although to a lesser extent, and I am asking for help on that
release.

I was compiling the OpenMPI 1.10.7 shipped with NixOS against a newer
libibverbs with a large set of drivers and got some strange errors
when running ompi_info (I've replaced the common prefix
/nix/store/9zm0pqsh67fw0xi5cpnybnd7hgzryffs-openmpi-1.10.7 with ...)

[mon241:04077] mca: base: component_find: unable to open
.../lib/openmpi/mca_btl_openib: .../lib/openmpi/mca_btl_openib.so:
undefined symbol: mca_mpool_grdma_evict (ignored)
[mon241:04077] mca: base: component_find: unable to open
.../lib/openmpi/mca_fcoll_individual:
.../lib/openmpi/mca_fcoll_individual.so: undefined symbol:
mca_io_ompio_file_write (ignored)
[mon241:04077] mca: base: component_find: unable to open
.../lib/openmpi/mca_fcoll_ylib: .../lib/openmpi/mca_fcoll_ylib.so:
undefined symbol: ompi_io_ompio_scatter_data (ignored)
[mon241:04077] mca: base: component_find: unable to open
.../lib/openmpi/mca_fcoll_dynamic:
.../lib/openmpi/mca_fcoll_dynamic.so: undefined symbol:
ompi_io_ompio_allgatherv_array (ignored)
[mon241:04077] mca: base: component_find: unable to open
.../lib/openmpi/mca_fcoll_two_phase:
.../lib/openmpi/mca_fcoll_two_phase.so: undefined symbol:
ompi_io_ompio_set_aggregator_props (ignored)
[mon241:04077] mca: base: component_find: unable to open
.../lib/openmpi/mca_fcoll_static: .../lib/openmpi/mca_fcoll_static.so:
undefined symbol: ompi_io_ompio_allgather_array (ignored)
                  Package: Open MPI nixbld@ Distribution
                Open MPI: 1.10.7
  Open MPI repo revision: v1.10.6-48-g5e373bf
   Open MPI release date: May 16, 2017
                Open RTE: 1.10.7
  Open RTE repo revision: v1.10.6-48-g5e373bf
   Open RTE release date: May 16, 2017
                    OPAL: 1.10.7
      OPAL repo revision: v1.10.6-48-g5e373bf
       OPAL release date: May 16, 2017
...

I dug into the first of these (figured out what library provided it,
looked at the declared dependencies, poked around in the automake
file), and, as far as I could determine, it seems that
mca_btl_openib.so simply isn't linked to list mca_mpool_grdma.so
(which provides the symbol) as a dependency.
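
The check described above (which library defines a symbol, and which dependencies a library declares) can be sketched as follows. This is an illustration only: it assumes the binutils tools `nm` and `readelf` are on the PATH, and the helper names (`parse_needed`, `parse_defined`, `who_defines`, `needed_by`) are hypothetical, not part of Open MPI.

```python
import re
import subprocess
from pathlib import Path
from typing import List, Set

# A DT_NEEDED line in `readelf -d` output looks like:
#  0x0000000000000001 (NEEDED)  Shared library: [libc.so.6]
NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library: \[(.+?)\]")


def parse_needed(readelf_output: str) -> List[str]:
    """Return the declared dependencies listed in `readelf -d` output."""
    return NEEDED_RE.findall(readelf_output)


def parse_defined(nm_output: str) -> Set[str]:
    """Return the symbol names listed in `nm -D --defined-only` output."""
    symbols = set()
    for line in nm_output.splitlines():
        fields = line.split()
        # Expected shape: "<address> <type> <name>"
        if len(fields) == 3:
            # Drop a symbol-version suffix such as "@@VERS_1" if present.
            symbols.add(fields[2].split("@")[0])
    return symbols


def run(*cmd: str) -> str:
    """Run a command and return its standard output as text."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def who_defines(symbol: str, component_dir: Path) -> List[str]:
    """List the .so files in component_dir that define the given symbol."""
    return [
        lib.name
        for lib in sorted(component_dir.glob("*.so"))
        if symbol in parse_defined(run("nm", "-D", "--defined-only", str(lib)))
    ]


def needed_by(lib: Path) -> List[str]:
    """List the libraries that lib declares as dependencies."""
    return parse_needed(run("readelf", "-d", str(lib)))
```

With `component_dir` pointing at the installation's lib/openmpi directory, `who_defines("mca_mpool_grdma_evict", component_dir)` would name the provider, and `needed_by(component_dir / "mca_btl_openib.so")` would show whether that provider appears among the declared dependencies.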

Seeing as 1.10.7 is no longer supported, I figured I would try 2.1.3
in case this has been fixed.  I compiled it up as well, and it seems
all but the mca_fcoll_individual one have been resolved (I've replaced
/nix/store/4kh0zbn8pmdqhvwagicswg70rwnpm570-openmpi-2.1.3 with ...)

[mon241:05544] mca_base_component_repository_open: unable to open
mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:
undefined symbol: ompio_io_ompio_file_read (ignored)
                  Package: Open MPI nixbld@ Distribution
                Open MPI: 2.1.3
  Open MPI repo revision: v2.1.2-129-gcfd8f3f
   Open MPI release date: Mar 13, 2018
                Open RTE: 2.1.3
  Open RTE repo revision: v2.1.2-129-gcfd8f3f
   Open RTE release date: Mar 13, 2018
                    OPAL: 2.1.3
      OPAL repo revision: v2.1.2-129-gcfd8f3f
       OPAL release date: Mar 13, 2018
...

Again, I was able to find this symbol in the mca_io_ompio.so library.
I looked through the source again, and it seems pretty clear that the
function is indeed called, but the library isn't linked to list the
mca_io_ompio.so library as a dependency.

Looking through the various shared libraries in the .../lib/openmpi
directory, though, it seems none of them have dependencies on each
other.  How is this supposed to work?  Is the component library just
supposed to load everything so all symbols get resolved?  Is the above
error I'm seeing really an error then?

Any insight would be appreciated.

Thanks!  -Tyson

PS:  Please note that the openmpi code was compiled without any
patches and without any special configure flags other than
--prefix=.... (NixOS also adds --disable-static and
--disable-dependency-tracking by default, but I removed those, and it
didn't make a difference).
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

