Re: [OMPI devel] Shared object dependencies
On Jun 12, 2018, at 7:34 AM, Gabriel, Edgar wrote:
>
> Well, I am still confused. What is different on nixOS vs. other linux distros
> that makes this error appear,

Fair enough. I don't think I realized nixOS was a Linux distro.

That being said, every time I think I understand linkers, I find out that I don't know jack about linkers. :-(

> and is it relevant enough for the backport or should we just go forward for
> 4.0? Is it again a RTLD_GLOBAL issue as it was back in 2014?

Yeah, we should probably figure this one out. I don't know the answer here.

> And last but not least, I raised on the github discussion one series question
> about the mca parameter names.
>
> No way for 2 series backport btw., that version did not even have
> common/ompio yet, that was introduced in the 3.0 release.

Good to know.

--
Jeff Squyres
jsquy...@cisco.com
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
Re: [OMPI devel] Shared object dependencies
Well, I am still confused. What is different on nixOS vs. other linux distros that makes this error appear, and is it relevant enough for the backport or should we just go forward for 4.0? Is it again a RTLD_GLOBAL issue as it was back in 2014?

And last but not least, I raised on the github discussion one series question about the mca parameter names.

No way for 2 series backport btw., that version did not even have common/ompio yet, that was introduced in the 3.0 release.

Thanks
Edgar

> -----Original Message-----
> From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Jeff
> Squyres (jsquyres) via devel
> Sent: Tuesday, June 12, 2018 9:31 AM
> To: Open MPI Developers List
> Cc: Jeff Squyres (jsquyres)
> Subject: Re: [OMPI devel] Shared object dependencies
>
> OIC -- right -- this was reported on NixOS, not vanilla Linux. Ok.
>
> These fixes will need to be back-ported to at least 3.0.x and 3.1.x, right?
>
> Do they need to also go to v2.1.x?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
Re: [OMPI devel] Shared object dependencies
On Jun 12, 2018, at 7:21 AM, Gilles Gouaillardet wrote:
>
> I think this also depends on the linker (configuration ?) and possibly the
> order the libraries are dlopen’ed.
>
> Note the issue was initially reported (as warnings only) from ompi_info, so
> there is a possibility we all missed it.
>
> That being said, the errors make perfect sense to me.
>
> fwiw, I installed a NixOS virtual machine and reproduced the issue right away.

OIC -- right -- this was reported on NixOS, not vanilla Linux. Ok.

These fixes will need to be back-ported to at least 3.0.x and 3.1.x, right?

Do they need to also go to v2.1.x?

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] Shared object dependencies
I think this also depends on the linker (configuration?) and possibly the order in which the libraries are dlopen’ed.

Note the issue was initially reported (as warnings only) from ompi_info, so there is a possibility we all missed it.

That being said, the errors make perfect sense to me.

fwiw, I installed a NixOS virtual machine and reproduced the issue right away.

Cheers,

Gilles

On Tuesday, June 12, 2018, Gabriel, Edgar wrote:
> No, I do not use --disable-dlopen, this is the other thing that is
> confusing to me, how come this error does not occur for anybody else.
> Thanks
> Edgar
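The dependence on load order and on declared dependencies can be illustrated with a toy model. This is only a sketch under stated assumptions: every undefined symbol is resolved at open time, and a lookup sees the global scope plus the opened object's own declared dependencies. The names SharedObject, ToyLoader, can_open and build_registry are made up for illustration, and the idea that mca_io_ompio.so declares libmca_common_ompio.so as a dependency is assumed here, not taken from the thread.

```python
from dataclasses import dataclass


@dataclass
class SharedObject:
    """Toy stand-in for a shared object; not a real ELF parser."""
    name: str
    defines: frozenset = frozenset()    # symbols this object exports
    undefined: frozenset = frozenset()  # symbols it expects from elsewhere
    needed: tuple = ()                  # declared dependencies, by name


class ToyLoader:
    """Hypothetical, simplified model of dlopen()-style symbol lookup."""

    def __init__(self, registry):
        self.registry = registry   # name -> SharedObject
        self.global_names = []     # objects visible to all later lookups

    def _with_dependencies(self, name):
        """Return `name` plus everything reachable via declared dependencies."""
        order, stack = [], [name]
        while stack:
            current = stack.pop()
            if current not in order:
                order.append(current)
                stack.extend(self.registry[current].needed)
        return order

    def dlopen(self, name, make_global=False):
        local_names = self._with_dependencies(name)
        # Symbols visible to this open: global scope plus the local dependency tree.
        visible = set()
        for so_name in self.global_names + local_names:
            visible |= self.registry[so_name].defines
        for so_name in local_names:
            for symbol in sorted(self.registry[so_name].undefined):
                if symbol not in visible:
                    raise OSError(f"{so_name}: undefined symbol: {symbol}")
        if make_global:
            for so_name in local_names:
                if so_name not in self.global_names:
                    self.global_names.append(so_name)
        return self.registry[name]


def can_open(loader, name, **kwargs):
    """Return True if the toy dlopen succeeds, False on an undefined symbol."""
    try:
        loader.dlopen(name, **kwargs)
        return True
    except OSError:
        return False


def build_registry(fcoll_declares_dependency):
    """Build three toy objects; the flag models the presence of the fix."""
    common = "libmca_common_ompio.so"
    symbol = "mca_common_ompio_file_write"
    return {
        common: SharedObject(common, defines=frozenset({symbol})),
        "mca_io_ompio.so": SharedObject("mca_io_ompio.so", needed=(common,)),
        "mca_fcoll_individual.so": SharedObject(
            "mca_fcoll_individual.so",
            undefined=frozenset({symbol}),
            needed=(common,) if fcoll_declares_dependency else (),
        ),
    }


if __name__ == "__main__":
    # Undeclared dependency, component opened first: the symbol is not visible.
    print(can_open(ToyLoader(build_registry(False)), "mca_fcoll_individual.so"))
    # Declared dependency: the open succeeds regardless of load order.
    print(can_open(ToyLoader(build_registry(True)), "mca_fcoll_individual.so"))
```

In this model the undeclared case only succeeds if something else has already placed the common library in the global scope, which is one way the same build could behave differently from one system to another.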
Re: [OMPI devel] Shared object dependencies
No, I do not use --disable-dlopen, this is the other thing that is confusing to me, how come this error does not occur for anybody else.

Thanks
Edgar

> -----Original Message-----
> From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Jeff
> Squyres (jsquyres) via devel
> Sent: Tuesday, June 12, 2018 9:11 AM
> To: Open MPI Developers List
> Cc: Jeff Squyres (jsquyres)
> Subject: Re: [OMPI devel] Shared object dependencies
>
> How is it that Edgar is not running into these issues?
>
> Edgar: are you compiling with --disable-dlopen, perchance?
Re: [OMPI devel] Shared object dependencies
How is it that Edgar is not running into these issues?

Edgar: are you compiling with --disable-dlopen, perchance?

> On Jun 12, 2018, at 6:04 AM, Gilles Gouaillardet wrote:
>
> Edgar,
>
> Regarding this specific problem, the issue is mca_fcoll_individual.so did not
> depend on libmca_common_ompio.so,
> the PR does address that (among other abstraction violations)
>
> What about following up in github ?
>
> Cheers,
>
> Gilles
Re: [OMPI devel] Shared object dependencies
Edgar,

Regarding this specific problem, the issue is that mca_fcoll_individual.so did not depend on libmca_common_ompio.so; the PR does address that (among other abstraction violations).

What about following up in github?

Cheers,

Gilles

On Tuesday, June 12, 2018, Gabriel, Edgar wrote:
> So, I am still surprised to see this error message: if you look at, let's
> say, just one error message (and all others are the same):
>
> > > [orc-login2:107400] mca_base_component_repository_open: unable to open
> > > mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:
> > > undefined symbol: mca_common_ompio_file_write (ignored)
>
> How come the symbol mca_common_ompio_file_write cannot be found?
> It is in the common, that symbol should always be there, isn't it?
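Whether a component declares such a dependency can be checked from its dynamic section (for instance the output of `readelf -d`), where each declared dependency appears as a NEEDED entry. Below is a minimal sketch of parsing that kind of output. The sample text is hand-written for illustration and not taken from a real build, and the helper names needed_libraries and declares_dependency are made up.

```python
import re

# Matches lines such as:
#  0x0000000000000001 (NEEDED)   Shared library: [libc.so.6]
NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library:\s+\[([^\]]+)\]")


def needed_libraries(readelf_dynamic_output):
    """Return the declared dependencies listed in `readelf -d` style text."""
    return NEEDED_RE.findall(readelf_dynamic_output)


def declares_dependency(readelf_dynamic_output, soname_prefix):
    """Return True if any NEEDED entry starts with the given prefix."""
    return any(
        lib.startswith(soname_prefix)
        for lib in needed_libraries(readelf_dynamic_output)
    )


# Hand-written sample (an assumption, for illustration only).
SAMPLE = """\
Dynamic section at offset 0x2e08 contains 25 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libopen-pal.so.40]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000e (SONAME)             Library soname: [mca_fcoll_individual.so]
"""

if __name__ == "__main__":
    print(needed_libraries(SAMPLE))
    print(declares_dependency(SAMPLE, "libmca_common_ompio"))
```

A component whose NEEDED list lacks the library that defines one of its undefined symbols relies on something else having made that library visible first.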
Re: [OMPI devel] Shared object dependencies
So, I am still surprised to see this error message: if you look at, let's say, just one error message (and all others are the same):

> > [orc-login2:107400] mca_base_component_repository_open: unable to open
> > mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:
> > undefined symbol: mca_common_ompio_file_write (ignored)

How come the symbol mca_common_ompio_file_write cannot be found? It is in the common, that symbol should always be there, isn't it?

Your fix, Gilles (which we can discuss), will not address this problem in my opinion. The symbols at this point that are accessed from the ompio component are used through a function pointer, not by name, and that should work in my opinion (e.g. we do not call mca_io_ompio_set_aggregator_props directly, but we call the function pointer fh->f_set_aggregator_props), and the same with the mca parameters: we access them through a function that is stored as a function pointer on the file handle structure.

Thanks
Edgar

> -----Original Message-----
> From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Gilles
> Gouaillardet
> Sent: Tuesday, June 12, 2018 3:28 AM
> To: devel@lists.open-mpi.org
> Subject: Re: [OMPI devel] Shared object dependencies
>
> Tyson,
>
> thanks for taking the time to do some more tests.
>
> This is really a bug in Open MPI, and unlike what I thought earlier, there are
> still some abstraction violations here and there related to ompio.
>
> I filed https://github.com/open-mpi/ompi/pull/5263 in order to address them
>
> Meanwhile, you can configure Open MPI with --disable-dlopen and hopefully,
> that will be enough to hide the issue.
>
> Cheers,
>
> Gilles
Re: [OMPI devel] Shared object dependencies
Tyson,

thanks for taking the time to do some more tests.

This is really a bug in Open MPI, and unlike what I thought earlier, there are still some abstraction violations here and there related to ompio.

I filed https://github.com/open-mpi/ompi/pull/5263 in order to address them.

Meanwhile, you can configure Open MPI with --disable-dlopen and hopefully, that will be enough to hide the issue.

Cheers,

Gilles

On 6/12/2018 5:58 AM, Tyson Whitehead wrote:
> I have now also tried release 3.1.0. Same thing (where I have replaced
> /nix/store/glx60yay0hmmizhlxhqhnx9w3k4j9g1z-openmpi-3.1.0 with )
>
> [orc-login2:107400] mca_base_component_repository_open: unable to open
> mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:
> undefined symbol: mca_common_ompio_file_write (ignored)
> [orc-login2:107400] mca_base_component_repository_open: unable to open
> mca_fcoll_dynamic_gen2: .../lib/openmpi/mca_fcoll_dynamic_gen2.so:
> undefined symbol: mca_common_ompio_register_print_entry (ignored)
> [orc-login2:107400] mca_base_component_repository_open: unable to open
> mca_fcoll_dynamic: .../lib/openmpi/mca_fcoll_dynamic.so: undefined
> symbol: mca_common_ompio_register_print_entry (ignored)
> [orc-login2:107400] mca_base_component_repository_open: unable to open
> mca_fcoll_two_phase: .../lib/openmpi/mca_fcoll_two_phase.so: undefined
> symbol: mca_common_ompio_register_print_entry (ignored)
> [orc-login2:107400] mca_base_component_repository_open: unable to open
> mca_fcoll_static: .../lib/openmpi/mca_fcoll_static.so: undefined
> symbol: mca_common_ompio_register_print_entry (ignored)
> Package: Open MPI nixbld@localhost Distribution
> Open MPI: 3.1.0
> Open MPI repo revision: v3.1.0
> Open MPI release date: May 07, 2018
> Open RTE: 3.1.0
> Open RTE repo revision: v3.1.0
> Open RTE release date: May 07, 2018
> OPAL: 3.1.0
> OPAL repo revision: v3.1.0
> OPAL release date: May 07, 2018
>
> I straced the process, and, as far as I could tell, it was just mostly
> opening the shared objects in alphabetical order. Would appreciate
> any insight, such as whether this is normal behaviour I can ignore or
> not?
>
> Thanks! -Tyson
>
> On Fri, 8 Jun 2018 at 17:37, Tyson Whitehead wrote:
>> This email starts out talking about version 1.10.7 to give a complete
>> picture. I tested 2.1.3 as well; it also exhibits this issue,
>> although to a lesser extent, and I am asking for help on that
>> release.
>>
>> I was compiling the OpenMPI 1.10.7 shipped with NixOS against a newer
>> libibverbs with a large set of drivers and got some strange errors
>> when running ompi_info (I've replaced the common prefix
>> /nix/store/9zm0pqsh67fw0xi5cpnybnd7hgzryffs-openmpi-1.10.7 with ...)
>>
>> [mon241:04077] mca: base: component_find: unable to open
>> .../lib/openmpi/mca_btl_openib: .../lib/openmpi/mca_btl_openib.so:
>> undefined symbol: mca_mpool_grdma_evict (ignored)
>> [mon241:04077] mca: base: component_find: unable to open
>> .../lib/openmpi/mca_fcoll_individual:
>> .../lib/openmpi/mca_fcoll_individual.so: undefined symbol:
>> mca_io_ompio_file_write (ignored)
>> [mon241:04077] mca: base: component_find: unable to open
>> .../lib/openmpi/mca_fcoll_ylib: .../lib/openmpi/mca_fcoll_ylib.so:
>> undefined symbol: ompi_io_ompio_scatter_data (ignored)
>> [mon241:04077] mca: base: component_find: unable to open
>> .../lib/openmpi/mca_fcoll_dynamic:
>> .../lib/openmpi/mca_fcoll_dynamic.so: undefined symbol:
>> ompi_io_ompio_allgatherv_array (ignored)
>> [mon241:04077] mca: base: component_find: unable to open
>> .../lib/openmpi/mca_fcoll_two_phase:
>> .../lib/openmpi/mca_fcoll_two_phase.so: undefined symbol:
>> ompi_io_ompio_set_aggregator_props (ignored)
>> [mon241:04077] mca: base: component_find: unable to open
>> .../lib/openmpi/mca_fcoll_static:
>> .../lib/openmpi/mca_fcoll_static.so: undefined symbol:
>> ompi_io_ompio_allgather_array (ignored)
>> Package: Open MPI nixbld@ Distribution
>> Open MPI: 1.10.7
>> Open MPI repo revision: v1.10.6-48-g5e373bf
>> Open MPI release date: May 16, 2017
>> Open RTE: 1.10.7
>> Open RTE repo revision: v1.10.6-48-g5e373bf
>> Open RTE release date: May 16, 2017
>> OPAL: 1.10.7
>> OPAL repo revision: v1.10.6-48-g5e373bf
>> OPAL release date: May 16, 2017
>> ...
>>
>> I dug into the first of these (figured out what library provided it,
>> looked at the declared dependencies, poked around in the automake
>> file), and, as far as I could determine, it seems that
>> mca_btl_openib.so simply isn't linked to list mca_mpool_grdma.so
>> (which provides the symbol) as a dependency.
>>
>> Seeing as 1.10.7 is no longer supported, I figured I would try 2.1.3
>> in case this has been fixed. I compiled it up as well, and it seems
>> all but the mca_fcoll_individual one have been resolved (I've replaced
>> /nix/store/4kh0zbn8pmdqhvwagicswg70rwnpm570-openmpi-2.1.3 with ...)
>>
>> [mon241:05544] mca_base_component_repository_open: unable to open
>> mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:
>> undefined