Re: [OMPI devel] Shared object dependencies

2018-06-12 Thread Jeff Squyres (jsquyres) via devel
On Jun 12, 2018, at 7:34 AM, Gabriel, Edgar  wrote:
> 
> Well, I am still confused. What is different on nixOS vs. other linux distros 
> that makes this error appear,

Fair enough.  I don't think I realized nixOS was a Linux distro.

That being said, every time I think I understand linkers, I find out that I 
don't know jack about linkers.  :-(

> and is it relevant enough for a backport, or should we just go forward for 
> 4.0? Is it an RTLD_GLOBAL issue again, as it was back in 2014?

Yeah, we should probably figure this one out.  I don't know the answer here.

> And last but not least, I raised one serious question about the MCA 
> parameter names in the GitHub discussion.
> 
> No way for a 2.x series backport, btw.: that version did not even have 
> common/ompio yet; that was introduced in the 3.0 release. 

Good to know.

-- 
Jeff Squyres
jsquy...@cisco.com

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


Re: [OMPI devel] Shared object dependencies

2018-06-12 Thread Gabriel, Edgar
Well, I am still confused. What is different on NixOS vs. other Linux distros 
that makes this error appear, and is it relevant enough for a backport, or 
should we just go forward for 4.0? Is it an RTLD_GLOBAL issue again, as it was 
back in 2014? And last but not least, I raised one serious question about the 
MCA parameter names in the GitHub discussion.

No way for a 2.x series backport, btw.: that version did not even have 
common/ompio yet; that was introduced in the 3.0 release. 

Thanks
Edgar

> -Original Message-
> From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Jeff
> Squyres (jsquyres) via devel
> Sent: Tuesday, June 12, 2018 9:31 AM
> To: Open MPI Developers List 
> Cc: Jeff Squyres (jsquyres) 
> Subject: Re: [OMPI devel] Shared object dependencies
> 
> On Jun 12, 2018, at 7:21 AM, Gilles Gouaillardet
>  wrote:
> >
> > I think this also depends on the linker (configuration?) and possibly the
> > order in which the libraries are dlopen’ed.
> >
> > Note the issue was initially reported (as warnings only) from ompi_info, so
> > there is a possibility that we all missed it.
> >
> > That being said, the errors make perfect sense to me.
> >
> > fwiw, I installed a NixOS virtual machine and reproduced the issue right
> > away.
> 
> OIC -- right -- this was reported on NixOS, not vanilla Linux.  Ok.
> 
> These fixes will need to be back-ported to at least 3.0.x and 3.1.x, right?
> 
> Do they need to also go to v2.1.x?
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> 

Re: [OMPI devel] Shared object dependencies

2018-06-12 Thread Jeff Squyres (jsquyres) via devel
On Jun 12, 2018, at 7:21 AM, Gilles Gouaillardet 
 wrote:
> 
> I think this also depends on the linker (configuration?) and possibly the 
> order in which the libraries are dlopen’ed.
> 
> Note the issue was initially reported (as warnings only) from ompi_info, so 
> there is a possibility that we all missed it.
> 
> That being said, the errors make perfect sense to me.
> 
> fwiw, I installed a NixOS virtual machine and reproduced the issue right away.

OIC -- right -- this was reported on NixOS, not vanilla Linux.  Ok.

These fixes will need to be back-ported to at least 3.0.x and 3.1.x, right?

Do they need to also go to v2.1.x?

-- 
Jeff Squyres
jsquy...@cisco.com


Re: [OMPI devel] Shared object dependencies

2018-06-12 Thread Gilles Gouaillardet
I think this also depends on the linker (configuration?) and possibly the
order in which the libraries are dlopen’ed.

Note the issue was initially reported (as warnings only) from ompi_info, so
there is a possibility that we all missed it.

That being said, the errors make perfect sense to me.

fwiw, I installed a NixOS virtual machine and reproduced the issue right
away.
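
To illustrate why the order of the dlopen calls can matter, here is a toy
model in Python. It is only a sketch, not how the real dynamic loader or Open
MPI works internally: it assumes eager symbol binding, and the dependency
lists (including the idea that mca_io_ompio.so links against
libmca_common_ompio.so) are assumptions made up for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Lib:
    """A toy shared object: what it defines, what it needs, what it links to."""
    name: str
    defines: set = field(default_factory=set)
    undefined: set = field(default_factory=set)
    needed: list = field(default_factory=list)  # declared dependencies


class ToyLoader:
    """Very simplified model of dlopen() with eager symbol binding."""

    def __init__(self, libs):
        self.libs = {lib.name: lib for lib in libs}
        self.global_scope = set()  # symbols visible to every later dlopen

    def _closure(self, name, seen=None):
        """Return the library plus everything reachable through `needed`."""
        seen = set() if seen is None else seen
        if name not in seen:
            seen.add(name)
            for dep in self.libs[name].needed:
                self._closure(dep, seen)
        return seen

    def dlopen(self, name, make_global=False):
        """Return the unresolved symbols (an empty set means success)."""
        members = self._closure(name)
        local_scope = set()
        wanted = set()
        for member in members:
            local_scope |= self.libs[member].defines
            wanted |= self.libs[member].undefined
        missing = wanted - local_scope - self.global_scope
        if not missing and make_global:
            # Models RTLD_GLOBAL: later loads can see these symbols too.
            self.global_scope |= local_scope
        return missing


# Hypothetical dependency layout, for illustration only.
common = Lib("libmca_common_ompio.so", defines={"mca_common_ompio_file_write"})
io = Lib("mca_io_ompio.so", needed=["libmca_common_ompio.so"])
fcoll = Lib("mca_fcoll_individual.so", undefined={"mca_common_ompio_file_write"})

# Case 1: fcoll is opened first, nothing has exported the symbol yet.
loader = ToyLoader([common, io, fcoll])
first = loader.dlopen("mca_fcoll_individual.so")

# Case 2: io was opened earlier with global visibility, so fcoll resolves
# its symbol by accident.
loader = ToyLoader([common, io, fcoll])
loader.dlopen("mca_io_ompio.so", make_global=True)
second = loader.dlopen("mca_fcoll_individual.so")

# Case 3: fcoll declares its dependency, so the load order no longer matters.
fixed = Lib("mca_fcoll_individual.so",
            undefined={"mca_common_ompio_file_write"},
            needed=["libmca_common_ompio.so"])
loader = ToyLoader([common, io, fixed])
third = loader.dlopen("mca_fcoll_individual.so")

print(first)   # {'mca_common_ompio_file_write'}
print(second)  # set()
print(third)   # set()
```

In this model a component with an undeclared dependency loads only if
something else happened to export the symbol globally beforehand, which is one
way the same build could fail on one system and work on another.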

Cheers,

Gilles

On Tuesday, June 12, 2018, Gabriel, Edgar  wrote:

> No, I do not use --disable-dlopen. This is the other thing that is
> confusing to me: how come this error does not occur for anybody else?
> Thanks
> Edgar
>
> > -Original Message-
> > From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Jeff
> > Squyres (jsquyres) via devel
> > Sent: Tuesday, June 12, 2018 9:11 AM
> > To: Open MPI Developers List 
> > Cc: Jeff Squyres (jsquyres) 
> > Subject: Re: [OMPI devel] Shared object dependencies
> >
> > How is it that Edgar is not running into these issues?
> >
> > Edgar: are you compiling with --disable-dlopen, perchance?
> >
> >
> > > On Jun 12, 2018, at 6:04 AM, Gilles Gouaillardet
> >  wrote:
> > >
> > > Edgar,
> > >
> > > Regarding this specific problem, the issue is that mca_fcoll_individual.so
> > > did not depend on libmca_common_ompio.so. The PR does address that
> > > (among other abstraction violations).
> > >
> > > What about following up on GitHub?
> > >
> > > Cheers,
> > >
> > > Gilles
> > >
> > > On Tuesday, June 12, 2018, Gabriel, Edgar 
> > wrote:
> > > So, I am still surprised to see this error message. If you look at,
> > > let's say, just one error message (all the others are the same):
> > >
> > > > > [orc-login2:107400] mca_base_component_repository_open: unable to
> > > > > open
> > > > > mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:
> > > > > undefined symbol: mca_common_ompio_file_write (ignored)
> > >
> > > How come the symbol mca_common_ompio_file_write cannot be found? It is
> > > in the common library, so that symbol should always be there, shouldn't
> > > it?
> > > Your fix, Gilles (which we can discuss), will not address this problem
> > > in my opinion. The symbols from the ompio component that are accessed at
> > > this point are used through a function pointer, not by name, and that
> > > should work in my opinion (e.g., we do not call
> > > mca_io_ompio_set_aggregator_props directly, but we call the function
> > > pointer fh->f_set_aggregator_props). The same goes for the MCA
> > > parameters: we access them through a function that is stored as a
> > > function pointer on the file handle structure.
> > >
> > > Thanks
> > > Edgar
> > >
> > >
> > > > -Original Message-
> > > > From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of
> > > > Gilles Gouaillardet
> > > > Sent: Tuesday, June 12, 2018 3:28 AM
> > > > To: devel@lists.open-mpi.org
> > > > Subject: Re: [OMPI devel] Shared object dependencies
> > > >
> > > > Tyson,
> > > >
> > > >
> > > > thanks for taking the time to do some more tests.
> > > >
> > > >
> > > > This is really a bug in Open MPI, and unlike what I thought earlier,
> > > > there are still
> > > >
> > > > some abstraction violations here and there related to ompio.
> > > >
> > > >
> > > > I filed https://github.com/open-mpi/ompi/pull/5263 in order to
> > > > address them
> > > >
> > > >
> > > > Meanwhile, you can configure Open MPI with --disable-dlopen and
> > > > hopefully, that will be
> > > >
> > > > enough to hide the issue.
> > > >
> > > >
> > > > Cheers,
> > > >
> > > >
> > > > Gilles
> > > >
> > > >
> > > > On 6/12/2018 5:58 AM, Tyson Whitehead wrote:
> > > > > I have now also tried release 3.1.0.  Same thing (where I have
> > > > > replaced
> > > > > /nix/store/glx60yay0hmmizhlxhqhnx9w3k4j9g1z-openmpi-3.1.0 with
> > > > > )
> > > > >
> > > > > [orc-login2:107400] mca_base_component_repository_open: unable to
> > > > > open
> > > > > mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:
> > > > > undefined symbol: mca_common_ompio_f

Re: [OMPI devel] Shared object dependencies

2018-06-12 Thread Gabriel, Edgar
No, I do not use --disable-dlopen. This is the other thing that is confusing to 
me: how come this error does not occur for anybody else?
Thanks
Edgar

> -Original Message-
> From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Jeff
> Squyres (jsquyres) via devel
> Sent: Tuesday, June 12, 2018 9:11 AM
> To: Open MPI Developers List 
> Cc: Jeff Squyres (jsquyres) 
> Subject: Re: [OMPI devel] Shared object dependencies
> 
> How is it that Edgar is not running into these issues?
> 
> Edgar: are you compiling with --disable-dlopen, perchance?
> 
> 
> > On Jun 12, 2018, at 6:04 AM, Gilles Gouaillardet
>  wrote:
> >
> > Edgar,
> >
> > Regarding this specific problem, the issue is that mca_fcoll_individual.so
> > did not depend on libmca_common_ompio.so. The PR does address that
> > (among other abstraction violations).
> >
> > What about following up on GitHub?
> >
> > Cheers,
> >
> > Gilles
> >
> > On Tuesday, June 12, 2018, Gabriel, Edgar 
> wrote:
> > So, I am still surprised to see this error message. If you look at,
> > let's say, just one error message (all the others are the same):
> >
> > > > [orc-login2:107400] mca_base_component_repository_open: unable to
> > > > open
> > > > mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:
> > > > undefined symbol: mca_common_ompio_file_write (ignored)
> >
> > How come the symbol mca_common_ompio_file_write cannot be found? It is
> > in the common library, so that symbol should always be there, shouldn't
> > it?
> > Your fix, Gilles (which we can discuss), will not address this problem
> > in my opinion. The symbols from the ompio component that are accessed at
> > this point are used through a function pointer, not by name, and that
> > should work in my opinion (e.g., we do not call
> > mca_io_ompio_set_aggregator_props directly, but we call the function
> > pointer fh->f_set_aggregator_props). The same goes for the MCA
> > parameters: we access them through a function that is stored as a
> > function pointer on the file handle structure.
> >
> > Thanks
> > Edgar
> >
> >
> > > -Original Message-
> > > From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of
> > > Gilles Gouaillardet
> > > Sent: Tuesday, June 12, 2018 3:28 AM
> > > To: devel@lists.open-mpi.org
> > > Subject: Re: [OMPI devel] Shared object dependencies
> > >
> > > Tyson,
> > >
> > >
> > > thanks for taking the time to do some more tests.
> > >
> > >
> > > This is really a bug in Open MPI, and unlike what I thought earlier,
> > > there are still
> > >
> > > some abstraction violations here and there related to ompio.
> > >
> > >
> > > I filed https://github.com/open-mpi/ompi/pull/5263 in order to
> > > address them
> > >
> > >
> > > Meanwhile, you can configure Open MPI with --disable-dlopen and
> > > hopefully, that will be
> > >
> > > enough to hide the issue.
> > >
> > >
> > > Cheers,
> > >
> > >
> > > Gilles
> > >
> > >
> > > On 6/12/2018 5:58 AM, Tyson Whitehead wrote:
> > > > I have now also tried release 3.1.0.  Same thing (where I have
> > > > replaced
> > > > /nix/store/glx60yay0hmmizhlxhqhnx9w3k4j9g1z-openmpi-3.1.0 with
> > > > )
> > > >
> > > > [orc-login2:107400] mca_base_component_repository_open: unable to
> > > > open
> > > > mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:
> > > > undefined symbol: mca_common_ompio_file_write (ignored)
> > > > [orc-login2:107400] mca_base_component_repository_open: unable to
> > > > open
> > > > mca_fcoll_dynamic_gen2: .../lib/openmpi/mca_fcoll_dynamic_gen2.so:
> > > > undefined symbol: mca_common_ompio_register_print_entry (ignored)
> > > > [orc-login2:107400] mca_base_component_repository_open: unable to
> > > > open
> > > > mca_fcoll_dynamic: .../lib/openmpi/mca_fcoll_dynamic.so: undefined
> > > > symbol: mca_common_ompio_register_print_entry (ignored)
> > > > [orc-login2:107400] mca_base_component_repository_open: unable to
> > > > open
> > > > mca_fcoll_two_phase: .../lib/openmpi/mca_fcoll_two_phase.so:
> > > > undefined
> > > > symbol: mca_common_ompio_register_print_entry (ignored)
> > > > [orc-login2:107400] mca_base_component_repository_open

Re: [OMPI devel] Shared object dependencies

2018-06-12 Thread Jeff Squyres (jsquyres) via devel
How is it that Edgar is not running into these issues?

Edgar: are you compiling with --disable-dlopen, perchance?


> On Jun 12, 2018, at 6:04 AM, Gilles Gouaillardet 
>  wrote:
> 
> Edgar,
> 
> Regarding this specific problem, the issue is that mca_fcoll_individual.so 
> did not depend on libmca_common_ompio.so.
> The PR does address that (among other abstraction violations).
> 
> What about following up on GitHub?
> 
> Cheers,
> 
> Gilles
> 
> On Tuesday, June 12, 2018, Gabriel, Edgar  wrote:
> So, I am still surprised to see this error message. If you look at, let's 
> say, just one error message (all the others are the same):
> 
> > > [orc-login2:107400] mca_base_component_repository_open: unable to open
> > > mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:
> > > undefined symbol: mca_common_ompio_file_write (ignored)
> 
> How come the symbol mca_common_ompio_file_write cannot be found? It is in 
> the common library, so that symbol should always be there, shouldn't it? 
> Your fix, Gilles (which we can discuss), will not address this problem in my 
> opinion. The symbols from the ompio component that are accessed at this point 
> are used through a function pointer, not by name, and that should work in my 
> opinion (e.g., we do not call mca_io_ompio_set_aggregator_props directly, but 
> we call the function pointer fh->f_set_aggregator_props). The same goes for 
> the MCA parameters: we access them through a function that is stored as a 
> function pointer on the file handle structure.
> 
> Thanks
> Edgar
>  
> 
> > -Original Message-
> > From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Gilles
> > Gouaillardet
> > Sent: Tuesday, June 12, 2018 3:28 AM
> > To: devel@lists.open-mpi.org
> > Subject: Re: [OMPI devel] Shared object dependencies
> > 
> > Tyson,
> > 
> > 
> > thanks for taking the time to do some more tests.
> > 
> > 
> > This is really a bug in Open MPI, and unlike what I thought earlier, there 
> > are
> > still
> > 
> > some abstraction violations here and there related to ompio.
> > 
> > 
> > I filed https://github.com/open-mpi/ompi/pull/5263 in order to address them
> > 
> > 
> > Meanwhile, you can configure Open MPI with --disable-dlopen and hopefully,
> > that will be
> > 
> > enough to hide the issue.
> > 
> > 
> > Cheers,
> > 
> > 
> > Gilles
> > 
> > 
> > On 6/12/2018 5:58 AM, Tyson Whitehead wrote:
> > > I have now also tried release 3.1.0.  Same thing (where I have replaced
> > > /nix/store/glx60yay0hmmizhlxhqhnx9w3k4j9g1z-openmpi-3.1.0 with )
> > >
> > > [orc-login2:107400] mca_base_component_repository_open: unable to open
> > > mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:
> > > undefined symbol: mca_common_ompio_file_write (ignored)
> > > [orc-login2:107400] mca_base_component_repository_open: unable to open
> > > mca_fcoll_dynamic_gen2: .../lib/openmpi/mca_fcoll_dynamic_gen2.so:
> > > undefined symbol: mca_common_ompio_register_print_entry (ignored)
> > > [orc-login2:107400] mca_base_component_repository_open: unable to open
> > > mca_fcoll_dynamic: .../lib/openmpi/mca_fcoll_dynamic.so: undefined
> > > symbol: mca_common_ompio_register_print_entry (ignored)
> > > [orc-login2:107400] mca_base_component_repository_open: unable to open
> > > mca_fcoll_two_phase: .../lib/openmpi/mca_fcoll_two_phase.so: undefined
> > > symbol: mca_common_ompio_register_print_entry (ignored)
> > > [orc-login2:107400] mca_base_component_repository_open: unable to open
> > > mca_fcoll_static: .../lib/openmpi/mca_fcoll_static.so: undefined
> > > symbol: mca_common_ompio_register_print_entry (ignored)
> > >   Package: Open MPI nixbld@localhost Distribution
> > >  Open MPI: 3.1.0
> > >Open MPI repo revision: v3.1.0
> > > Open MPI release date: May 07, 2018
> > >      Open RTE: 3.1.0
> > >Open RTE repo revision: v3.1.0
> > > Open RTE release date: May 07, 2018
> > >  OPAL: 3.1.0
> > > OPAL repo revision: v3.1.0
> > > OPAL release date: May 07, 2018
> > >
> > > I straced the process, and, as far as I could tell, it was just mostly
> > > opening the shared objects in alphabetical order.  Would appreciate
> > > any insight, such as whether this is normal behaviour I can ignore or
> > > not?
> > >
> >

Re: [OMPI devel] Shared object dependencies

2018-06-12 Thread Gilles Gouaillardet
Edgar,

Regarding this specific problem, the issue is that mca_fcoll_individual.so did
not depend on libmca_common_ompio.so.
The PR does address that (among other abstraction violations).
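
One way to check whether a component declares such a dependency is to look at
the NEEDED entries that `readelf -d` prints for it. The sketch below parses
that text in Python. The sample text is hypothetical (it is not taken from a
real build), and the helper names are made up for this illustration.

```python
import re
import subprocess

# Matches lines such as:
#  0x0000000000000001 (NEEDED)   Shared library: [libc.so.6]
NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library:\s+\[([^\]]+)\]")


def needed_libraries(readelf_output):
    """Extract the NEEDED entries from the text printed by `readelf -d`."""
    return NEEDED_RE.findall(readelf_output)


def declares_dependency(readelf_output, library_prefix):
    """True if any NEEDED entry starts with the given library name."""
    return any(name.startswith(library_prefix)
               for name in needed_libraries(readelf_output))


def read_dynamic_section(path):
    """Run `readelf -d` on a real file (requires binutils to be installed)."""
    result = subprocess.run(["readelf", "-d", path],
                            capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical output for a component that lacks the dependency.
SAMPLE = """\
Dynamic section at offset 0x2df0 contains 24 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000e (SONAME)             Library soname: [mca_fcoll_individual.so]
"""

print(needed_libraries(SAMPLE))                               # ['libc.so.6']
print(declares_dependency(SAMPLE, "libmca_common_ompio.so"))  # False
```

On a real installation, `read_dynamic_section(path)` would supply the text for
a given component file instead of the sample.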

What about following up on GitHub?

Cheers,

Gilles

On Tuesday, June 12, 2018, Gabriel, Edgar  wrote:

> So, I am still surprised to see this error message. If you look at, let's
> say, just one error message (all the others are the same):
>
> > > [orc-login2:107400] mca_base_component_repository_open: unable to open
> > > mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:
> > > undefined symbol: mca_common_ompio_file_write (ignored)
>
> How come the symbol mca_common_ompio_file_write cannot be found?
> It is in the common library, so that symbol should always be there,
> shouldn't it?
> Your fix, Gilles (which we can discuss), will not address this problem in my
> opinion. The symbols from the ompio component that are accessed at this
> point are used through a function pointer, not by name, and that should
> work in my opinion (e.g., we do not call
> mca_io_ompio_set_aggregator_props directly, but we call the function pointer
> fh->f_set_aggregator_props). The same goes for the MCA parameters: we
> access them through a function that is stored as a function pointer on the
> file handle structure.
>
> Thanks
> Edgar
>
>
> > -Original Message-
> > From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of
> Gilles
> > Gouaillardet
> > Sent: Tuesday, June 12, 2018 3:28 AM
> > To: devel@lists.open-mpi.org
> > Subject: Re: [OMPI devel] Shared object dependencies
> >
> > Tyson,
> >
> >
> > thanks for taking the time to do some more tests.
> >
> >
> > This is really a bug in Open MPI, and unlike what I thought earlier,
> there are
> > still
> >
> > some abstraction violations here and there related to ompio.
> >
> >
> > I filed https://github.com/open-mpi/ompi/pull/5263 in order to address
> them
> >
> >
> > Meanwhile, you can configure Open MPI with --disable-dlopen and
> hopefully,
> > that will be
> >
> > enough to hide the issue.
> >
> >
> > Cheers,
> >
> >
> > Gilles
> >
> >
> > On 6/12/2018 5:58 AM, Tyson Whitehead wrote:
> > > I have now also tried release 3.1.0.  Same thing (where I have replaced
> > > /nix/store/glx60yay0hmmizhlxhqhnx9w3k4j9g1z-openmpi-3.1.0 with )
> > >
> > > [orc-login2:107400] mca_base_component_repository_open: unable to open
> > > mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:
> > > undefined symbol: mca_common_ompio_file_write (ignored)
> > > [orc-login2:107400] mca_base_component_repository_open: unable to open
> > > mca_fcoll_dynamic_gen2: .../lib/openmpi/mca_fcoll_dynamic_gen2.so:
> > > undefined symbol: mca_common_ompio_register_print_entry (ignored)
> > > [orc-login2:107400] mca_base_component_repository_open: unable to open
> > > mca_fcoll_dynamic: .../lib/openmpi/mca_fcoll_dynamic.so: undefined
> > > symbol: mca_common_ompio_register_print_entry (ignored)
> > > [orc-login2:107400] mca_base_component_repository_open: unable to open
> > > mca_fcoll_two_phase: .../lib/openmpi/mca_fcoll_two_phase.so: undefined
> > > symbol: mca_common_ompio_register_print_entry (ignored)
> > > [orc-login2:107400] mca_base_component_repository_open: unable to open
> > > mca_fcoll_static: .../lib/openmpi/mca_fcoll_static.so: undefined
> > > symbol: mca_common_ompio_register_print_entry (ignored)
> > >   Package: Open MPI nixbld@localhost Distribution
> > >  Open MPI: 3.1.0
> > >Open MPI repo revision: v3.1.0
> > > Open MPI release date: May 07, 2018
> > >      Open RTE: 3.1.0
> > >Open RTE repo revision: v3.1.0
> > > Open RTE release date: May 07, 2018
> > >  OPAL: 3.1.0
> > > OPAL repo revision: v3.1.0
> > > OPAL release date: May 07, 2018
> > >
> > > I straced the process, and, as far as I could tell, it was just mostly
> > > opening the shared objects in alphabetical order.  Would appreciate
> > > any insight, such as whether this is normal behaviour I can ignore or
> > > not?
> > >
> > > Thanks!  -Tyson
> > > On Fri, 8 Jun 2018 at 17:37, Tyson Whitehead 
> > wrote:
> > >> This email starts out talking about version 1.10.7 to give a complete
> > >> picture.  I tested 2.1.3 as well, it also exhibits this issue,
> > >

Re: [OMPI devel] Shared object dependencies

2018-06-12 Thread Gabriel, Edgar
So, I am still surprised to see this error message. If you look at, let's say, 
just one error message (all the others are the same):

> > [orc-login2:107400] mca_base_component_repository_open: unable to open
> > mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:
> > undefined symbol: mca_common_ompio_file_write (ignored)

How come the symbol mca_common_ompio_file_write cannot be found? It is in the 
common library, so that symbol should always be there, shouldn't it? 
Your fix, Gilles (which we can discuss), will not address this problem in my 
opinion. The symbols from the ompio component that are accessed at this point 
are used through a function pointer, not by name, and that should work in my 
opinion (e.g., we do not call mca_io_ompio_set_aggregator_props directly, but we 
call the function pointer fh->f_set_aggregator_props). The same goes for the 
MCA parameters: we access them through a function that is stored as a function 
pointer on the file handle structure.

Thanks
Edgar
 

> -Original Message-
> From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Gilles
> Gouaillardet
> Sent: Tuesday, June 12, 2018 3:28 AM
> To: devel@lists.open-mpi.org
> Subject: Re: [OMPI devel] Shared object dependencies
> 
> Tyson,
> 
> 
> thanks for taking the time to do some more tests.
> 
> 
> This is really a bug in Open MPI, and unlike what I thought earlier, there are
> still
> 
> some abstraction violations here and there related to ompio.
> 
> 
> I filed https://github.com/open-mpi/ompi/pull/5263 in order to address them
> 
> 
> Meanwhile, you can configure Open MPI with --disable-dlopen and hopefully,
> that will be
> 
> enough to hide the issue.
> 
> 
> Cheers,
> 
> 
> Gilles
> 
> 
> On 6/12/2018 5:58 AM, Tyson Whitehead wrote:
> > I have now also tried release 3.1.0.  Same thing (where I have replaced
> > /nix/store/glx60yay0hmmizhlxhqhnx9w3k4j9g1z-openmpi-3.1.0 with )
> >
> > [orc-login2:107400] mca_base_component_repository_open: unable to open
> > mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:
> > undefined symbol: mca_common_ompio_file_write (ignored)
> > [orc-login2:107400] mca_base_component_repository_open: unable to open
> > mca_fcoll_dynamic_gen2: .../lib/openmpi/mca_fcoll_dynamic_gen2.so:
> > undefined symbol: mca_common_ompio_register_print_entry (ignored)
> > [orc-login2:107400] mca_base_component_repository_open: unable to open
> > mca_fcoll_dynamic: .../lib/openmpi/mca_fcoll_dynamic.so: undefined
> > symbol: mca_common_ompio_register_print_entry (ignored)
> > [orc-login2:107400] mca_base_component_repository_open: unable to open
> > mca_fcoll_two_phase: .../lib/openmpi/mca_fcoll_two_phase.so: undefined
> > symbol: mca_common_ompio_register_print_entry (ignored)
> > [orc-login2:107400] mca_base_component_repository_open: unable to open
> > mca_fcoll_static: .../lib/openmpi/mca_fcoll_static.so: undefined
> > symbol: mca_common_ompio_register_print_entry (ignored)
> >   Package: Open MPI nixbld@localhost Distribution
> >  Open MPI: 3.1.0
> >Open MPI repo revision: v3.1.0
> > Open MPI release date: May 07, 2018
> >      Open RTE: 3.1.0
> >Open RTE repo revision: v3.1.0
> > Open RTE release date: May 07, 2018
> >  OPAL: 3.1.0
> > OPAL repo revision: v3.1.0
> > OPAL release date: May 07, 2018
> >
> > I straced the process, and, as far as I could tell, it was just mostly
> > opening the shared objects in alphabetical order.  Would appreciate
> > any insight, such as whether this is normal behaviour I can ignore or
> > not?
> >
> > Thanks!  -Tyson
> > On Fri, 8 Jun 2018 at 17:37, Tyson Whitehead 
> wrote:
> >> This email starts out talking about version 1.10.7 to give a complete
> >> picture.  I tested 2.1.3 as well; it also exhibits this issue,
> >> although to a lesser extent, and I am asking for help on that
> >> release.
> >>
> >> I was compiling the OpenMPI 1.10.7 shipped with NixOS against a newer
> >> libibverbs with a large set of drivers and got some strange errors
> >> when running ompi_info (I've replaced the common prefix
> >> /nix/store/9zm0pqsh67fw0xi5cpnybnd7hgzryffs-openmpi-1.10.7 with ...)
> >>
> >> [mon241:04077] mca: base: component_find: unable to open
> >> .../lib/openmpi/mca_btl_openib: .../lib/openmpi/mca_btl_openib.so:
> >> undefined symbol: mca_mpool_grdma_evict (ignored) [mon241:04077]
> mca:
> >> base: component_find: unable to open
> >> .../lib/openm

Re: [OMPI devel] Shared object dependencies

2018-06-12 Thread Gilles Gouaillardet

Tyson,

thanks for taking the time to do some more tests.

This is really a bug in Open MPI and, unlike what I thought earlier,
there are still some abstraction violations here and there related to ompio.

I filed https://github.com/open-mpi/ompi/pull/5263 in order to address them.

Meanwhile, you can configure Open MPI with --disable-dlopen and,
hopefully, that will be enough to hide the issue.

Cheers,

Gilles


On 6/12/2018 5:58 AM, Tyson Whitehead wrote:

I have now also tried release 3.1.0.  Same thing (where I have replaced
/nix/store/glx60yay0hmmizhlxhqhnx9w3k4j9g1z-openmpi-3.1.0 with )

[orc-login2:107400] mca_base_component_repository_open: unable to open
mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:
undefined symbol: mca_common_ompio_file_write (ignored)
[orc-login2:107400] mca_base_component_repository_open: unable to open
mca_fcoll_dynamic_gen2: .../lib/openmpi/mca_fcoll_dynamic_gen2.so:
undefined symbol: mca_common_ompio_register_print_entry (ignored)
[orc-login2:107400] mca_base_component_repository_open: unable to open
mca_fcoll_dynamic: .../lib/openmpi/mca_fcoll_dynamic.so: undefined
symbol: mca_common_ompio_register_print_entry (ignored)
[orc-login2:107400] mca_base_component_repository_open: unable to open
mca_fcoll_two_phase: .../lib/openmpi/mca_fcoll_two_phase.so: undefined
symbol: mca_common_ompio_register_print_entry (ignored)
[orc-login2:107400] mca_base_component_repository_open: unable to open
mca_fcoll_static: .../lib/openmpi/mca_fcoll_static.so: undefined
symbol: mca_common_ompio_register_print_entry (ignored)
                 Package: Open MPI nixbld@localhost Distribution
                Open MPI: 3.1.0
  Open MPI repo revision: v3.1.0
   Open MPI release date: May 07, 2018
                Open RTE: 3.1.0
  Open RTE repo revision: v3.1.0
   Open RTE release date: May 07, 2018
                    OPAL: 3.1.0
      OPAL repo revision: v3.1.0
       OPAL release date: May 07, 2018

I straced the process, and, as far as I could tell, it was just mostly
opening the shared objects in alphabetical order.  Would appreciate
any insight, such as whether this is normal behaviour I can ignore or
not?

Thanks!  -Tyson
On Fri, 8 Jun 2018 at 17:37, Tyson Whitehead  wrote:

This email starts out talking about version 1.10.7 to give a complete
picture.  I tested 2.1.3 as well; it also exhibits this issue,
although to a lesser extent, and I am asking for help on that
release.

I was compiling the OpenMPI 1.10.7 shipped with NixOS against a newer
libibverbs with a large set of drivers and got some strange errors
when running ompi_info (I've replaced the common prefix
/nix/store/9zm0pqsh67fw0xi5cpnybnd7hgzryffs-openmpi-1.10.7 with ...)

[mon241:04077] mca: base: component_find: unable to open
.../lib/openmpi/mca_btl_openib: .../lib/openmpi/mca_btl_openib.so:
undefined symbol: mca_mpool_grdma_evict (ignored)
[mon241:04077] mca: base: component_find: unable to open
.../lib/openmpi/mca_fcoll_individual:
.../lib/openmpi/mca_fcoll_individual.so: undefined symbol:
mca_io_ompio_file_write (ignored)
[mon241:04077] mca: base: component_find: unable to open
.../lib/openmpi/mca_fcoll_ylib: .../lib/openmpi/mca_fcoll_ylib.so:
undefined symbol: ompi_io_ompio_scatter_data (ignored)
[mon241:04077] mca: base: component_find: unable to open
.../lib/openmpi/mca_fcoll_dynamic:
.../lib/openmpi/mca_fcoll_dynamic.so: undefined symbol:
ompi_io_ompio_allgatherv_array (ignored)
[mon241:04077] mca: base: component_find: unable to open
.../lib/openmpi/mca_fcoll_two_phase:
.../lib/openmpi/mca_fcoll_two_phase.so: undefined symbol:
ompi_io_ompio_set_aggregator_props (ignored)
[mon241:04077] mca: base: component_find: unable to open
.../lib/openmpi/mca_fcoll_static: .../lib/openmpi/mca_fcoll_static.so:
undefined symbol: ompi_io_ompio_allgather_array (ignored)
                 Package: Open MPI nixbld@ Distribution
                Open MPI: 1.10.7
  Open MPI repo revision: v1.10.6-48-g5e373bf
   Open MPI release date: May 16, 2017
                Open RTE: 1.10.7
  Open RTE repo revision: v1.10.6-48-g5e373bf
   Open RTE release date: May 16, 2017
                    OPAL: 1.10.7
      OPAL repo revision: v1.10.6-48-g5e373bf
       OPAL release date: May 16, 2017
...

I dug into the first of these (figured out what library provided it,
looked at the declared dependencies, poked around in the automake
file), and, as far as I could determine, it seems that
mca_btl_openib.so simply isn't linked to list mca_mpool_grdma.so
(which provides the symbol) as a dependency.

Seeing as 1.10.7 is no longer supported, I figured I would try 2.1.3
in case this has been fixed.  I compiled it up as well, and it seems
all but the mca_fcoll_individual one have been resolved (I've replaced
/nix/store/4kh0zbn8pmdqhvwagicswg70rwnpm570-openmpi-2.1.3 with ...)

[mon241:05544] mca_base_component_repository_open: unable to open
mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:

Re: [OMPI devel] Shared object dependencies

2018-06-11 Thread Tyson Whitehead
I have now also tried release 3.1.0.  Same thing (where I have replaced
/nix/store/glx60yay0hmmizhlxhqhnx9w3k4j9g1z-openmpi-3.1.0 with )

[orc-login2:107400] mca_base_component_repository_open: unable to open
mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:
undefined symbol: mca_common_ompio_file_write (ignored)
[orc-login2:107400] mca_base_component_repository_open: unable to open
mca_fcoll_dynamic_gen2: .../lib/openmpi/mca_fcoll_dynamic_gen2.so:
undefined symbol: mca_common_ompio_register_print_entry (ignored)
[orc-login2:107400] mca_base_component_repository_open: unable to open
mca_fcoll_dynamic: .../lib/openmpi/mca_fcoll_dynamic.so: undefined
symbol: mca_common_ompio_register_print_entry (ignored)
[orc-login2:107400] mca_base_component_repository_open: unable to open
mca_fcoll_two_phase: .../lib/openmpi/mca_fcoll_two_phase.so: undefined
symbol: mca_common_ompio_register_print_entry (ignored)
[orc-login2:107400] mca_base_component_repository_open: unable to open
mca_fcoll_static: .../lib/openmpi/mca_fcoll_static.so: undefined
symbol: mca_common_ompio_register_print_entry (ignored)
                 Package: Open MPI nixbld@localhost Distribution
                Open MPI: 3.1.0
  Open MPI repo revision: v3.1.0
   Open MPI release date: May 07, 2018
                Open RTE: 3.1.0
  Open RTE repo revision: v3.1.0
   Open RTE release date: May 07, 2018
                    OPAL: 3.1.0
      OPAL repo revision: v3.1.0
       OPAL release date: May 07, 2018

I straced the process, and, as far as I could tell, it was just mostly
opening the shared objects in alphabetical order.  Would appreciate
any insight, such as whether this is normal behaviour I can ignore or
not?
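
To see at a glance which components fail and on which symbol, messages like
the ones above can be grouped with a short script. This is only a sketch: the
regular expression is written against the 3.1.0 message format shown in this
email (wrapped across lines as it appears here) and is an assumption, not
something Open MPI provides.

```python
import re
from collections import defaultdict

# Matches "unable to open <component>: <path>: undefined symbol: <sym> (ignored)",
# allowing the line wrapping introduced by the mail client.
ERROR_RE = re.compile(
    r"unable\s+to\s+open\s+(?P<component>\S+):\s+\S+:\s+"
    r"undefined\s+symbol:\s+(?P<symbol>\S+)\s+\(ignored\)"
)


def missing_symbols(log_text):
    """Map each missing symbol to the components that failed because of it."""
    by_symbol = defaultdict(list)
    for match in ERROR_RE.finditer(log_text):
        by_symbol[match["symbol"]].append(match["component"])
    return dict(by_symbol)


# First three messages from the output above.
LOG = """\
[orc-login2:107400] mca_base_component_repository_open: unable to open
mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:
undefined symbol: mca_common_ompio_file_write (ignored)
[orc-login2:107400] mca_base_component_repository_open: unable to open
mca_fcoll_dynamic_gen2: .../lib/openmpi/mca_fcoll_dynamic_gen2.so:
undefined symbol: mca_common_ompio_register_print_entry (ignored)
[orc-login2:107400] mca_base_component_repository_open: unable to open
mca_fcoll_dynamic: .../lib/openmpi/mca_fcoll_dynamic.so: undefined
symbol: mca_common_ompio_register_print_entry (ignored)
"""

summary = missing_symbols(LOG)
for symbol, components in summary.items():
    print(symbol, "<-", ", ".join(components))
```

Every missing symbol in this output starts with mca_common_ompio_, which
suggests that the failing fcoll components all lack the same dependency.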

Thanks!  -Tyson
On Fri, 8 Jun 2018 at 17:37, Tyson Whitehead  wrote:
>
> This email starts out talking about version 1.10.7 to give a complete
> picture.  I tested 2.1.3 as well; it also exhibits this issue,
> although to a lesser extent, and I am asking for help on that
> release.
>
> I was compiling the OpenMPI 1.10.7 shipped with NixOS against a newer
> libibverbs with a large set of drivers and got some strange errors
> when running ompi_info (I've replaced the common prefix
> /nix/store/9zm0pqsh67fw0xi5cpnybnd7hgzryffs-openmpi-1.10.7 with ...)
>
> [mon241:04077] mca: base: component_find: unable to open
> .../lib/openmpi/mca_btl_openib: .../lib/openmpi/mca_btl_openib.so:
> undefined symbol: mca_mpool_grdma_evict (ignored)
> [mon241:04077] mca: base: component_find: unable to open
> .../lib/openmpi/mca_fcoll_individual:
> .../lib/openmpi/mca_fcoll_individual.so: undefined symbol:
> mca_io_ompio_file_write (ignored)
> [mon241:04077] mca: base: component_find: unable to open
> .../lib/openmpi/mca_fcoll_ylib: .../lib/openmpi/mca_fcoll_ylib.so:
> undefined symbol: ompi_io_ompio_scatter_data (ignored)
> [mon241:04077] mca: base: component_find: unable to open
> .../lib/openmpi/mca_fcoll_dynamic:
> .../lib/openmpi/mca_fcoll_dynamic.so: undefined symbol:
> ompi_io_ompio_allgatherv_array (ignored)
> [mon241:04077] mca: base: component_find: unable to open
> .../lib/openmpi/mca_fcoll_two_phase:
> .../lib/openmpi/mca_fcoll_two_phase.so: undefined symbol:
> ompi_io_ompio_set_aggregator_props (ignored)
> [mon241:04077] mca: base: component_find: unable to open
> .../lib/openmpi/mca_fcoll_static: .../lib/openmpi/mca_fcoll_static.so:
> undefined symbol: ompi_io_ompio_allgather_array (ignored)
>  Package: Open MPI nixbld@ Distribution
>Open MPI: 1.10.7
>  Open MPI repo revision: v1.10.6-48-g5e373bf
>   Open MPI release date: May 16, 2017
>Open RTE: 1.10.7
>  Open RTE repo revision: v1.10.6-48-g5e373bf
>   Open RTE release date: May 16, 2017
>OPAL: 1.10.7
>  OPAL repo revision: v1.10.6-48-g5e373bf
>   OPAL release date: May 16, 2017
> ...
>
> I dug into the first of these (figured out what library provided it,
> looked at the declared dependencies, poked around in the automake
> file) , and, as far as I could determine, it seems that
> mca_btl_openib.so simply isn't linked to list mca_mpool_grdma.so
> (which provides the symbol) as a dependency.
>
> Seeing as 1.10.7 is no longer supported.  I figured I would try 2.1.3
> in case this has been fixed.  I compiled it up as well, and it seems
> all but the mca_fcoll_individual one have been resolved (I've replaced
> /nix/store/4kh0zbn8pmdqhvwagicswg70rwnpm570-openmpi-2.1.3 with ...)
>
> [mon241:05544] mca_base_component_repository_open: unable to open
> mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:
> undefined symbol: ompio_io_ompio_file_read (ignored)
>  Package: Open MPI nixbld@ Distribution
>Open MPI: 2.1.3
>  Open MPI repo revision: v2.1.2-129-gcfd8f3f
>   Open MPI release date: Mar 13, 2018
>Open RTE: 2.1.3
>  Open RTE repo revision: v2.1.2-129-gcfd8f3f
>   Open RTE release date: Mar 13, 2018
>OPAL: 2.1.3
>  

Re: [OMPI devel] Shared object dependencies

2018-06-10 Thread Gilles Gouaillardet
Edgar,

I checked the various release branches, and I think this issue was
fixed by 
https://github.com/open-mpi/ompi/commit/ccf76b779130e065de326f71fe6bac868c565300

This was back-ported into the v3.0.x branch, and that was before the
v3.1.x branch was created.

This has *not* been backported into the v2.x series; as far as I am
concerned, backporting it would fix the abstraction violation I
mentioned earlier.

I noted the fcoll framework is opened in mca_io_base_file_select(), so
another way (a bit convoluted imho, but one that could require fewer
changes) could be to open the framework in the io/ompio component.


Cheers,

Gilles

Re: [OMPI devel] Shared object dependencies

2018-06-08 Thread Gabriel, Edgar
I wanted to add one item before I forget (although I agree with what Jeff 
said): the error messages shown remind me of the problem that we had with 
ompio in the 1.8/1.10 series when the RTLD_GLOBAL option was not correctly set. 
However, that was fixed in the 2.0 series and going forward, so if it shows 
up with later releases, it might be an indication of something else.
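For readers unfamiliar with the flag, here is a rough sketch of the general dlopen() behaviour being referred to, written with Python's ctypes.  It is not Open MPI's actual loader; open_component and its loader parameter are hypothetical names introduced only for illustration.

```python
import ctypes


def open_component(component_path, provider_paths, loader=ctypes.CDLL):
    """Load provider libraries globally, then the component that needs them.

    `loader` is injectable so the ordering can be exercised without real
    shared objects; by default it is ctypes.CDLL, which wraps dlopen().
    """
    handles = []
    for path in provider_paths:
        # RTLD_GLOBAL makes the provider's symbols visible to objects
        # loaded afterwards, so a component that does not list the
        # provider as a dependency can still resolve symbols from it.
        handles.append(loader(path, mode=ctypes.RTLD_GLOBAL))
    # The component itself does not need to export anything globally.
    component = loader(component_path, mode=ctypes.RTLD_LOCAL)
    return handles, component
```

If the provider were opened with RTLD_LOCAL instead, a later component relying on its symbols without declaring the dependency would typically fail with an "undefined symbol" error like the ones reported in this thread.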

Edgar 


Re: [OMPI devel] Shared object dependencies

2018-06-08 Thread gilles
Assuming this is a fresh install, there is, at first glance, an 
abstraction violation here:

fcoll/individual should not directly call a function of io/ompio
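
One way to spot this kind of component-to-component call is to compare each component's undefined symbols against what the other components export.  A small sketch follows; the symbol tables are hypothetical (the names are borrowed from the error output in this thread) and cross_component_calls is not an Open MPI tool.

```python
def cross_component_calls(undefined, defined):
    """Map each component to the other components whose symbols it needs.

    undefined: {component: set of symbols it leaves undefined}
    defined:   {component: set of symbols it exports}
    """
    # Invert the export table: symbol -> component that provides it.
    owner = {sym: comp for comp, syms in defined.items() for sym in syms}
    report = {}
    for comp, syms in undefined.items():
        needed = {owner[s] for s in syms if s in owner and owner[s] != comp}
        if needed:
            report[comp] = sorted(needed)
    return report


# Hypothetical tables; symbols nobody in the table exports (e.g. malloc)
# are ignored because they come from ordinary libraries.
undefined = {
    "mca_fcoll_individual": {"mca_io_ompio_file_write", "malloc"},
    "mca_io_ompio": {"malloc"},
}
defined = {"mca_io_ompio": {"mca_io_ompio_file_write"}}

print(cross_component_calls(undefined, defined))
```

Any entry in the result is a component reaching directly into another component rather than into a shared common library.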

The NixOS linker might be very strict, which could explain why this 
mistake went unnoticed until now.

Would you mind giving the latest Open MPI 3.1 a try?

Cheers,

Gilles


Re: [OMPI devel] Shared object dependencies

2018-06-08 Thread Jeff Squyres (jsquyres) via devel
Before digging any deeper, did you perchance install multiple versions of Open 
MPI into the same prefix?

If so, remember that Open MPI installs lots of plugins.  The exact set of 
plugins changes every release.  So if you install version A.B.C into 
/opt/openmpi, and then install version X.Y.Z into /opt/openmpi, note that the 
installation of X.Y.Z did not *uninstall* A.B.C first.  Hence, you might still 
have some stale A.B.C components in the tree that Open MPI X.Y.Z may try to 
open.  Since the underlying libraries that these plugins use have now been 
upgraded to X.Y.Z, the stale A.B.C component may (and likely will) fail to open.
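
A quick way to check for leftovers is to compare what is in the plugin directory with what the current release is expected to install.  The sketch below assumes you already have that expected list from somewhere (for example, from a fresh install into an empty prefix); stale_components and the file names are made up for illustration.

```python
import os
import tempfile


def stale_components(plugin_dir, expected):
    """Return .so files in plugin_dir that are not in the expected set."""
    present = {n for n in os.listdir(plugin_dir) if n.endswith(".so")}
    return sorted(present - set(expected))


# Demonstration with a throwaway directory and hypothetical file names.
with tempfile.TemporaryDirectory() as d:
    for name in ("mca_example_current.so", "mca_example_old.so"):
        open(os.path.join(d, name), "w").close()
    print(stale_components(d, expected={"mca_example_current.so"}))
```

Anything reported this way is a candidate stale A.B.C component; an empty list suggests the prefix only holds the current release's plugins.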

If that's not what is happening, let us know and we can dig deeper.



[OMPI devel] Shared object dependencies

2018-06-08 Thread Tyson Whitehead
This email starts out talking about version 1.10.7 to give a complete
picture.  I tested 2.1.3 as well; it also exhibits this issue, although
to a lesser extent, and that is the release I am asking for help with.

I was compiling the OpenMPI 1.10.7 shipped with NixOS against a newer
libibverbs with a large set of drivers and got some strange errors
when running ompi_info (I've replaced the common prefix
/nix/store/9zm0pqsh67fw0xi5cpnybnd7hgzryffs-openmpi-1.10.7 with ...)

[mon241:04077] mca: base: component_find: unable to open
.../lib/openmpi/mca_btl_openib: .../lib/openmpi/mca_btl_openib.so:
undefined symbol: mca_mpool_grdma_evict (ignored)
[mon241:04077] mca: base: component_find: unable to open
.../lib/openmpi/mca_fcoll_individual:
.../lib/openmpi/mca_fcoll_individual.so: undefined symbol:
mca_io_ompio_file_write (ignored)
[mon241:04077] mca: base: component_find: unable to open
.../lib/openmpi/mca_fcoll_ylib: .../lib/openmpi/mca_fcoll_ylib.so:
undefined symbol: ompi_io_ompio_scatter_data (ignored)
[mon241:04077] mca: base: component_find: unable to open
.../lib/openmpi/mca_fcoll_dynamic:
.../lib/openmpi/mca_fcoll_dynamic.so: undefined symbol:
ompi_io_ompio_allgatherv_array (ignored)
[mon241:04077] mca: base: component_find: unable to open
.../lib/openmpi/mca_fcoll_two_phase:
.../lib/openmpi/mca_fcoll_two_phase.so: undefined symbol:
ompi_io_ompio_set_aggregator_props (ignored)
[mon241:04077] mca: base: component_find: unable to open
.../lib/openmpi/mca_fcoll_static: .../lib/openmpi/mca_fcoll_static.so:
undefined symbol: ompi_io_ompio_allgather_array (ignored)
 Package: Open MPI nixbld@ Distribution
   Open MPI: 1.10.7
 Open MPI repo revision: v1.10.6-48-g5e373bf
  Open MPI release date: May 16, 2017
   Open RTE: 1.10.7
 Open RTE repo revision: v1.10.6-48-g5e373bf
  Open RTE release date: May 16, 2017
   OPAL: 1.10.7
 OPAL repo revision: v1.10.6-48-g5e373bf
  OPAL release date: May 16, 2017
...

I dug into the first of these (figured out what library provided it,
looked at the declared dependencies, poked around in the automake
file), and, as far as I could determine, it seems that
mca_btl_openib.so simply isn't linked to list mca_mpool_grdma.so
(which provides the symbol) as a dependency.
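
The check described above can be sketched as follows.  This is only an illustration: it parses text in the usual layout of `readelf -d` and `nm -D` output, the sample text is hand-written rather than captured from the build in question, and the helper names are made up.

```python
import re


def needed_libraries(readelf_dynamic_text):
    """Extract the declared dependency names from `readelf -d` output."""
    return re.findall(r"\(NEEDED\)\s+Shared library: \[([^\]]+)\]",
                      readelf_dynamic_text)


def undefined_symbols(nm_text):
    """Extract undefined ('U') symbol names from `nm -D` output."""
    symbols = []
    for line in nm_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "U":
            # Drop any "@VERSION" suffix that nm may append.
            symbols.append(parts[1].split("@")[0])
    return symbols


# Hand-written sample text for illustration only.
readelf_sample = """
 0x0000000000000001 (NEEDED)             Shared library: [libibverbs.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
"""
nm_sample = """
                 U malloc@GLIBC_2.2.5
                 U mca_mpool_grdma_evict
0000000000001130 T mca_btl_openib_example
"""

print(needed_libraries(readelf_sample))
print(undefined_symbols(nm_sample))
```

A symbol that is undefined in the component, and whose provider is not among the declared dependencies, has to be satisfied by something already loaded into the process.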

Seeing as 1.10.7 is no longer supported, I figured I would try 2.1.3
in case this has been fixed.  I compiled it as well, and it seems
all but the mca_fcoll_individual one have been resolved (I've replaced
/nix/store/4kh0zbn8pmdqhvwagicswg70rwnpm570-openmpi-2.1.3 with ...)

[mon241:05544] mca_base_component_repository_open: unable to open
mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:
undefined symbol: ompio_io_ompio_file_read (ignored)
 Package: Open MPI nixbld@ Distribution
   Open MPI: 2.1.3
 Open MPI repo revision: v2.1.2-129-gcfd8f3f
  Open MPI release date: Mar 13, 2018
   Open RTE: 2.1.3
 Open RTE repo revision: v2.1.2-129-gcfd8f3f
  Open RTE release date: Mar 13, 2018
   OPAL: 2.1.3
 OPAL repo revision: v2.1.2-129-gcfd8f3f
  OPAL release date: Mar 13, 2018
...

Again I was able to find this symbol in the mca_io_ompio.so library.
I looked through the source again, and it seems pretty clear that the
function is indeed called, but the library isn't linked to list the
mca_io_ompio.so library as a dependency.

Looking through the various shared libraries in the .../lib/openmpi
directory, though, it seems none of them have dependencies on each
other.  How is this supposed to work?  Is the component library just
supposed to load everything so all symbols get resolved?  Is the
error I'm seeing above actually an error, then?

Any insight would be appreciated.

Thanks!  -Tyson

PS:  Please note that the openmpi code was compiled without any
patches and without any special configure flags other than
--prefix= (NixOS also adds --disable-static and
--disable-dependency-tracking by default, but I removed those; it
didn't make a difference).