Re: [OMPI devel] ARM failure on PR to master

2018-06-10 Thread r...@open-mpi.org
Now moved to https://github.com/open-mpi/ompi/pull/5258 - same error
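
One way to narrow down the "vsetenv PMIX_SERVER_TMPDIR failed" lines in the quoted output below is to look at the errno that setenv(3) reports. The following is only a minimal sketch, assuming the message ultimately comes from a failing setenv(3) call (the thread does not confirm this) and assuming a POSIX libc. The helper name try_setenv is hypothetical and is not part of Open MPI or PMIx.

```python
import ctypes
import errno

# Assumption: a POSIX libc that exports setenv(3). CDLL(None) exposes the
# symbols already loaded into the running process.
_libc = ctypes.CDLL(None, use_errno=True)
_libc.setenv.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_int]
_libc.setenv.restype = ctypes.c_int


def try_setenv(name, value):
    """Call setenv(3) directly; return None on success, else the errno name."""
    ctypes.set_errno(0)
    rc = _libc.setenv(name.encode(), value.encode(), 1)
    if rc == 0:
        return None
    err = ctypes.get_errno()
    # Map the numeric errno to its symbolic name (e.g. EINVAL, ENOMEM).
    return errno.errorcode.get(err, str(err))


if __name__ == "__main__":
    # Hypothetical value; the real path used by the CI job is not in the log.
    print(try_setenv("PMIX_SERVER_TMPDIR", "/tmp"))
```

POSIX documents EINVAL (empty name or a name containing '=') and ENOMEM as the failure cases for setenv(3), so printing the errno on the ARM CI host would at least show which of the two applies.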


> On Jun 8, 2018, at 9:04 PM, r...@open-mpi.org wrote:
> 
> Can someone who knows/cares about ARM perhaps take a look at PR
> https://github.com/open-mpi/ompi/pull/5247? I’m hitting an error in the ARM
> CI tests that I can’t understand:
> 
> --> Running example: hello_c
> --
> Failed to create a completion queue (CQ):
> 
> Hostname: juno001
> Requested CQE: 16384
> Error:Cannot allocate memory
> 
> Check the CQE attribute.
> --
> --
> Open MPI has detected that there are UD-capable Verbs devices on your
> system, but none of them were able to be setup properly.  This may
> indicate a problem on this system.
> 
> You job will continue, but Open MPI will ignore the "ud" oob component
> in this run.
> 
> Hostname: juno001
> --
> vsetenv PMIX_SERVER_TMPDIR failed
> vsetenv PMIX_SERVER_TMPDIR failed
> vsetenv PMIX_SERVER_TMPDIR failed
> vsetenv PMIX_SERVER_TMPDIR failed
> 
> I get the UD error - that has been around for years since nobody seems to 
> care about or maintain the ud/oob component. What I don’t understand is why 
> setting an envar would fail solely in the ARM environment.
> 
> Could someone maybe at least provide a hint as to what is going on?
> 
> Thanks
> Ralph
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel


Re: [OMPI devel] Shared object dependencies

2018-06-10 Thread Gilles Gouaillardet
Edgar,

I checked the various release branches, and I think this issue was
fixed by 
https://github.com/open-mpi/ompi/commit/ccf76b779130e065de326f71fe6bac868c565300

This was back-ported into the v3.0.x branch, and that was before the
v3.1.x branch was created.

This has *not* been backported into the v2.x series. As far as I am
concerned, backporting it would fix the abstraction violation I
mentioned earlier.

I noted the fcoll framework is opened in mca_io_base_file_select(), so
another way (a bit convoluted imho, but it could require fewer changes)
could be to open the framework in the io/ompio component.


Cheers,

Gilles
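
To see at a glance which components fail and which symbol each one is missing, the component_find messages quoted further down can be reduced to (component, symbol) pairs. This is a minimal sketch only: the function name missing_symbols is hypothetical, and the regular expression assumes the exact message layout shown in the quoted ompi_info output (including mail quote markers and line wrapping).

```python
import re

# Matches: "unable to open <path>: <path>.so: undefined symbol: <symbol>"
_PATTERN = re.compile(
    r"unable to open\s+(\S+?):\s+\S+:\s+undefined symbol:\s+(\S+)"
)


def missing_symbols(text):
    """Return a list of (component, missing_symbol) pairs found in text."""
    # Drop mail quote markers ("> > ") and undo the line wrapping.
    lines = [re.sub(r"^(?:>\s?)+", "", line) for line in text.splitlines()]
    flat = " ".join(lines)
    pairs = []
    for path, symbol in _PATTERN.findall(flat):
        # Keep only the component name, e.g. "mca_btl_openib".
        pairs.append((path.rsplit("/", 1)[-1], symbol))
    return pairs
```

Feeding it the quoted log should pair each fcoll component with the ompio symbol it expects, which makes it easier to check whether a given release still shows the dependency problem.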
On Sat, Jun 9, 2018 at 7:59 AM Gabriel, Edgar  wrote:
>
> I wanted to add one item before I forget (although I agree with what Jeff
> said): The error messages shown remind me of the problem that we had with
> ompio in the 1.8/1.10 series when the RTLD_GLOBAL option was not correctly
> set. However, that was fixed in the 2.0 series and going forward, so if that
> shows up with later releases, it might be an indication of something else.
>
> Edgar
>
> > -Original Message-
> > From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Jeff
> > Squyres (jsquyres) via devel
> > Sent: Friday, June 8, 2018 4:54 PM
> > To: Open MPI Developers List 
> > Cc: Jeff Squyres (jsquyres) 
> > Subject: Re: [OMPI devel] Shared object dependencies
> >
> > Before digging any deeper, did you perchance install multiple versions of
> > Open MPI into the same prefix?
> >
> > If so, remember that Open MPI installs lots of plugins.  The exact set of
> > plugins changes every release.  So if you install version A.B.C into
> > /opt/openmpi, and then install version X.Y.Z into /opt/openmpi, note that
> > the installation of X.Y.Z did not *uninstall* A.B.C first.  Hence, you
> > might still have some stale A.B.C components in the tree that Open MPI
> > X.Y.Z may try to open.  Since the underlying libraries that these plugins
> > use have now been upgraded to X.Y.Z, the stale A.B.C component may (and
> > likely will) fail to open.
> >
> > If that's not what is happening, let us know and we can dig deeper.
> >
> >
> > > On Jun 8, 2018, at 5:37 PM, Tyson Whitehead wrote:
> > >
> > > > This email starts out talking about version 1.10.7 to give a complete
> > > > picture.  I tested 2.1.3 as well; it also exhibits this issue,
> > > > although to a lesser extent, and I am asking for help on that
> > > > release.
> > > >
> > > > I was compiling the OpenMPI 1.10.7 shipped with NixOS against a newer
> > > > libibverbs with a large set of drivers and got some strange errors
> > > > when running ompi_info (I've replaced the common prefix
> > > > /nix/store/9zm0pqsh67fw0xi5cpnybnd7hgzryffs-openmpi-1.10.7 with ...)
> > >
> > > > [mon241:04077] mca: base: component_find: unable to open
> > > > .../lib/openmpi/mca_btl_openib: .../lib/openmpi/mca_btl_openib.so:
> > > > undefined symbol: mca_mpool_grdma_evict (ignored)
> > > > [mon241:04077] mca: base: component_find: unable to open
> > > > .../lib/openmpi/mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:
> > > > undefined symbol: mca_io_ompio_file_write (ignored)
> > > > [mon241:04077] mca: base: component_find: unable to open
> > > > .../lib/openmpi/mca_fcoll_ylib: .../lib/openmpi/mca_fcoll_ylib.so:
> > > > undefined symbol: ompi_io_ompio_scatter_data (ignored)
> > > > [mon241:04077] mca: base: component_find: unable to open
> > > > .../lib/openmpi/mca_fcoll_dynamic: .../lib/openmpi/mca_fcoll_dynamic.so:
> > > > undefined symbol: ompi_io_ompio_allgatherv_array (ignored)
> > > > [mon241:04077] mca: base: component_find: unable to open
> > > > .../lib/openmpi/mca_fcoll_two_phase: .../lib/openmpi/mca_fcoll_two_phase.so:
> > > > undefined symbol: ompi_io_ompio_set_aggregator_props (ignored)
> > > > [mon241:04077] mca: base: component_find: unable to open
> > > > .../lib/openmpi/mca_fcoll_static: .../lib/openmpi/mca_fcoll_static.so:
> > > > undefined symbol: ompi_io_ompio_allgather_array (ignored)
> > > > Package: Open MPI nixbld@ Distribution
> > > > Open MPI: 1.10.7
> > > > Open MPI repo revision: v1.10.6-48-g5e373bf
> > > > Open MPI release date: May 16, 2017
> > > > Open RTE: 1.10.7
> > > > Open RTE repo revision: v1.10.6-48-g5e373bf
> > > > Open RTE release date: May 16, 2017
> > > > OPAL: 1.10.7
> > > > OPAL repo revision: v1.10.6-48-g5e373bf
> > > > OPAL release date: May 16, 2017
> > > ...
> > >
> > > > I dug into the first of these (figured out what library provided it,
> > > > looked at the declared dependencies, poked around in the automake
> > > > file), and, as far as I could determine, it seems that
> > > > mca_btl_openib.so simply isn't linked to list mca_mpool_grdma.so
> > > > (which provides the symbol) as a dependency.
> > >
> > > > Seeing as 1.10.7 is no longer supported, I figured I would try 2.1.3
> > > > in case this has been fixed.  I compiled it up as well, and it seems
> > > all but the mca_fc