Re: [easybuild] Re: UCX ibv_create_cq and UCP worker errors on nodes with EL8 OS and Omni-Path fabric

2021-12-03 Thread Bart Oldeman
Hi Ole,

we found that UCX is neither very useful nor performant on OmniPath, so if
your build isn't shared between InfiniBand and OmniPath clusters you can
compile Open MPI using "eb --filter-deps=UCX ...".
Open MPI works well there either using libpsm2 directly (using the "cm" pml
and "psm2" mtl), or via libfabric (using the same "cm" pml and the "ofi"
mtl).

We use the same Open MPI binaries on multiple clusters but set this on
OmniPath:
OMPI_MCA_btl='^openib'
OMPI_MCA_osc='^ucx'
OMPI_MCA_pml='^ucx'
to disable UCX and openib at runtime. If you include UCX in EB's OpenMPI it
will not compile in "openib" so the first one of those three would not be
needed.
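
A minimal sketch of applying the three runtime settings above from a Python
launcher script. The helper name omnipath_env is made up for illustration and
is not part of Open MPI or EasyBuild; only the variable names and values come
from the settings listed above.

```python
import os

# The three MCA settings listed above; a leading "^" excludes that component.
OMNIPATH_MCA = {
    "OMPI_MCA_btl": "^openib",
    "OMPI_MCA_osc": "^ucx",
    "OMPI_MCA_pml": "^ucx",
}


def omnipath_env(base_env=None):
    """Return a copy of base_env (default: os.environ) with UCX and openib disabled."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(OMNIPATH_MCA)
    return env
```

The result can be passed as the env argument of subprocess.run when starting
mpirun, so that the settings only affect jobs launched on the OmniPath nodes.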

Regards,
Bart


Re: [easybuild] Re: UCX ibv_create_cq and UCP worker errors on nodes with EL8 OS and Omni-Path fabric

2021-12-03 Thread Ole Holm Nielsen

Hi Åke,

On 12/3/21 08:27, Åke Sandgren wrote:
>> On 02-12-2021 14:18, Åke Sandgren wrote:
>>> On 12/2/21 2:06 PM, Ole Holm Nielsen wrote:
>>>> These are updated observations of running OpenMPI codes with an
>>>> Omni-Path network fabric on AlmaLinux 8.5::
>>>>
>>>> Using the foss-2021b toolchain and OpenMPI/4.1.1-GCC-11.2.0 my trivial
>>>> MPI test code works correctly:
>>>>
>>>> $ ml OpenMPI
>>>> $ ml
>>>>
>>>> Currently Loaded Modules:
>>>>   1) GCCcore/11.2.0                     9) hwloc/2.5.0-GCCcore-11.2.0
>>>>   2) zlib/1.2.11-GCCcore-11.2.0        10) OpenSSL/1.1
>>>>   3) binutils/2.37-GCCcore-11.2.0      11) libevent/2.1.12-GCCcore-11.2.0
>>>>   4) GCC/11.2.0                        12) UCX/1.11.2-GCCcore-11.2.0
>>>>   5) numactl/2.0.14-GCCcore-11.2.0     13) libfabric/1.13.2-GCCcore-11.2.0
>>>>   6) XZ/5.2.5-GCCcore-11.2.0           14) PMIx/4.1.0-GCCcore-11.2.0
>>>>   7) libxml2/2.9.10-GCCcore-11.2.0     15) OpenMPI/4.1.1-GCC-11.2.0
>>>>   8) libpciaccess/0.16-GCCcore-11.2.0
>>>>
>>>> $ mpicc mpi_test.c
>>>> $ mpirun -n 2 a.out
>>>>
>>>> (null): There are 2 processes
>>>>
>>>> (null): Rank  1:  d008
>>>>
>>>> (null): Rank  0:  d008
>>>>
>>>>
>>>> I also tried the OpenMPI/4.1.0-GCC-10.2.0 module, but this still gives
>>>> the error messages:
>>>>
>>>> $ ml OpenMPI/4.1.0-GCC-10.2.0
>>>> $ ml
>>>>
>>>> Currently Loaded Modules:
>>>>   1) GCCcore/10.2.0                     8) libpciaccess/0.16-GCCcore-10.2.0
>>>>   2) zlib/1.2.11-GCCcore-10.2.0         9) hwloc/2.2.0-GCCcore-10.2.0
>>>>   3) binutils/2.35-GCCcore-10.2.0      10) libevent/2.1.12-GCCcore-10.2.0
>>>>   4) GCC/10.2.0                        11) UCX/1.9.0-GCCcore-10.2.0
>>>>   5) numactl/2.0.13-GCCcore-10.2.0     12) libfabric/1.11.0-GCCcore-10.2.0
>>>>   6) XZ/5.2.5-GCCcore-10.2.0           13) PMIx/3.1.5-GCCcore-10.2.0
>>>>   7) libxml2/2.9.10-GCCcore-10.2.0     14) OpenMPI/4.1.0-GCC-10.2.0
>>>>
>>>> $ mpicc mpi_test.c
>>>> $ mpirun -n 2 a.out
>>>> [1638449983.577933] [d008:910356:0]   ib_iface.c:966  UCX  ERROR ibv_create_cq(cqe=4096) failed: Operation not supported
>>>> [1638449983.577827] [d008:910355:0]   ib_iface.c:966  UCX  ERROR ibv_create_cq(cqe=4096) failed: Operation not supported
>>>> [d008.nifl.fysik.dtu.dk:910355] pml_ucx.c:273  Error: Failed to create UCP worker
>>>> [d008.nifl.fysik.dtu.dk:910356] pml_ucx.c:273  Error: Failed to create UCP worker
>>>>
>>>> (null): There are 2 processes
>>>>
>>>> (null): Rank  0:  d008
>>>>
>>>> (null): Rank  1:  d008
>>>>
>>>> Conclusion: The foss-2021b toolchain with OpenMPI/4.1.1-GCC-11.2.0 seems
>>>> to be required on systems with an Omni-Path network fabric on AlmaLinux
>>>> 8.5.  Perhaps the newer UCX/1.11.2-GCCcore-11.2.0 is really what's
>>>> needed, compared to UCX/1.9.0-GCCcore-10.2.0 from foss-2020b.
>>>>
>>>> Does anyone have comments on this?
>>>
>>> UCX is the problem here in combination with libfabric I think. Write a
>>> hook that upgrades the version of UCX to 1.11-something if it's <
>>> 1.11-ish, or just that specific version if you have older-and-working
>>> versions.
>>
>> You are right that the nodes with Omni-Path have different libfabric
>> packages which come from the EL8.5 BaseOS as well as the latest
>> Cornelis/Intel Omni-Path drivers:
>>
>> $ rpm -qa | grep libfabric
>> libfabric-verbs-1.10.0-2.x86_64
>> libfabric-1.12.1-1.el8.x86_64
>> libfabric-devel-1.12.1-1.el8.x86_64
>> libfabric-psm2-1.10.0-2.x86_64
>>
>> The 1.12 packages are from EL8.5, and 1.10 packages are from Cornelis.
>>
>> Regarding UCX, I was first using the trusted foss-2020b toolchain which
>> includes UCX/1.9.0-GCCcore-10.2.0. I guess that we shouldn't mess with
>> the toolchains?
>>
>> The foss-2021b toolchain includes the newer UCX 1.11, which seems to
>> solve this particular problem.
>>
>> Can we make any best practices recommendations from these observations?
>
> I didn't check properly, but UCX does not depend on libfabric, OpenMPI
> does, so I'd write a hook that replaces libfabric < 1.12 with at least
> 1.12.1.
> Sometimes you just have to mess with the toolchains, and this looks like
> one of those situations.
>
> Or as a test build your own OpenMPI-4.1.0 or 4.0.5 (that 2020b uses)
> with an updated libfabric and check if that fixes the problem. And if it
> does, write a hook that replaces libfabric. See the framework/contrib
> for examples, I did that for UCX so there is code there to show you how.
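
The hook suggested above might look roughly like the sketch below. It is not
the code from framework/contrib: the helper bump_dependency is made up for
illustration, versions are assumed to be purely numeric and dot-separated, and
the parse_hook wiring assumes that dependencies are still plain
(name, version, ...) tuples when the hook runs.

```python
def _version_key(version):
    """Turn '1.12.1' into (1, 12, 1); assumes numeric dotted versions only."""
    return tuple(int(part) for part in version.split("."))


def bump_dependency(dependencies, name, min_version, new_version):
    """Return a new list in which `name` older than min_version uses new_version."""
    updated = []
    for dep in dependencies:
        if dep[0] == name and _version_key(dep[1]) < _version_key(min_version):
            # Keep any extra tuple elements (versionsuffix, toolchain) untouched.
            dep = (dep[0], new_version) + tuple(dep[2:])
        updated.append(dep)
    return updated


def parse_hook(ec, *args, **kwargs):
    """EasyBuild parse hook (assumed wiring, see lead-in): bump libfabric for OpenMPI."""
    if ec.name == "OpenMPI":
        ec["dependencies"] = bump_dependency(
            ec["dependencies"], "libfabric", "1.12.1", "1.12.1"
        )
```

Such a file would be passed to eb with the --hooks option; a libfabric
easyconfig for the replacement version and toolchain has to exist as well.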


I don't feel qualified to mess around with modifying EB toolchains...

The foss-2021b toolchain including OpenMPI/4.1.1-GCC-11.2.0 seems to solve 
the present problem.  Do you think there are any disadvantages with asking 
users to go for foss-2021b?  Of course we may need several modules to be 
upgraded from foss-2020b to foss-2021b.


Another possibility may be the coming driver upgrade from Cornelis 
Networks to support the Omni-Path fabric on EL 8.4 and EL 8.5.  I'm 
definitely going to check this when it becomes available.


Thanks,
Ole


[easybuild] MACS3 and PythonBundle

2021-12-03 Thread Arnau
Hi all,

I was installing MACS3 using PythonBundle. My understanding is that a
bundle is just a list of packages in some order so they all know about each
other.

MACS3 needs 3 packages to be installed: Cython, numpy, cykhash, in this
order, so I created something like:

[...]

exts_list = [
    ('Cython', '0.29.24', {
    }),
    ('numpy', '1.21.3', {
        'sources': ['%(name)s-%(version)s.zip'],
        'patches': [
            'numpy-1.18.2-mkl.patch',
            'numpy-1.20.3_disable-broken-override-test.patch',
            'numpy-1.20.3_disable_fortran_callback_test.patch',
        ],
    }),
    ('cykhash', '2.0.0', {
    }),
    ('MACS3', '3.0.0a6', {
        'modulename': 'MACS3',
    }),
]

[...]

The first 3 installed successfully but MACS3 failed to install because it
was not finding cykhash:

ERROR: Could not find a version that satisfies the requirement
cykhash>=1.0.2 (from macs3) (from versions: none)
ERROR: No matching distribution found for cykhash>=1.0.2


It was mainly because of the --ignore-installed option in the pip install
command:

    pip install --prefix=/home/bria/easybuildinstall/software/MACS3/3.0.0a6-foss-2021a --ignore-installed --no-index --no-build-isolation .

So I had to pass the "pip_ignore_installed = False" option and then MACS3
was able to find cykhash...
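
For reference, a sketch of where that option could go. This is a hypothetical
excerpt, not the complete easyconfig, and it assumes the parameter is set at
the top level so that it applies to every extension in the bundle:

```python
# Hypothetical excerpt of the MACS3 easyconfig; only the relevant lines.
easyblock = 'PythonBundle'

name = 'MACS3'
version = '3.0.0a6'

# Assumption: set once at the top level so it applies to all extensions;
# this drops --ignore-installed from the pip command line shown above.
pip_ignore_installed = False
```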


I find this a bit weird because, as said at the beginning, my understanding
of the bundle is: "a list of Python packages installed in a way that each
knows about the others (and that's why order matters)."

Am I wrong in my understanding of how PythonBundle works?

(I have other questions, but I guess that they all depend on this one :-) )

TIA,
Arnau


[easybuild] Question on Python package installation

2021-12-03 Thread Arnau
Hi all,

I have a generic question on how to install Python packages. It's related to
this GitHub issue: "Docs needed on when to use the Python EasyConfig, Bundle,
or PythonPackage to install Python packages (similar for R?)", Issue #398 in
easybuilders/easybuild (github.com).

What do you usually use for Python package installation? What is your rule
(or the easyconfig rule), if any?

TIA,
Arnau