Hi Åke,
On 12/3/21 08:27, Åke Sandgren wrote:
On 02-12-2021 14:18, Åke Sandgren wrote:
On 12/2/21 2:06 PM, Ole Holm Nielsen wrote:
These are updated observations of running OpenMPI codes with an
Omni-Path network fabric on AlmaLinux 8.5:
Using the foss-2021b toolchain and OpenMPI/4.1.1-GCC-11.2.0, my trivial
MPI test code works correctly:
$ ml OpenMPI
$ ml
Currently Loaded Modules:
  1) GCCcore/11.2.0                      9) hwloc/2.5.0-GCCcore-11.2.0
  2) zlib/1.2.11-GCCcore-11.2.0         10) OpenSSL/1.1
  3) binutils/2.37-GCCcore-11.2.0       11) libevent/2.1.12-GCCcore-11.2.0
  4) GCC/11.2.0                         12) UCX/1.11.2-GCCcore-11.2.0
  5) numactl/2.0.14-GCCcore-11.2.0      13) libfabric/1.13.2-GCCcore-11.2.0
  6) XZ/5.2.5-GCCcore-11.2.0            14) PMIx/4.1.0-GCCcore-11.2.0
  7) libxml2/2.9.10-GCCcore-11.2.0      15) OpenMPI/4.1.1-GCC-11.2.0
  8) libpciaccess/0.16-GCCcore-11.2.0
$ mpicc mpi_test.c
$ mpirun -n 2 a.out
(null): There are 2 processes
(null): Rank 1: d008
(null): Rank 0: d008
I also tried the OpenMPI/4.1.0-GCC-10.2.0 module, but this still gives
the error messages:
$ ml OpenMPI/4.1.0-GCC-10.2.0
$ ml
Currently Loaded Modules:
  1) GCCcore/10.2.0                      8) libpciaccess/0.16-GCCcore-10.2.0
  2) zlib/1.2.11-GCCcore-10.2.0          9) hwloc/2.2.0-GCCcore-10.2.0
  3) binutils/2.35-GCCcore-10.2.0       10) libevent/2.1.12-GCCcore-10.2.0
  4) GCC/10.2.0                         11) UCX/1.9.0-GCCcore-10.2.0
  5) numactl/2.0.13-GCCcore-10.2.0      12) libfabric/1.11.0-GCCcore-10.2.0
  6) XZ/5.2.5-GCCcore-10.2.0            13) PMIx/3.1.5-GCCcore-10.2.0
  7) libxml2/2.9.10-GCCcore-10.2.0      14) OpenMPI/4.1.0-GCC-10.2.0
$ mpicc mpi_test.c
$ mpirun -n 2 a.out
[1638449983.577933] [d008:910356:0] ib_iface.c:966 UCX ERROR ibv_create_cq(cqe=4096) failed: Operation not supported
[1638449983.577827] [d008:910355:0] ib_iface.c:966 UCX ERROR ibv_create_cq(cqe=4096) failed: Operation not supported
[d008.nifl.fysik.dtu.dk:910355] pml_ucx.c:273 Error: Failed to create UCP worker
[d008.nifl.fysik.dtu.dk:910356] pml_ucx.c:273 Error: Failed to create UCP worker
(null): There are 2 processes
(null): Rank 0: d008
(null): Rank 1: d008
Conclusion: The foss-2021b toolchain with OpenMPI/4.1.1-GCC-11.2.0 seems
to be required on systems with an Omni-Path network fabric on AlmaLinux
8.5. Perhaps the newer UCX/1.11.2-GCCcore-11.2.0 is really what's
needed, compared to UCX/1.9.0-GCCcore-10.2.0 from foss-2020b.
Does anyone have comments on this?
I think the problem here is UCX in combination with libfabric. Write a
hook that upgrades UCX to 1.11-something if the version is older than
roughly 1.11, or that replaces only that specific version if you have
older versions that work.
You are right that the nodes with Omni-Path have different libfabric
packages which come from the EL8.5 BaseOS as well as the latest
Cornelis/Intel Omni-Path drivers:
$ rpm -qa | grep libfabric
libfabric-verbs-1.10.0-2.x86_64
libfabric-1.12.1-1.el8.x86_64
libfabric-devel-1.12.1-1.el8.x86_64
libfabric-psm2-1.10.0-2.x86_64
The 1.12 packages are from EL8.5, and the 1.10 packages are from Cornelis.
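To spot this kind of version mix on a node, the package names can be grouped by version. This is a small sketch, assuming the usual name-version-release.arch layout of RPM package names. The sample input is the rpm listing above, and `group_by_version` is a hypothetical helper, not part of any existing tool.

```python
import re
from collections import defaultdict

# Sample input: the "rpm -qa | grep libfabric" output quoted in this thread.
RPM_NAMES = """\
libfabric-verbs-1.10.0-2.x86_64
libfabric-1.12.1-1.el8.x86_64
libfabric-devel-1.12.1-1.el8.x86_64
libfabric-psm2-1.10.0-2.x86_64
"""

# Assumed layout: name-version-release.arch, where the name may contain hyphens.
RPM_PATTERN = re.compile(
    r'^(?P<name>.+)-(?P<version>[^-]+)-(?P<release>[^-]+)\.(?P<arch>[^.]+)$'
)


def group_by_version(text):
    """Map each package version to the list of package names that carry it."""
    groups = defaultdict(list)
    for line in text.splitlines():
        match = RPM_PATTERN.match(line.strip())
        if match:
            groups[match.group('version')].append(match.group('name'))
    return dict(groups)


for version, names in group_by_version(RPM_NAMES).items():
    print(version, names)
```

More than one version in the result means the node carries libfabric packages from more than one source.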
Regarding UCX, I was first using the trusted foss-2020b toolchain which
includes UCX/1.9.0-GCCcore-10.2.0. I guess that we shouldn't mess with
the toolchains?
The foss-2021b toolchain includes the newer UCX 1.11, which seems to
solve this particular problem.
Can we make any best practices recommendations from these observations?
I didn't check properly, but UCX does not depend on libfabric; OpenMPI
does. So I'd write a hook that replaces libfabric < 1.12 with at least
1.12.1.
Sometimes you just have to mess with the toolchains, and this looks like
one of those situations.
Or, as a test, build your own OpenMPI-4.1.0 or 4.0.5 (that 2020b uses)
with an updated libfabric and check whether that fixes the problem. If it
does, write a hook that replaces libfabric. See framework/contrib for
examples; I did that for UCX, so there is code there to show you how.
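The dependency swap suggested here can be sketched in plain Python. This only illustrates the version check and the replacement; it is not the code in framework/contrib. `version_tuple` and `bump_dependency` are hypothetical helper names, and the dependency list is assumed to use the `(name, version)` tuples found in easyconfig files. How an EasyBuild hook actually exposes the dependencies should be checked against the contrib examples.

```python
import re


def version_tuple(version):
    """Turn a version string such as '1.11.0' into (1, 11, 0)."""
    return tuple(int(part) for part in re.findall(r'\d+', version))


def bump_dependency(dependencies, name, min_version):
    """Return a new list in which dependency `name` is at least `min_version`.

    Hypothetical helper: `dependencies` is assumed to hold (name, version, ...)
    tuples. Entries that already meet the minimum are left untouched.
    """
    bumped = []
    for dep in dependencies:
        dep_name, dep_version = dep[0], dep[1]
        too_old = version_tuple(dep_version) < version_tuple(min_version)
        if dep_name == name and too_old:
            # Keep any extra tuple fields (e.g. a version suffix) as they were.
            bumped.append((dep_name, min_version) + tuple(dep[2:]))
        else:
            bumped.append(tuple(dep))
    return bumped


# Subset of the versions seen with the OpenMPI/4.1.0-GCC-10.2.0 module.
deps = [('hwloc', '2.2.0'), ('UCX', '1.9.0'), ('libfabric', '1.11.0')]
print(bump_dependency(deps, 'libfabric', '1.12.1'))
```

Comparing numeric tuples matters here, since a plain string comparison would rank 1.9.0 above 1.11.2.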
I don't feel qualified to mess around with modifying EB toolchains...
The foss-2021b toolchain including OpenMPI/4.1.1-GCC-11.2.0 seems to solve
the present problem. Do you think there are any disadvantages to asking
users to go for foss-2021b? Of course, we may need several modules to be
upgraded from foss-2020b to foss-2021b.
Another possibility may be the coming driver upgrade from Cornelis
Networks to support the Omni-Path fabric on EL 8.4 and EL 8.5. I'm
definitely going to check this when it becomes available.
Thanks,
Ole