I am trying to build the new GPAW module from
https://github.com/easybuilders/easybuild-easyconfigs/pull/9834:
$ eb --from-pr=9834 GPAW-20.1.0-foss-2019b-Python-3.7.4.eb -r
This works without problems on our own Linux (CentOS 7.7) cluster with
Slurm 19.05.
However, on a remote cluster running CentOS 7.6 with Slurm 18.08.1 the
same command fails. There I have to run EasyBuild in an interactive Slurm
job on a compute node, which I access using Slurm's "srun" command. The
build then always fails during the sanity-check stage with these errors:
== sanity checking...
== FAILED: Installation ended unsuccessfully (build directory:
/dev/shm/GPAW/20.1.0/foss-2019b-Python-3.7.4): build failed (first 300
chars): Sanity check failed: command
"/groups/others/ohni/skylake/software/Python/3.7.4-GCCcore-8.3.0/bin/python
-c "import gpaw"" failed; output:
OPAL ERROR: Not initialized in file pmix2x_client.c at line 112
and the log file says:
== 2020-02-14 11:27:00,210 run.py:219 INFO running cmd:
/groups/others/ohni/skylake/software/Python/3.7.4-GCCcore-8.3.0/bin/python
-c "import gpaw"
== 2020-02-14 11:27:00,487 extension.py:212 WARNING Sanity check for
'GPAW' extension failed: command
"/groups/others/ohni/skylake/software/Python/3.7.4-GCCcore-8.3.0/bin/python
-c "import gpaw"" failed; output:
[node252.cluster:100651] OPAL ERROR: Not initialized in file
pmix2x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
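For what it's worth, the error seems to come from "import gpaw" triggering
MPI_Init at import time, which inside an srun-launched shell makes Open MPI
attempt a PMI "direct launch". The PMI plugins this Slurm installation
offers for direct launch can be listed with (guarded so the snippet is safe
to run on machines without Slurm):

```shell
# List the PMI plugins Slurm can offer to direct-launched MPI jobs;
# guarded so the command is harmless where srun is not installed
command -v srun >/dev/null 2>&1 \
    && srun --mpi=list \
    || echo "srun not in PATH"
```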
This seems strange because the slurm-libpmi RPM is in fact installed on
the system:
$ rpm -qa | grep slurm
slurm-libpmi-18.08.1-1.el7.x86_64
slurm-example-configs-18.08.1-1.el7.x86_64
slurm-18.08.1-1.el7.x86_64
slurm-pam_slurm-18.08.1-1.el7.x86_64
slurm-slurmd-18.08.1-1.el7.x86_64
$ rpm -ql slurm-libpmi-18.08.1-1.el7.x86_64
/usr/lib64/libpmi.so
/usr/lib64/libpmi.so.0
/usr/lib64/libpmi.so.0.0.0
/usr/lib64/libpmi2.so
/usr/lib64/libpmi2.so.0
/usr/lib64/libpmi2.so.0.0.0
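For completeness, the presence of the libraries on the compute node itself
(rather than just in the RPM database) can be checked directly; the
/usr/lib64 path comes from the rpm -ql output above and may differ on other
systems:

```shell
# Confirm the PMI shared libraries from slurm-libpmi really exist on
# this node; prints a message instead of failing when they are absent
ls -l /usr/lib64/libpmi*.so* 2>/dev/null || echo "no PMI libraries found"
```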
Also, the OpenMPI module loaded for the build appears sane:
$ which mpirun
~/skylake/software/OpenMPI/3.1.4-GCC-8.3.0/bin/mpirun
$ mpirun --version
mpirun (Open MPI) 3.1.4
$ ompi_info | grep ras
MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v3.1.4)
MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v3.1.4)
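ompi_info can also report whether any PMI/PMIx components were actually
compiled into this Open MPI build, which is the thing the error message
complains about (a guarded sketch; exact component names vary between Open
MPI versions):

```shell
# Ask ompi_info whether PMI/PMIx support was compiled in; guarded so
# the snippet also runs cleanly where ompi_info is not on the PATH
command -v ompi_info >/dev/null 2>&1 \
    && { ompi_info --parsable | grep -i pmi || echo "no PMI components"; } \
    || echo "ompi_info not in PATH"
```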
Googling for the OPAL ERROR, I found a posting saying that a missing
CentOS 7 hwloc-devel RPM was the source of the problem:
https://users.open-mpi.narkive.com/C9HOavWo/ompi-users-fwd-openmpi-3-1-0-on-aarch64
Has anyone else seen this error? Could a missing hwloc-devel OS
prerequisite be causing the problem? I have tried explicitly loading the
EB module hwloc/1.11.12-GCCcore-8.3.0, but that did not help.
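One workaround I am considering, though I have not verified it on this
cluster, is to unset the Slurm launch variables in the interactive shell so
that Open MPI falls back to singleton initialization during the sanity
check. The clear_slurm_env helper below is a hypothetical sketch of that
idea, not an established fix:

```shell
# Hypothetical workaround (unverified): unset Slurm/PMI launch variables
# so Open MPI initializes as a singleton instead of attempting a PMI
# direct launch when python runs "import gpaw".
clear_slurm_env() {
    for v in $(env | cut -d= -f1 | grep -E '^(SLURM|PMIX?)_'); do
        unset "$v"
    done
}
# Usage in the interactive shell, before re-running eb:
#   clear_slurm_env
#   eb --from-pr=9834 GPAW-20.1.0-foss-2019b-Python-3.7.4.eb -r
```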
Thanks,
Ole
--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark