Re: [easybuild] OpenMPI-4.1.4-GCC-12.2.0.eb Sanity check failed on AMD "Genoa" node

2023-09-28 Thread Ole Holm Nielsen

Dear Kenneth,

On 9/28/23 10:07, Kenneth Hoste wrote:
>> Unfortunately, building the foss-2022b toolchain exits during the
>> testing phase of OpenMPI-4.1.4-GCC-12.2.0.eb as shown below.  Does
>> anyone have ideas about what might be wrong?

...
> By default OpenMPI is being configured with "--with-verbs", you should see
> that popping up in the log file (or use "eb --trace" to get some more info
> during the installation).


Thanks, I sort of suspected that IB was somehow being assumed tacitly by 
EB :-)


> If you don't have Infiniband, you should add --without-verbs via
> configopts in your OpenMPI easyconfig file (which should prevent the
> OpenMPI easyblock from adding --with-verbs), or use a hook. See for
> example
> https://docs.easybuild.io/hooks/#replace-with-verbs-with-without-verbs-in-openmpi-configure-options
> although that exact example won't work: in the pre_configure_hook you
> should instead inject --without-verbs directly into
> self.cfg['configopts'].
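
For illustration, a minimal sketch of the kind of hook described above. It
assumes that configopts is a plain string at the time the hook runs (not a
list of option sets), and the file name my_hooks.py is hypothetical. It is
not taken from the EasyBuild documentation and has not been tested on the
node discussed in this thread.

```python
# Hypothetical hooks file, e.g. passed to EasyBuild with: eb --hooks=my_hooks.py ...

def pre_configure_hook(self, *args, **kwargs):
    """Append --without-verbs to the configure options of OpenMPI builds only."""
    if self.name != 'OpenMPI':
        return
    # Assumption: configopts is a plain string here, not a list of option sets.
    configopts = self.cfg['configopts']
    if '--without-verbs' not in configopts:
        self.cfg['configopts'] = (configopts + ' --without-verbs').strip()
```

The easyconfig route mentioned above amounts to the same thing: having
--without-verbs present in the configopts string of the OpenMPI easyconfig.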


We will eventually use our AMD Genoa EB modules on some nodes to be
installed next month, which will include Mellanox/Nvidia Infiniband.


Question: Would it help if I took an old (about 10 years old) Mellanox IB
PCIe adapter that is lying around and installed it in my server?  Or maybe
a relatively new Omni-Path adapter?


Would that make the OpenMPI EB module happy, and would the module work 
with our future nodes?


Thanks,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,


Re: [easybuild] OpenMPI-4.1.4-GCC-12.2.0.eb Sanity check failed on AMD "Genoa" node

2023-09-28 Thread Kenneth Hoste

Dear Ole,

On 26/09/2023 08:24, Ole Holm Nielsen wrote:
I'm starting EasyBuild up on our new AMD "Genoa" platform with 1 AMD 
EPYC 9124 16-Core Processor with 2 threads/core, 384 GB RAM, Ethernet 
network only, and AlmaLinux 8.8 OS.


Unfortunately, building the foss-2022b toolchain exits during the 
testing phase of OpenMPI-4.1.4-GCC-12.2.0.eb as shown below.  Does 
anyone have ideas about what might be wrong?


$ eb foss-2022b.eb -r
(lines deleted)
== processing EasyBuild easyconfig 
/home/modules/software/EasyBuild/4.8.1/easybuild/easyconfigs/o/OpenMPI/OpenMPI-4.1.4-GCC-12.2.0.eb

== building and installing OpenMPI/4.1.4-GCC-12.2.0...
== fetching files...
== ... (took 1 secs)
== creating build dir, resetting environment...
== unpacking...
== ... (took 1 secs)
== patching...
== preparing...
== configuring...
== ... (took 2 mins 22 secs)
== building...
== ... (took 4 mins 24 secs)
== testing...
== ... (took 36 secs)
== installing...
== ... (took 1 min 15 secs)
== taking care of extensions...
== restore after iterating...
== postprocessing...
== sanity checking...
== ... (took 5 secs)
== FAILED: Installation ended unsuccessfully (build directory: 
/dev/shm/OpenMPI/4.1.4/GCC-12.2.0): build failed (first 300 chars): 
Sanity check failed: sanity check command 
OMPI_MCA_rmaps_base_oversubscribe=1 mpirun -n 8 
/dev/shm/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_hello_c exited with code 1 
(output: 
--
A requested component was not found, or was unable to be (took 8 mins 48 
secs)
== Results of the build can be found in the log file(s) 
/tmp/eb-watuyqhw/easybuild-OpenMPI-4.1.4-20230926.080727.GEZtD.log
ERROR: Build of 
/home/modules/software/EasyBuild/4.8.1/easybuild/easyconfigs/o/OpenMPI/OpenMPI-4.1.4-GCC-12.2.0.eb failed (err: 'build failed (first 300 chars): Sanity check failed: sanity check command OMPI_MCA_rmaps_base_oversubscribe=1 mpirun -n 8 /dev/shm/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_hello_c exited with code 1 (output: --\nA requested component was not found, or was unable to be')



The log file shows messages about missing components:

(lines deleted)
--
[e000.nifl.fysik.dtu.dk:1849636] PML cm cannot be selected
[e000.nifl.fysik.dtu.dk:1849635] PML cm cannot be selected
[e000.nifl.fysik.dtu.dk:1849626] 2 more processes have sent help message 
help-mca-base.txt / find-available:not-valid
[e000.nifl.fysik.dtu.dk:1849626] Set MCA parameter 
"orte_base_help_aggregate" to 0 to see all help / error messages
[e000.nifl.fysik.dtu.dk:1849626] 1 more process has sent help message 
help-mca-base.txt / find-available:none found

)
sanity check command mpirun -n 1 
/dev/shm/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_hello_usempi exited with code 
1 (output: 
--

A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:  e000.nifl.fysik.dtu.dk
Framework: mtl
Component: psm2
--
--
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

   Host:  e000
   Framework: pml
--
[e000.nifl.fysik.dtu.dk:1849661] PML cm cannot be selected
) (at easybuild/framework/easyblock.py:3655 in _sanity_check_step)
== 2023-09-26 08:16:16,111 build_log.py:267 INFO ... (took 5 secs)
== 2023-09-26 08:16:16,111 filetools.py:2012 INFO Removing lock 
/home/modules/software/.locks/_home_modules_software_OpenMPI_4.1.4-GCC-12.2.0.lock...
== 2023-09-26 08:16:16,112 filetools.py:383 INFO Path 
/home/modules/software/.locks/_home_modules_software_OpenMPI_4.1.4-GCC-12.2.0.lock successfully removed.
== 2023-09-26 08:16:16,112 filetools.py:2016 INFO Lock removed: 
/home/modules/software/.locks/_home_modules_software_OpenMPI_4.1.4-GCC-12.2.0.lock
== 2023-09-26 08:16:16,112 easyblock.py:4277 WARNING build failed (first 
300 chars): Sanity check failed: sanity check command 
OMPI_MCA_rmaps_base_oversubscribe=1 mpirun -n 8 
/dev/shm/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_hello_c exited with code 1 
(output: 
--

A requested component was not found, or was unable to be
== 

[easybuild] OpenMPI-4.1.4-GCC-12.2.0.eb Sanity check failed on AMD "Genoa" node

2023-09-26 Thread Ole Holm Nielsen
I'm starting EasyBuild up on our new AMD "Genoa" platform with 1 AMD EPYC 
9124 16-Core Processor with 2 threads/core, 384 GB RAM, Ethernet network 
only, and AlmaLinux 8.8 OS.


Unfortunately, building the foss-2022b toolchain exits during the testing 
phase of OpenMPI-4.1.4-GCC-12.2.0.eb as shown below.  Does anyone have 
ideas about what might be wrong?


$ eb foss-2022b.eb -r
(lines deleted)
== processing EasyBuild easyconfig 
/home/modules/software/EasyBuild/4.8.1/easybuild/easyconfigs/o/OpenMPI/OpenMPI-4.1.4-GCC-12.2.0.eb

== building and installing OpenMPI/4.1.4-GCC-12.2.0...
== fetching files...
== ... (took 1 secs)
== creating build dir, resetting environment...
== unpacking...
== ... (took 1 secs)
== patching...
== preparing...
== configuring...
== ... (took 2 mins 22 secs)
== building...
== ... (took 4 mins 24 secs)
== testing...
== ... (took 36 secs)
== installing...
== ... (took 1 min 15 secs)
== taking care of extensions...
== restore after iterating...
== postprocessing...
== sanity checking...
== ... (took 5 secs)
== FAILED: Installation ended unsuccessfully (build directory: 
/dev/shm/OpenMPI/4.1.4/GCC-12.2.0): build failed (first 300 chars): Sanity 
check failed: sanity check command OMPI_MCA_rmaps_base_oversubscribe=1 
mpirun -n 8 /dev/shm/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_hello_c exited with 
code 1 (output: 
--

A requested component was not found, or was unable to be (took 8 mins 48 secs)
== Results of the build can be found in the log file(s) 
/tmp/eb-watuyqhw/easybuild-OpenMPI-4.1.4-20230926.080727.GEZtD.log
ERROR: Build of 
/home/modules/software/EasyBuild/4.8.1/easybuild/easyconfigs/o/OpenMPI/OpenMPI-4.1.4-GCC-12.2.0.eb 
failed (err: 'build failed (first 300 chars): Sanity check failed: sanity 
check command OMPI_MCA_rmaps_base_oversubscribe=1 mpirun -n 8 
/dev/shm/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_hello_c exited with code 1 
(output: 
--\nA 
requested component was not found, or was unable to be')



The log file shows messages about missing components:

(lines deleted)
--
[e000.nifl.fysik.dtu.dk:1849636] PML cm cannot be selected
[e000.nifl.fysik.dtu.dk:1849635] PML cm cannot be selected
[e000.nifl.fysik.dtu.dk:1849626] 2 more processes have sent help message 
help-mca-base.txt / find-available:not-valid
[e000.nifl.fysik.dtu.dk:1849626] Set MCA parameter 
"orte_base_help_aggregate" to 0 to see all help / error messages
[e000.nifl.fysik.dtu.dk:1849626] 1 more process has sent help message 
help-mca-base.txt / find-available:none found

)
sanity check command mpirun -n 1 
/dev/shm/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_hello_usempi exited with code 1 
(output: 
--

A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:  e000.nifl.fysik.dtu.dk
Framework: mtl
Component: psm2
--
--
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:  e000
  Framework: pml
--
[e000.nifl.fysik.dtu.dk:1849661] PML cm cannot be selected
) (at easybuild/framework/easyblock.py:3655 in _sanity_check_step)
== 2023-09-26 08:16:16,111 build_log.py:267 INFO ... (took 5 secs)
== 2023-09-26 08:16:16,111 filetools.py:2012 INFO Removing lock 
/home/modules/software/.locks/_home_modules_software_OpenMPI_4.1.4-GCC-12.2.0.lock...
== 2023-09-26 08:16:16,112 filetools.py:383 INFO Path 
/home/modules/software/.locks/_home_modules_software_OpenMPI_4.1.4-GCC-12.2.0.lock 
successfully removed.
== 2023-09-26 08:16:16,112 filetools.py:2016 INFO Lock removed: 
/home/modules/software/.locks/_home_modules_software_OpenMPI_4.1.4-GCC-12.2.0.lock
== 2023-09-26 08:16:16,112 easyblock.py:4277 WARNING build failed (first 
300 chars): Sanity check failed: sanity check command 
OMPI_MCA_rmaps_base_oversubscribe=1 mpirun -n 8 
/dev/shm/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_hello_c exited with code 1 
(output: 
--

A requested component was not found, or was unable to be
== 2023-09-26 08:16:16,112 easyblock.py:328 INFO Closing log