Re: [easybuild] OpenMPI-4.1.4-GCC-12.2.0.eb Sanity check failed on AMD "Genoa" node
Dear Kenneth,

On 9/28/23 10:07, Kenneth Hoste wrote:
>> Unfortunately, building the foss-2022b toolchain exits during the testing phase of OpenMPI-4.1.4-GCC-12.2.0.eb as shown below. Does anyone have ideas about what might be wrong?
>> ...
>
> By default OpenMPI is being configured with "--with-verbs"; you should see that popping up in the log file (or use "eb --trace" to get some more info during the installation).

Thanks, I sort of suspected that IB was somehow being assumed tacitly by EB :-)

> If you don't have InfiniBand, you should add --without-verbs via configopts in your OpenMPI easyconfig file (which should prevent the OpenMPI easyblock from adding --with-verbs), or use a hook (see for example https://docs.easybuild.io/hooks/#replace-with-verbs-with-without-verbs-in-openmpi-configure-options). That exact example won't work, though; you should instead just hard-inject --without-verbs in self.cfg['configopts'] in the pre_configure_hook.

We will eventually use our AMD Genoa EB modules on some nodes, to be installed next month, which will include Mellanox/NVIDIA InfiniBand.

Question: Would it help if I took an old (about 10 years old) Mellanox IB PCIe adapter that is lying around and mounted it in my server? Or maybe a relatively new Omni-Path adapter? Would that make the OpenMPI EB module happy, and would the module work with our future nodes?

Thanks,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
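Kenneth's suggestion of hard-injecting --without-verbs in a pre_configure_hook could look roughly like the sketch below. It is a minimal, untested illustration, assuming that the hooks file is passed to EasyBuild with `eb --hooks=<path>` and that `configopts` is a plain string for OpenMPI (not a list, as it would be in an easyconfig that iterates over several configure runs). The file name and log message are hypothetical, not taken from the EasyBuild documentation.

```python
# site_hooks.py (hypothetical file name): EasyBuild hooks for nodes without InfiniBand.
# Assumption: every OpenMPI built with these hooks should be configured without verbs.


def pre_configure_hook(self, *args, **kwargs):
    """Hard-inject --without-verbs into OpenMPI's configopts before configure runs.

    Per the advice in this thread, having --without-verbs already present should
    keep the OpenMPI easyblock from adding --with-verbs by itself.
    Assumes self.cfg['configopts'] is a string, not a list.
    """
    if self.name != 'OpenMPI':
        return  # leave every other package untouched
    configopts = self.cfg['configopts']
    if '--without-verbs' not in configopts:
        self.log.info("[pre-configure hook] Adding --without-verbs to configopts")
        # Append rather than overwrite, so site-specific options survive.
        self.cfg['configopts'] = (configopts + ' --without-verbs').strip()
```

It would be used as, e.g., `eb foss-2022b.eb -r --hooks=$HOME/site_hooks.py` (path hypothetical). The easyconfig route amounts to a copy of OpenMPI-4.1.4-GCC-12.2.0.eb with `--without-verbs` in its `configopts` string; if that easyconfig already sets `configopts`, append to it rather than replacing it.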
Re: [easybuild] OpenMPI-4.1.4-GCC-12.2.0.eb Sanity check failed on AMD "Genoa" node
Dear Ole,

On 26/09/2023 08:24, Ole Holm Nielsen wrote:
> (quoted original message deleted)
[easybuild] OpenMPI-4.1.4-GCC-12.2.0.eb Sanity check failed on AMD "Genoa" node
I'm starting EasyBuild up on our new AMD "Genoa" platform with 1 AMD EPYC 9124 16-Core Processor with 2 threads/core, 384 GB RAM, Ethernet network only, and AlmaLinux 8.8 OS.

Unfortunately, building the foss-2022b toolchain exits during the testing phase of OpenMPI-4.1.4-GCC-12.2.0.eb as shown below. Does anyone have ideas about what might be wrong?

$ eb foss-2022b.eb -r
(lines deleted)
== processing EasyBuild easyconfig /home/modules/software/EasyBuild/4.8.1/easybuild/easyconfigs/o/OpenMPI/OpenMPI-4.1.4-GCC-12.2.0.eb
== building and installing OpenMPI/4.1.4-GCC-12.2.0...
== fetching files...
== ... (took 1 secs)
== creating build dir, resetting environment...
== unpacking...
== ... (took 1 secs)
== patching...
== preparing...
== configuring...
== ... (took 2 mins 22 secs)
== building...
== ... (took 4 mins 24 secs)
== testing...
== ... (took 36 secs)
== installing...
== ... (took 1 min 15 secs)
== taking care of extensions...
== restore after iterating...
== postprocessing...
== sanity checking...
== ... (took 5 secs)
== FAILED: Installation ended unsuccessfully (build directory: /dev/shm/OpenMPI/4.1.4/GCC-12.2.0): build failed (first 300 chars): Sanity check failed: sanity check command OMPI_MCA_rmaps_base_oversubscribe=1 mpirun -n 8 /dev/shm/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_hello_c exited with code 1 (output: --
A requested component was not found, or was unable to be (took 8 mins 48 secs)
== Results of the build can be found in the log file(s) /tmp/eb-watuyqhw/easybuild-OpenMPI-4.1.4-20230926.080727.GEZtD.log
ERROR: Build of /home/modules/software/EasyBuild/4.8.1/easybuild/easyconfigs/o/OpenMPI/OpenMPI-4.1.4-GCC-12.2.0.eb failed (err: 'build failed (first 300 chars): Sanity check failed: sanity check command OMPI_MCA_rmaps_base_oversubscribe=1 mpirun -n 8 /dev/shm/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_hello_c exited with code 1 (output: --\nA requested component was not found, or was unable to be')

The log file shows messages about missing components:

(lines deleted)
--
[e000.nifl.fysik.dtu.dk:1849636] PML cm cannot be selected
[e000.nifl.fysik.dtu.dk:1849635] PML cm cannot be selected
[e000.nifl.fysik.dtu.dk:1849626] 2 more processes have sent help message help-mca-base.txt / find-available:not-valid
[e000.nifl.fysik.dtu.dk:1849626] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[e000.nifl.fysik.dtu.dk:1849626] 1 more process has sent help message help-mca-base.txt / find-available:none found
)
sanity check command mpirun -n 1 /dev/shm/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_hello_usempi exited with code 1 (output: --
A requested component was not found, or was unable to be opened. This means that this component is either not installed or is unable to be used on your system (e.g., sometimes this means that shared libraries that the component requires are unable to be found/loaded). Note that Open MPI stopped checking at the first component that it did not find.

Host: e000.nifl.fysik.dtu.dk
Framework: mtl
Component: psm2
--
--
No components were able to be opened in the pml framework.

This typically means that either no components of this type were installed, or none of the installed components can be loaded. Sometimes this means that shared libraries required by these components are unable to be found/loaded.

Host: e000
Framework: pml
--
[e000.nifl.fysik.dtu.dk:1849661] PML cm cannot be selected
) (at easybuild/framework/easyblock.py:3655 in _sanity_check_step)
== 2023-09-26 08:16:16,111 build_log.py:267 INFO ... (took 5 secs)
== 2023-09-26 08:16:16,111 filetools.py:2012 INFO Removing lock /home/modules/software/.locks/_home_modules_software_OpenMPI_4.1.4-GCC-12.2.0.lock...
== 2023-09-26 08:16:16,112 filetools.py:383 INFO Path /home/modules/software/.locks/_home_modules_software_OpenMPI_4.1.4-GCC-12.2.0.lock successfully removed.
== 2023-09-26 08:16:16,112 filetools.py:2016 INFO Lock removed: /home/modules/software/.locks/_home_modules_software_OpenMPI_4.1.4-GCC-12.2.0.lock
== 2023-09-26 08:16:16,112 easyblock.py:4277 WARNING build failed (first 300 chars): Sanity check failed: sanity check command OMPI_MCA_rmaps_base_oversubscribe=1 mpirun -n 8 /dev/shm/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_hello_c exited with code 1 (output: -- A requested component was not found, or was unable to be
== 2023-09-26 08:16:16,112 easyblock.py:328 INFO Closing log
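To dig into a failure like this outside EasyBuild, the failing sanity check command can be re-run by hand. The sketch below is a hypothetical helper (the names `sanity_check_command`, `missing_components` and `rerun` are made up for illustration): it mirrors the `OMPI_MCA_rmaps_base_oversubscribe=1 mpirun -n 8 mpi_test_hello_c` command from the log, adds `OMPI_MCA_orte_base_help_aggregate=0` as the Open MPI message itself suggests, and collects the `Framework:`/`Component:` pairs Open MPI reports as unopenable. It assumes the `mpirun` from the new OpenMPI installation is first on `PATH` and that the test binary still exists in the build directory.

```python
import os
import re
import subprocess

# Matches the "Framework: X ... Component: Y" lines Open MPI prints when a
# component cannot be opened (whitespace may include newlines).
_COMPONENT_RE = re.compile(r"Framework:\s*(\S+)\s+Component:\s*(\S+)")


def sanity_check_command(test_binary, nprocs=8):
    """Return (argv, env) mirroring EasyBuild's failing sanity check command.

    orte_base_help_aggregate=0 is added so every help/error message is
    printed instead of being aggregated across processes.
    """
    env = dict(os.environ,
               OMPI_MCA_rmaps_base_oversubscribe="1",
               OMPI_MCA_orte_base_help_aggregate="0")
    argv = ["mpirun", "-n", str(nprocs), test_binary]
    return argv, env


def missing_components(output):
    """Return unique (framework, component) pairs, in order of appearance."""
    seen = []
    for pair in _COMPONENT_RE.findall(output):
        if pair not in seen:
            seen.append(pair)
    return seen


def rerun(test_binary, nprocs=8):
    """Run the command; return (exit code, missing components) from all output."""
    argv, env = sanity_check_command(test_binary, nprocs)
    result = subprocess.run(argv, env=env, capture_output=True, text=True)
    return result.returncode, missing_components(result.stdout + result.stderr)
```

For the node above, `rerun("/dev/shm/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_hello_c")` should list the mtl/psm2 pair seen in the log if the same failure reproduces. The helper only reports what Open MPI prints; it does not decide which configure option to change.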