It could be a Slurm issue, but I'm seeing one thing that makes me suspect this might be a problem that has been reported elsewhere.

Andrej - what version of Slurm are you using here?
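(Any of the standard Slurm client tools should report it, e.g.:

$ srun --version
$ sinfo --version
)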
> On Feb 1, 2021, at 5:34 PM, Gilles Gouaillardet via devel <devel@lists.open-mpi.org> wrote:
>
> Andrej,
>
> that really looks like a SLURM issue that does not involve Open MPI
>
> In order to confirm, you can
>
> $ salloc -N 2 -n 2
> /* and then from the allocation */
> srun hostname
>
> If this does not work, then this is a SLURM issue you have to fix.
> Once fixed, I am confident Open MPI will just work
>
> Cheers,
>
> Gilles
>
> On Tue, Feb 2, 2021 at 10:22 AM Andrej Prsa via devel <devel@lists.open-mpi.org> wrote:
>>
>> Hi Gilles,
>>
>>> Here is what you can try
>>>
>>> $ salloc -N 4 -n 384
>>> /* and then from the allocation */
>>>
>>> $ srun -n 1 orted
>>> /* that should fail, but the error message can be helpful */
>>>
>>> $ /usr/local/bin/mpirun --mca plm slurm --mca plm_base_verbose 10 true
>>
>> andrej@terra:~/system/tests/MPI$ salloc -N 4 -n 384
>> salloc: Granted job allocation 837
>> andrej@terra:~/system/tests/MPI$ srun -n 1 orted
>> srun: Warning: can't run 1 processes on 4 nodes, setting nnodes to 1
>> srun: launch/slurm: launch_p_step_launch: StepId=837.0 aborted before step completely launched.
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> srun: error: task 0 launch failed: Unspecified error
>> andrej@terra:~/system/tests/MPI$ /usr/local/bin/mpirun -mca plm slurm -mca plm_base_verbose 10 true
>> [terra:179991] mca: base: components_register: registering framework plm components
>> [terra:179991] mca: base: components_register: found loaded component slurm
>> [terra:179991] mca: base: components_register: component slurm register function successful
>> [terra:179991] mca: base: components_open: opening plm components
>> [terra:179991] mca: base: components_open: found loaded component slurm
>> [terra:179991] mca: base: components_open: component slurm open function successful
>> [terra:179991] mca:base:select: Auto-selecting plm components
>> [terra:179991] mca:base:select:( plm) Querying component [slurm]
>> [terra:179991] [[INVALID],INVALID] plm:slurm: available for selection
>> [terra:179991] mca:base:select:( plm) Query of component [slurm] set priority to 75
>> [terra:179991] mca:base:select:( plm) Selected component [slurm]
>> [terra:179991] plm:base:set_hnp_name: initial bias 179991 nodename hash 2928217987
>> [terra:179991] plm:base:set_hnp_name: final jobfam 7711
>> [terra:179991] [[7711,0],0] plm:base:receive start comm
>> [terra:179991] [[7711,0],0] plm:base:setup_job
>> [terra:179991] [[7711,0],0] plm:slurm: LAUNCH DAEMONS CALLED
>> [terra:179991] [[7711,0],0] plm:base:setup_vm
>> [terra:179991] [[7711,0],0] plm:base:setup_vm creating map
>> [terra:179991] [[7711,0],0] plm:base:setup_vm add new daemon [[7711,0],1]
>> [terra:179991] [[7711,0],0] plm:base:setup_vm assigning new daemon [[7711,0],1] to node node9
>> [terra:179991] [[7711,0],0] plm:base:setup_vm add new daemon [[7711,0],2]
>> [terra:179991] [[7711,0],0] plm:base:setup_vm assigning new daemon [[7711,0],2] to node node10
>> [terra:179991] [[7711,0],0] plm:base:setup_vm add new daemon [[7711,0],3]
>> [terra:179991] [[7711,0],0] plm:base:setup_vm assigning new daemon [[7711,0],3] to node node11
>> [terra:179991] [[7711,0],0] plm:base:setup_vm add new daemon [[7711,0],4]
>> [terra:179991] [[7711,0],0] plm:base:setup_vm assigning new daemon [[7711,0],4] to node node12
>> [terra:179991] [[7711,0],0] plm:slurm: launching on nodes node9,node10,node11,node12
>> [terra:179991] [[7711,0],0] plm:slurm: Set prefix:/usr/local
>> [terra:179991] [[7711,0],0] plm:slurm: final top-level argv:
>> srun --ntasks-per-node=1 --kill-on-bad-exit --ntasks=4 orted -mca ess "slurm" -mca ess_base_jobid "505348096" -mca ess_base_vpid "1" -mca ess_base_num_procs "5" -mca orte_node_regex "terra,node[1:9],node[2:10-12]@0(5)" -mca orte_hnp_uri "505348096.0;tcp://10.9.2.10,192.168.1.1:38995" -mca plm_base_verbose "10"
>> [terra:179991] [[7711,0],0] plm:slurm: reset PATH: /usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
>> [terra:179991] [[7711,0],0] plm:slurm: reset LD_LIBRARY_PATH: /usr/local/lib
>> srun: launch/slurm: launch_p_step_launch: StepId=837.1 aborted before step completely launched.
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> srun: error: task 3 launch failed: Unspecified error
>> srun: error: task 1 launch failed: Unspecified error
>> srun: error: task 2 launch failed: Unspecified error
>> srun: error: task 0 launch failed: Unspecified error
>> [terra:179991] [[7711,0],0] plm:slurm: primary daemons complete!
>> [terra:179991] [[7711,0],0] plm:base:receive stop comm
>> [terra:179991] mca: base: close: component slurm closed
>> [terra:179991] mca: base: close: unloading component slurm
>>
>> This is what I'm seeing in slurmctld.log:
>>
>> [2021-02-01T20:15:18.358] sched: _slurm_rpc_allocate_resources JobId=837 NodeList=node[9-12] usec=537
>> [2021-02-01T20:15:26.815] error: mpi_hook_slurmstepd_prefork failure for 0x557ce5b92960s on node9
>> [2021-02-01T20:15:59.621] error: mpi_hook_slurmstepd_prefork failure for 0x55cc6c89a7e0s on node12
>> [2021-02-01T20:15:59.621] error: mpi_hook_slurmstepd_prefork failure for 0x55b7b8b467e0s on node10
>> [2021-02-01T20:15:59.622] error: mpi_hook_slurmstepd_prefork failure for 0x55f8cd69a7e0s on node11
>> [2021-02-01T20:15:59.628] error: mpi_hook_slurmstepd_prefork failure for 0x5555b45bc7e0s on node9
>>
>> And this is in slurmd.node9.log (and similar for the remaining 3 nodes):
>>
>> [2021-02-01T20:15:59.592] task/affinity: lllp_distribution: JobId=837 manual binding: none
>> [2021-02-01T20:15:59.624] [837.1] error: node9 [0] pmixp_client_v2.c:246 [pmixp_lib_init] mpi/pmix: ERROR: PMIx_server_init failed with error -2
>> : Success (0)
>> [2021-02-01T20:15:59.624] [837.1] error: node9 [0] pmixp_client.c:518 [pmixp_libpmix_init] mpi/pmix: ERROR: PMIx_server_init failed with error -1
>> : Success (0)
>> [2021-02-01T20:15:59.624] [837.1] error: node9 [0] pmixp_server.c:423 [pmixp_stepd_init] mpi/pmix: ERROR: pmixp_libpmix_init() failed
>> [2021-02-01T20:15:59.624] [837.1] error: (null) [0] mpi_pmix.c:169 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
>> [2021-02-01T20:15:59.627] [837.1] error: Failed mpi_hook_slurmstepd_prefork
>> [2021-02-01T20:15:59.650] [837.1] error: job_manager exiting abnormally, rc = -1
>> [2021-02-01T20:16:02.000] [837.1] done with job
>>
>> Cheers,
>> Andrej
>>
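
One more data point that may be worth collecting: the slurmd logs above show Slurm's own mpi/pmix plugin failing in PMIx_server_init before any Open MPI code is involved, and one common cause of that is a mismatch between the PMIx library the plugin was built against and the one installed on the compute nodes. Listing the MPI plugin types this Slurm build provides should at least show which pmix plugin versions srun has available, e.g.:

$ srun --mpi=list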