Here is what you can try

$ salloc -N 4 -n 384

/* and then from the allocation */
$ srun -n 1 orted

/* that should fail, but the error message can be helpful */

$ /usr/local/bin/mpirun --mca plm slurm --mca plm_base_verbose 10 true


Cheers

Gilles

On Tue, Feb 2, 2021 at 10:03 AM Andrej Prsa via devel
<devel@lists.open-mpi.org> wrote:
>
> Hi Gilles,
>
> andrej@terra:~/system/tests/MPI$ salloc -N 4 -n 384
> salloc: Granted job allocation 836
> andrej@terra:~/system/tests/MPI$ env | grep ^SLURM_
> SLURM_TASKS_PER_NODE=96(x4)
> SLURM_SUBMIT_DIR=/home/users/andrej/system/tests/MPI
> SLURM_NODE_ALIASES=(null)
> SLURM_CLUSTER_NAME=terra
> SLURM_JOB_CPUS_PER_NODE=96(x4)
> SLURM_JOB_PARTITION=intel96
> SLURM_JOB_NUM_NODES=4
> SLURM_JOBID=836
> SLURM_JOB_QOS=normal
> SLURM_NTASKS=384
> SLURM_NODELIST=node[9-12]
> SLURM_NPROCS=384
> SLURM_NNODES=4
> SLURM_SUBMIT_HOST=terra
> SLURM_JOB_ID=836
> SLURM_CONF=/etc/slurm.conf
> SLURM_JOB_NAME=interactive
> SLURM_JOB_NODELIST=node[9-12]
> andrej@terra:~/system/tests/MPI$ mpirun -mca plm_base_verbose 10 -mca
> plt slurm true
> [terra:177267] mca: base: components_register: registering framework plm
> components
> [terra:177267] mca: base: components_register: found loaded component
> isolated
> [terra:177267] mca: base: components_register: component isolated has no
> register or open function
> [terra:177267] mca: base: components_register: found loaded component rsh
> [terra:177267] mca: base: components_register: component rsh register
> function successful
> [terra:177267] mca: base: components_register: found loaded component slurm
> [terra:177267] mca: base: components_register: component slurm register
> function successful
> [terra:177267] mca: base: components_open: opening plm components
> [terra:177267] mca: base: components_open: found loaded component isolated
> [terra:177267] mca: base: components_open: component isolated open
> function successful
> [terra:177267] mca: base: components_open: found loaded component rsh
> [terra:177267] mca: base: components_open: component rsh open function
> successful
> [terra:177267] mca: base: components_open: found loaded component slurm
> [terra:177267] mca: base: components_open: component slurm open function
> successful
> [terra:177267] mca:base:select: Auto-selecting plm components
> [terra:177267] mca:base:select:( plm) Querying component [isolated]
> [terra:177267] mca:base:select:( plm) Query of component [isolated] set
> priority to 0
> [terra:177267] mca:base:select:( plm) Querying component [rsh]
> [terra:177267] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh
> path NULL
> [terra:177267] mca:base:select:( plm) Query of component [rsh] set
> priority to 10
> [terra:177267] mca:base:select:( plm) Querying component [slurm]
> [terra:177267] [[INVALID],INVALID] plm:slurm: available for selection
> [terra:177267] mca:base:select:( plm) Query of component [slurm] set
> priority to 75
> [terra:177267] mca:base:select:( plm) Selected component [slurm]
> [terra:177267] mca: base: close: component isolated closed
> [terra:177267] mca: base: close: unloading component isolated
> [terra:177267] mca: base: close: component rsh closed
> [terra:177267] mca: base: close: unloading component rsh
> [terra:177267] plm:base:set_hnp_name: initial bias 177267 nodename hash
> 2928217987
> [terra:177267] plm:base:set_hnp_name: final jobfam 5499
> [terra:177267] [[5499,0],0] plm:base:receive start comm
> [terra:177267] [[5499,0],0] plm:base:setup_job
> [terra:177267] [[5499,0],0] plm:slurm: LAUNCH DAEMONS CALLED
> [terra:177267] [[5499,0],0] plm:base:setup_vm
> [terra:177267] [[5499,0],0] plm:base:setup_vm creating map
> [terra:177267] [[5499,0],0] plm:base:setup_vm add new daemon [[5499,0],1]
> [terra:177267] [[5499,0],0] plm:base:setup_vm assigning new daemon
> [[5499,0],1] to node node9
> [terra:177267] [[5499,0],0] plm:base:setup_vm add new daemon [[5499,0],2]
> [terra:177267] [[5499,0],0] plm:base:setup_vm assigning new daemon
> [[5499,0],2] to node node10
> [terra:177267] [[5499,0],0] plm:base:setup_vm add new daemon [[5499,0],3]
> [terra:177267] [[5499,0],0] plm:base:setup_vm assigning new daemon
> [[5499,0],3] to node node11
> [terra:177267] [[5499,0],0] plm:base:setup_vm add new daemon [[5499,0],4]
> [terra:177267] [[5499,0],0] plm:base:setup_vm assigning new daemon
> [[5499,0],4] to node node12
> [terra:177267] [[5499,0],0] plm:slurm: launching on nodes
> node9,node10,node11,node12
> [terra:177267] [[5499,0],0] plm:slurm: final top-level argv:
> srun --ntasks-per-node=1 --kill-on-bad-exit --ntasks=4 orted -mca
> ess "slurm" -mca ess_base_jobid "360382464" -mca ess_base_vpid "1" -mca
> ess_base_num_procs "5" -mca orte_node_regex
> "terra,node[1:9],node[2:10-12]@0(5)" -mca orte_hnp_uri
> "360382464.0;tcp://10.9.2.10,192.168.1.1:45597" -mca plm_base_verbose
> "10" -mca plt "slurm"
> srun: launch/slurm: launch_p_step_launch: StepId=836.0 aborted before
> step completely launched.
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: task 0 launch failed: Unspecified error
> srun: error: task 3 launch failed: Unspecified error
> srun: error: task 2 launch failed: Unspecified error
> srun: error: task 1 launch failed: Unspecified error
> [terra:177267] [[5499,0],0] plm:slurm: primary daemons complete!
> [terra:177267] [[5499,0],0] plm:base:receive stop comm
> [terra:177267] mca: base: close: component slurm closed
> [terra:177267] mca: base: close: unloading component slurm
>
> Thanks, as always,
> Andrej
>
>
> On 2/1/21 7:50 PM, Gilles Gouaillardet via devel wrote:
> > Andrej,
> >
> > you *have* to invoke
> > mpirun --mca plm slurm ...
> > from a SLURM allocation, and SLURM_* environment variables should have
> > been set by SLURM
> > (otherwise, this is a SLURM error out of the scope of Open MPI).
> >
> > Here is what you can try (and send the logs if that fails)
> >
> > $ salloc -N 4 -n 384
> > and once you get the allocation
> > $ env | grep ^SLURM_
> > $ mpirun --mca plm_base_verbose 10 --mca plm slurm true
> >
> >
> > Cheers,
> >
> > Gilles
> >
> > On Tue, Feb 2, 2021 at 9:27 AM Andrej Prsa via devel
> > <devel@lists.open-mpi.org> wrote:
> >> Hi Gilles,
> >>
> >>> I can reproduce this behavior ... when running outside of a slurm
> >>> allocation.
> >> I just tried from slurm (sbatch run.sh) and I get the exact same error.
> >>
> >>> What does
> >>> $ env | grep ^SLURM_
> >>> reports?
> >> Empty; no environment variables have been defined.
> >>
> >> Thanks,
> >> Andrej
> >>
>
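
For completeness, a minimal sketch of the same diagnostic wrapped in a batch script, in case it is easier to capture the output non-interactively. The partition name intel96 and the node/task counts are taken from Andrej's log above; the script name, the sleep-free layout, and the /usr/local/bin/mpirun path are assumptions to be adjusted for the local install:

#!/bin/bash
#SBATCH --job-name=plm-slurm-test
#SBATCH --partition=intel96
#SBATCH --nodes=4
#SBATCH --ntasks=384

# confirm the SLURM_* environment is actually visible inside the allocation
env | grep ^SLURM_

# launching a single orted by hand is expected to fail,
# but the error message it prints is what we are after
srun -n 1 orted

# then let mpirun drive the launch through the slurm plm with full verbosity
/usr/local/bin/mpirun --mca plm slurm --mca plm_base_verbose 10 true

Submitting it with sbatch run.sh and looking at the resulting slurm-<jobid>.out should show the same srun error text as the interactive run.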