Hi Gilles,

andrej@terra:~/system/tests/MPI$ salloc -N 4 -n 384
salloc: Granted job allocation 836
andrej@terra:~/system/tests/MPI$ env | grep ^SLURM_
SLURM_TASKS_PER_NODE=96(x4)
SLURM_SUBMIT_DIR=/home/users/andrej/system/tests/MPI
SLURM_NODE_ALIASES=(null)
SLURM_CLUSTER_NAME=terra
SLURM_JOB_CPUS_PER_NODE=96(x4)
SLURM_JOB_PARTITION=intel96
SLURM_JOB_NUM_NODES=4
SLURM_JOBID=836
SLURM_JOB_QOS=normal
SLURM_NTASKS=384
SLURM_NODELIST=node[9-12]
SLURM_NPROCS=384
SLURM_NNODES=4
SLURM_SUBMIT_HOST=terra
SLURM_JOB_ID=836
SLURM_CONF=/etc/slurm.conf
SLURM_JOB_NAME=interactive
SLURM_JOB_NODELIST=node[9-12]
andrej@terra:~/system/tests/MPI$ mpirun -mca plm_base_verbose 10 -mca plt slurm true
[terra:177267] mca: base: components_register: registering framework plm components
[terra:177267] mca: base: components_register: found loaded component isolated
[terra:177267] mca: base: components_register: component isolated has no register or open function
[terra:177267] mca: base: components_register: found loaded component rsh
[terra:177267] mca: base: components_register: component rsh register function successful
[terra:177267] mca: base: components_register: found loaded component slurm
[terra:177267] mca: base: components_register: component slurm register function successful
[terra:177267] mca: base: components_open: opening plm components
[terra:177267] mca: base: components_open: found loaded component isolated
[terra:177267] mca: base: components_open: component isolated open function successful
[terra:177267] mca: base: components_open: found loaded component rsh
[terra:177267] mca: base: components_open: component rsh open function successful
[terra:177267] mca: base: components_open: found loaded component slurm
[terra:177267] mca: base: components_open: component slurm open function successful
[terra:177267] mca:base:select: Auto-selecting plm components
[terra:177267] mca:base:select:(  plm) Querying component [isolated]
[terra:177267] mca:base:select:(  plm) Query of component [isolated] set priority to 0
[terra:177267] mca:base:select:(  plm) Querying component [rsh]
[terra:177267] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[terra:177267] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[terra:177267] mca:base:select:(  plm) Querying component [slurm]
[terra:177267] [[INVALID],INVALID] plm:slurm: available for selection
[terra:177267] mca:base:select:(  plm) Query of component [slurm] set priority to 75
[terra:177267] mca:base:select:(  plm) Selected component [slurm]
[terra:177267] mca: base: close: component isolated closed
[terra:177267] mca: base: close: unloading component isolated
[terra:177267] mca: base: close: component rsh closed
[terra:177267] mca: base: close: unloading component rsh
[terra:177267] plm:base:set_hnp_name: initial bias 177267 nodename hash 2928217987
[terra:177267] plm:base:set_hnp_name: final jobfam 5499
[terra:177267] [[5499,0],0] plm:base:receive start comm
[terra:177267] [[5499,0],0] plm:base:setup_job
[terra:177267] [[5499,0],0] plm:slurm: LAUNCH DAEMONS CALLED
[terra:177267] [[5499,0],0] plm:base:setup_vm
[terra:177267] [[5499,0],0] plm:base:setup_vm creating map
[terra:177267] [[5499,0],0] plm:base:setup_vm add new daemon [[5499,0],1]
[terra:177267] [[5499,0],0] plm:base:setup_vm assigning new daemon [[5499,0],1] to node node9
[terra:177267] [[5499,0],0] plm:base:setup_vm add new daemon [[5499,0],2]
[terra:177267] [[5499,0],0] plm:base:setup_vm assigning new daemon [[5499,0],2] to node node10
[terra:177267] [[5499,0],0] plm:base:setup_vm add new daemon [[5499,0],3]
[terra:177267] [[5499,0],0] plm:base:setup_vm assigning new daemon [[5499,0],3] to node node11
[terra:177267] [[5499,0],0] plm:base:setup_vm add new daemon [[5499,0],4]
[terra:177267] [[5499,0],0] plm:base:setup_vm assigning new daemon [[5499,0],4] to node node12
[terra:177267] [[5499,0],0] plm:slurm: launching on nodes node9,node10,node11,node12
[terra:177267] [[5499,0],0] plm:slurm: final top-level argv:
    srun --ntasks-per-node=1 --kill-on-bad-exit --ntasks=4 orted -mca ess "slurm" -mca ess_base_jobid "360382464" -mca ess_base_vpid "1" -mca ess_base_num_procs "5" -mca orte_node_regex "terra,node[1:9],node[2:10-12]@0(5)" -mca orte_hnp_uri "360382464.0;tcp://10.9.2.10,192.168.1.1:45597" -mca plm_base_verbose "10" -mca plt "slurm"
srun: launch/slurm: launch_p_step_launch: StepId=836.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: task 0 launch failed: Unspecified error
srun: error: task 3 launch failed: Unspecified error
srun: error: task 2 launch failed: Unspecified error
srun: error: task 1 launch failed: Unspecified error
[terra:177267] [[5499,0],0] plm:slurm: primary daemons complete!
[terra:177267] [[5499,0],0] plm:base:receive stop comm
[terra:177267] mca: base: close: component slurm closed
[terra:177267] mca: base: close: unloading component slurm
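
If it would help, I can also check whether srun on its own can launch across these nodes from inside the same allocation (a quick sanity test, independent of Open MPI):

$ srun -N 4 --ntasks-per-node=1 hostname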

Thanks, as always,
Andrej


On 2/1/21 7:50 PM, Gilles Gouaillardet via devel wrote:
Andrej,

you *have* to invoke
mpirun --mca plm slurm ...
from within a SLURM allocation, and the SLURM_* environment variables
should have been set by SLURM
(otherwise, this is a SLURM error outside the scope of Open MPI).

Here is what you can try (and send the logs if that fails):

$ salloc -N 4 -n 384
and once you get the allocation
$ env | grep ^SLURM_
$ mpirun --mca plm_base_verbose 10 --mca plm slurm true
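
If you usually submit with sbatch (your run.sh), the equivalent batch script would be roughly:

#!/bin/bash
#SBATCH -N 4
#SBATCH -n 384
# the SLURM_* variables should be set inside the job
env | grep ^SLURM_
mpirun --mca plm_base_verbose 10 --mca plm slurm true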


Cheers,

Gilles

On Tue, Feb 2, 2021 at 9:27 AM Andrej Prsa via devel
<devel@lists.open-mpi.org> wrote:
Hi Gilles,

I can reproduce this behavior ... when running outside of a slurm allocation.
I just tried from slurm (sbatch run.sh) and I get the exact same error.

What does
$ env | grep ^SLURM_
report?
Empty; no environment variables have been defined.

Thanks,
Andrej

