Here is what you can try:

$ salloc -N 4 -n 384
# and then, from within the allocation:

$ srun -n 1 orted
# this should fail, but the error message can be helpful
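
If it dies with "command not found" instead, a quick sanity check
(assuming orted should be in the default PATH on the compute nodes) is

$ srun -N 4 which orted
# all four nodes should print the same orted path; if not, the compute
# nodes cannot see the Open MPI install at all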

$ /usr/local/bin/mpirun --mca plm slurm --mca plm_base_verbose 10 true
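
It can also help to rule out srun itself with something that does not
involve Open MPI at all:

$ srun -N 4 hostname
# should print node9 through node12; if even this fails with
# "task ... launch failed", the problem is on the slurmd side (the
# slurmd logs on node[9-12] are worth a look) rather than in mpirun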
Cheers,

Gilles

On Tue, Feb 2, 2021 at 10:03 AM Andrej Prsa via devel
<devel@lists.open-mpi.org> wrote:
>
> Hi Gilles,
>
> andrej@terra:~/system/tests/MPI$ salloc -N 4 -n 384
> salloc: Granted job allocation 836
> andrej@terra:~/system/tests/MPI$ env | grep ^SLURM_
> SLURM_TASKS_PER_NODE=96(x4)
> SLURM_SUBMIT_DIR=/home/users/andrej/system/tests/MPI
> SLURM_NODE_ALIASES=(null)
> SLURM_CLUSTER_NAME=terra
> SLURM_JOB_CPUS_PER_NODE=96(x4)
> SLURM_JOB_PARTITION=intel96
> SLURM_JOB_NUM_NODES=4
> SLURM_JOBID=836
> SLURM_JOB_QOS=normal
> SLURM_NTASKS=384
> SLURM_NODELIST=node[9-12]
> SLURM_NPROCS=384
> SLURM_NNODES=4
> SLURM_SUBMIT_HOST=terra
> SLURM_JOB_ID=836
> SLURM_CONF=/etc/slurm.conf
> SLURM_JOB_NAME=interactive
> SLURM_JOB_NODELIST=node[9-12]
> andrej@terra:~/system/tests/MPI$ mpirun -mca plm_base_verbose 10 -mca
> plt slurm true
> [terra:177267] mca: base: components_register: registering framework plm
> components
> [terra:177267] mca: base: components_register: found loaded component
> isolated
> [terra:177267] mca: base: components_register: component isolated has no
> register or open function
> [terra:177267] mca: base: components_register: found loaded component rsh
> [terra:177267] mca: base: components_register: component rsh register
> function successful
> [terra:177267] mca: base: components_register: found loaded component slurm
> [terra:177267] mca: base: components_register: component slurm register
> function successful
> [terra:177267] mca: base: components_open: opening plm components
> [terra:177267] mca: base: components_open: found loaded component isolated
> [terra:177267] mca: base: components_open: component isolated open
> function successful
> [terra:177267] mca: base: components_open: found loaded component rsh
> [terra:177267] mca: base: components_open: component rsh open function
> successful
> [terra:177267] mca: base: components_open: found loaded component slurm
> [terra:177267] mca: base: components_open: component slurm open function
> successful
> [terra:177267] mca:base:select: Auto-selecting plm components
> [terra:177267] mca:base:select:(  plm) Querying component [isolated]
> [terra:177267] mca:base:select:(  plm) Query of component [isolated] set
> priority to 0
> [terra:177267] mca:base:select:(  plm) Querying component [rsh]
> [terra:177267] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh
> path NULL
> [terra:177267] mca:base:select:(  plm) Query of component [rsh] set
> priority to 10
> [terra:177267] mca:base:select:(  plm) Querying component [slurm]
> [terra:177267] [[INVALID],INVALID] plm:slurm: available for selection
> [terra:177267] mca:base:select:(  plm) Query of component [slurm] set
> priority to 75
> [terra:177267] mca:base:select:(  plm) Selected component [slurm]
> [terra:177267] mca: base: close: component isolated closed
> [terra:177267] mca: base: close: unloading component isolated
> [terra:177267] mca: base: close: component rsh closed
> [terra:177267] mca: base: close: unloading component rsh
> [terra:177267] plm:base:set_hnp_name: initial bias 177267 nodename hash
> 2928217987
> [terra:177267] plm:base:set_hnp_name: final jobfam 5499
> [terra:177267] [[5499,0],0] plm:base:receive start comm
> [terra:177267] [[5499,0],0] plm:base:setup_job
> [terra:177267] [[5499,0],0] plm:slurm: LAUNCH DAEMONS CALLED
> [terra:177267] [[5499,0],0] plm:base:setup_vm
> [terra:177267] [[5499,0],0] plm:base:setup_vm creating map
> [terra:177267] [[5499,0],0] plm:base:setup_vm add new daemon [[5499,0],1]
> [terra:177267] [[5499,0],0] plm:base:setup_vm assigning new daemon
> [[5499,0],1] to node node9
> [terra:177267] [[5499,0],0] plm:base:setup_vm add new daemon [[5499,0],2]
> [terra:177267] [[5499,0],0] plm:base:setup_vm assigning new daemon
> [[5499,0],2] to node node10
> [terra:177267] [[5499,0],0] plm:base:setup_vm add new daemon [[5499,0],3]
> [terra:177267] [[5499,0],0] plm:base:setup_vm assigning new daemon
> [[5499,0],3] to node node11
> [terra:177267] [[5499,0],0] plm:base:setup_vm add new daemon [[5499,0],4]
> [terra:177267] [[5499,0],0] plm:base:setup_vm assigning new daemon
> [[5499,0],4] to node node12
> [terra:177267] [[5499,0],0] plm:slurm: launching on nodes
> node9,node10,node11,node12
> [terra:177267] [[5499,0],0] plm:slurm: final top-level argv:
>      srun --ntasks-per-node=1 --kill-on-bad-exit --ntasks=4 orted -mca
> ess "slurm" -mca ess_base_jobid "360382464" -mca ess_base_vpid "1" -mca
> ess_base_num_procs "5" -mca orte_node_regex
> "terra,node[1:9],node[2:10-12]@0(5)" -mca orte_hnp_uri
> "360382464.0;tcp://10.9.2.10,192.168.1.1:45597" -mca plm_base_verbose
> "10" -mca plt "slurm"
> srun: launch/slurm: launch_p_step_launch: StepId=836.0 aborted before
> step completely launched.
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: task 0 launch failed: Unspecified error
> srun: error: task 3 launch failed: Unspecified error
> srun: error: task 2 launch failed: Unspecified error
> srun: error: task 1 launch failed: Unspecified error
> [terra:177267] [[5499,0],0] plm:slurm: primary daemons complete!
> [terra:177267] [[5499,0],0] plm:base:receive stop comm
> [terra:177267] mca: base: close: component slurm closed
> [terra:177267] mca: base: close: unloading component slurm
>
> Thanks, as always,
> Andrej
>
>
> On 2/1/21 7:50 PM, Gilles Gouaillardet via devel wrote:
> > Andrej,
> >
> > you *have* to invoke
> > mpirun --mca plm slurm ...
> > from a SLURM allocation, and SLURM_* environment variables should have
> > been set by SLURM
> > (otherwise, this is a SLURM error out of the scope of Open MPI).
> >
> > Here is what you can try (and send the logs if that fails)
> >
> > $ salloc -N 4 -n 384
> > and once you get the allocation
> > $ env | grep ^SLURM_
> > $ mpirun --mca plm_base_verbose 10 --mca plm slurm true
> >
> >
> > Cheers,
> >
> > Gilles
> >
> > On Tue, Feb 2, 2021 at 9:27 AM Andrej Prsa via devel
> > <devel@lists.open-mpi.org> wrote:
> >> Hi Gilles,
> >>
> >>> I can reproduce this behavior ... when running outside of a slurm 
> >>> allocation.
> >> I just tried from slurm (sbatch run.sh) and I get the exact same error.
> >>
> >>> What does
> >>> $ env | grep ^SLURM_
> >>> report?
> >> Empty; no SLURM_ environment variables are set.
> >>
> >> Thanks,
> >> Andrej
> >>
>
