Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems
Interesting - good to know. Thanks

On Feb 12, 2014, at 10:38 AM, Adrian Reber wrote:

> It seems this is indeed a Moab bug for interactive jobs. At least a bug
> was opened against moab. Using non-interactive jobs the variables have
> the correct values and mpirun has no problems detecting the correct
> number of cores.
Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems
It seems this is indeed a Moab bug for interactive jobs. At least a bug
was opened against moab. Using non-interactive jobs the variables have
the correct values and mpirun has no problems detecting the correct
number of cores.

On Wed, Feb 12, 2014 at 07:50:40AM -0800, Ralph Castain wrote:
> Another possibility to check - it is entirely possible that Moab is
> miscommunicating the values to Slurm. You might need to check it - I'll
> install a copy of 2.6.5 on my machines and see if I get similar issues when
> Slurm does the allocation itself.
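For comparison, the non-interactive case mentioned above can be reproduced
with a small batch script (a sketch; check-env.sh is a made-up name, and the
resource request mirrors the msub line discussed in this thread):

  $ cat check-env.sh
  #!/bin/bash
  # print the variables discussed in this thread
  echo "NNODES=$SLURM_NNODES NTASKS=$SLURM_NTASKS"
  echo "TASKS_PER_NODE=$SLURM_TASKS_PER_NODE JOB_CPUS_PER_NODE=$SLURM_JOB_CPUS_PER_NODE"
  $ msub -l nodes=3:ppn=8 check-env.sh    # or: sbatch -N 3 --ntasks-per-node=8 check-env.sh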
Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems
Another possibility to check - it is entirely possible that Moab is
miscommunicating the values to Slurm. You might need to check it - I'll
install a copy of 2.6.5 on my machines and see if I get similar issues when
Slurm does the allocation itself.

On Feb 12, 2014, at 7:47 AM, Ralph Castain wrote:

>> The information in *_NODELIST seems to make sense, but all the other
>> variables (PROCS, TASKS, NODES) report '1', which seems wrong.
>
> Indeed - and that's the problem. Slurm 2.6.5 is the most recent release, and
> my guess is that SchedMD once again has changed the @$!#%#@ meaning of their
> envars.
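One way to run the "Slurm does the allocation itself" check is to request the
same shape directly with salloc, bypassing msub (a sketch; the options are
standard Slurm, and the expected values assume the job-level variables are
populated correctly):

  $ salloc -N 3 --ntasks-per-node=8 bash
  $ echo "$SLURM_NNODES / $SLURM_NTASKS / $SLURM_TASKS_PER_NODE"
  # a healthy 3x8 allocation should report something like: 3 / 24 / 8(x3)
  $ exit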
Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems
On Feb 12, 2014, at 7:32 AM, Adrian Reber wrote:

> $ msub -I -l nodes=3:ppn=8
> salloc: Job is in held state, pending scheduler release
> salloc: Pending job allocation 131828
> salloc: job 131828 queued and waiting for resources
> salloc: job 131828 has been allocated resources
> salloc: Granted job allocation 131828
> sh-4.1$ echo $SLURM_TASKS_PER_NODE
> 1
> sh-4.1$ rpm -q slurm
> slurm-2.6.5-1.el6.x86_64
> sh-4.1$ echo $SLURM_NNODES
> 1
> sh-4.1$ echo $SLURM_JOB_NODELIST
> [107-108,176]
> sh-4.1$ echo $SLURM_JOB_CPUS_PER_NODE
> 8(x3)
> sh-4.1$ echo $SLURM_NODELIST
> [107-108,176]
> sh-4.1$ echo $SLURM_NPROCS
> 1
> sh-4.1$ echo $SLURM_NTASKS
> 1
> sh-4.1$ echo $SLURM_TASKS_PER_NODE
> 1
>
> The information in *_NODELIST seems to make sense, but all the other
> variables (PROCS, TASKS, NODES) report '1', which seems wrong.

Indeed - and that's the problem. Slurm 2.6.5 is the most recent release, and
my guess is that SchedMD once again has changed the @$!#%#@ meaning of their
envars. Frankly, it is nearly impossible to track all the variants they have
created over the years.

Please check to see if someone did a little customizing on your end as
sometimes people do that to Slurm. Could also be they did something in the
Slurm config file that is causing the changed behavior.

Meantime, I'll try to ponder a potential solution in case this really is the
"latest" Slurm screwup.
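Checking for that kind of local customization does not require root access
(a sketch; scontrol show config is standard, and the grep pattern just picks
a few parameters that commonly influence interactive allocations and task
counts):

  $ scontrol show config | egrep -i 'SallocDefaultCommand|TaskPlugin|LaunchType|Prolog'
  $ rpm -V slurm    # reports files that differ from the packaged slurm-2.6.5-1.el6 versions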
Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems
$ msub -I -l nodes=3:ppn=8
salloc: Job is in held state, pending scheduler release
salloc: Pending job allocation 131828
salloc: job 131828 queued and waiting for resources
salloc: job 131828 has been allocated resources
salloc: Granted job allocation 131828
sh-4.1$ echo $SLURM_TASKS_PER_NODE
1
sh-4.1$ rpm -q slurm
slurm-2.6.5-1.el6.x86_64
sh-4.1$ echo $SLURM_NNODES
1
sh-4.1$ echo $SLURM_JOB_NODELIST
[107-108,176]
sh-4.1$ echo $SLURM_JOB_CPUS_PER_NODE
8(x3)
sh-4.1$ echo $SLURM_NODELIST
[107-108,176]
sh-4.1$ echo $SLURM_NPROCS
1
sh-4.1$ echo $SLURM_NTASKS
1
sh-4.1$ echo $SLURM_TASKS_PER_NODE
1

The information in *_NODELIST seems to make sense, but all the other
variables (PROCS, TASKS, NODES) report '1', which seems wrong.

On Wed, Feb 12, 2014 at 07:19:54AM -0800, Ralph Castain wrote:
> ...and your version of Slurm?
>
> On Feb 12, 2014, at 7:19 AM, Ralph Castain wrote:
>
>> What is your SLURM_TASKS_PER_NODE?
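For anyone less familiar with Slurm's compressed per-node notation, a small
helper makes the mismatch in the transcript above explicit (a sketch;
expand_slurm_list is made up here, and it assumes the usual
"count(xrepetitions)" format):

  # Expand Slurm's compressed per-node lists, e.g. "8(x3)" -> 8 8 8
  expand_slurm_list() {
      echo "$1" | tr ',' '\n' |
      sed 's/^\([0-9]*\)(x\([0-9]*\))$/\1 \2/' |
      while read -r count reps; do
          for _ in $(seq "${reps:-1}"); do echo "$count"; done
      done
  }

  expand_slurm_list "$SLURM_JOB_CPUS_PER_NODE"   # prints 8, 8, 8 -> 24 usable slots
  expand_slurm_list "$SLURM_TASKS_PER_NODE"      # prints 1 -> matches the "not enough slots" failure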
Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems
What is your SLURM_TASKS_PER_NODE?

On Feb 12, 2014, at 6:58 AM, Adrian Reber wrote:

> No, the system has only a few MOAB_* variables and many SLURM_*
> variables:
Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems
No, the system has only a few MOAB_* variables and many SLURM_* variables:

$BASH $BASHOPTS $BASHPID $BASH_ALIASES $BASH_ARGC $BASH_ARGV $BASH_CMDS
$BASH_COMMAND $BASH_LINENO $BASH_SOURCE $BASH_SUBSHELL $BASH_VERSINFO
$BASH_VERSION $COLUMNS $COMP_WORDBREAKS $DIRSTACK $EUID $GROUPS $HISTCMD
$HISTFILE $HISTFILESIZE $HISTSIZE $HOSTNAME $HOSTTYPE $IFS $LINENO $LINES
$MACHTYPE $MAILCHECK $MOAB_CLASS $MOAB_GROUP $MOAB_JOBID $MOAB_NODECOUNT
$MOAB_PARTITION $MOAB_PROCCOUNT $MOAB_SUBMITDIR $MOAB_USER $OPTERR $OPTIND
$OSTYPE $PATH $POSIXLY_CORRECT $PPID $PS1 $PS2 $PS4 $PWD $RANDOM $SECONDS
$SHELL $SHELLOPTS $SHLVL $SLURMD_NODENAME $SLURM_CHECKPOINT_IMAGE_DIR
$SLURM_CONF $SLURM_CPUS_ON_NODE $SLURM_DISTRIBUTION $SLURM_GTIDS $SLURM_JOBID
$SLURM_JOB_CPUS_PER_NODE $SLURM_JOB_ID $SLURM_JOB_NODELIST
$SLURM_JOB_NUM_NODES $SLURM_LAUNCH_NODE_IPADDR $SLURM_LOCALID $SLURM_NNODES
$SLURM_NODEID $SLURM_NODELIST $SLURM_NPROCS $SLURM_NTASKS
$SLURM_PRIO_PROCESS $SLURM_PROCID $SLURM_PTY_PORT $SLURM_PTY_WIN_COL
$SLURM_PTY_WIN_ROW $SLURM_SRUN_COMM_HOST $SLURM_SRUN_COMM_PORT $SLURM_STEPID
$SLURM_STEP_ID $SLURM_STEP_LAUNCHER_PORT $SLURM_STEP_NODELIST
$SLURM_STEP_NUM_NODES $SLURM_STEP_NUM_TASKS $SLURM_STEP_TASKS_PER_NODE
$SLURM_SUBMIT_DIR $SLURM_SUBMIT_HOST $SLURM_TASKS_PER_NODE $SLURM_TASK_PID
$SLURM_TOPOLOGY_ADDR $SLURM_TOPOLOGY_ADDR_PATTERN $SRUN_DEBUG $TERM $TMPDIR
$UID $_

On Wed, Feb 12, 2014 at 06:12:45AM -0800, Ralph Castain wrote:
> Seems rather odd - since this is managed by Moab, you shouldn't be seeing
> SLURM envars at all. What you should see are PBS_* envars, including a
> PBS_NODEFILE that actually contains the allocation.
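Since the Moab layer exports its own counts alongside the Slurm ones, the two
can be compared directly inside the session (a sketch; the expected values
follow from the nodes=3:ppn=8 request, and exactly what MOAB_PROCCOUNT counts
may depend on the Moab configuration):

  sh-4.1$ echo "moab : $MOAB_NODECOUNT nodes / $MOAB_PROCCOUNT procs"
  sh-4.1$ echo "slurm: $SLURM_NNODES nodes / $SLURM_NPROCS procs ($SLURM_NTASKS tasks)"
  # for -l nodes=3:ppn=8 one would expect both layers to report 3 nodes / 24 procs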
Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems
Seems rather odd - since this is managed by Moab, you shouldn't be seeing
SLURM envars at all. What you should see are PBS_* envars, including a
PBS_NODEFILE that actually contains the allocation.

On Feb 12, 2014, at 4:42 AM, Adrian Reber wrote:

> I tried the nightly snapshot (openmpi-1.7.5a1r30692.tar.gz) on a system
> with slurm and moab. I requested an interactive session using:
>
> msub -I -l nodes=3:ppn=8
>
> and started a simple test case which fails:
>
> $ mpirun -np 2 ./mpi-test 1
> There are not enough slots available in the system to satisfy the 2 slots
> that were requested by the application:
> ./mpi-test
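Which side of the Moab/Slurm integration actually populated the environment
can be checked from inside the interactive shell (a sketch; it uses only
variable names already mentioned in this thread, and on a TORQUE/Moab system
PBS_NODEFILE typically holds one line per allocated core):

  sh-4.1$ env | grep '^PBS_' | sort                  # empty here, per the variable listing above
  sh-4.1$ env | grep -c '^SLURM_'                    # dozens of SLURM_* variables instead
  sh-4.1$ [ -n "$PBS_NODEFILE" ] && wc -l "$PBS_NODEFILE"   # would show the per-core allocation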
[OMPI devel] openmpi-1.7.5a1r30692 and slurm problems
I tried the nightly snapshot (openmpi-1.7.5a1r30692.tar.gz) on a system
with slurm and moab. I requested an interactive session using:

msub -I -l nodes=3:ppn=8

and started a simple test case which fails:

$ mpirun -np 2 ./mpi-test 1
--
There are not enough slots available in the system to satisfy the 2 slots
that were requested by the application:
  ./mpi-test

Either request fewer slots for your application, or make more slots available
for use.
--
srun: error: 108: task 1: Exited with exit code 1
srun: Terminating job step 131823.4
srun: error: 107: task 0: Exited with exit code 1
srun: Job step aborted
slurmd[108]: *** STEP 131823.4 KILLED AT 2014-02-12T13:30:32 WITH SIGNAL 9 ***


Requesting only one core works:

$ mpirun ./mpi-test 1
4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on 106 out of 1: 0.00
4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on 106 out of 1: 0.00


Using openmpi-1.6.5 works with multiple cores:

$ mpirun -np 24 ./mpi-test 2
4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on 106 out of 24: 0.00
4.4.7 20120313 (Red Hat 4.4.7-4):Process 12 on 106 out of 24: 12.00
4.4.7 20120313 (Red Hat 4.4.7-4):Process 11 on 108 out of 24: 11.00
4.4.7 20120313 (Red Hat 4.4.7-4):Process 18 on 106 out of 24: 18.00

$ echo $SLURM_JOB_CPUS_PER_NODE
8(x3)

I have never used slurm before, so this could also be a user error on my
side. But as 1.6.5 works it seems something has changed, and I wanted to let
you know in case it was not intentional.

Adrian
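Until the scheduler side is fixed, one possible workaround is to hand mpirun
the allocation explicitly instead of letting it derive the slot count from
the SLURM_* variables (a sketch; the node names below are placeholders, since
the real hostnames are elided from the nodelist output above, and slots=8
mirrors the ppn=8 request):

  $ cat hosts
  # placeholder names - substitute the real hosts from $SLURM_JOB_NODELIST
  node107 slots=8
  node108 slots=8
  node176 slots=8
  $ mpirun -np 24 -hostfile hosts ./mpi-test 2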