Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Ralph Castain
Interesting - good to know. Thanks

On Feb 12, 2014, at 10:38 AM, Adrian Reber  wrote:

> It seems this is indeed a Moab bug for interactive jobs. At least a bug
> was opened against moab. Using non-interactive jobs the variables have
> the correct values and mpirun has no problems detecting the correct
> number of cores.
> 
> On Wed, Feb 12, 2014 at 07:50:40AM -0800, Ralph Castain wrote:
>> Another possibility to check - it is entirely possible that Moab is 
>> miscommunicating the values to Slurm. You might need to check it - I'll 
>> install a copy of 2.6.5 on my machines and see if I get similar issues when 
>> Slurm does the allocation itself.
>> 
>> On Feb 12, 2014, at 7:47 AM, Ralph Castain  wrote:
>> 
>>> 
>>> On Feb 12, 2014, at 7:32 AM, Adrian Reber  wrote:
>>> 
 
 $ msub -I -l nodes=3:ppn=8
 salloc: Job is in held state, pending scheduler release
 salloc: Pending job allocation 131828
 salloc: job 131828 queued and waiting for resources
 salloc: job 131828 has been allocated resources
 salloc: Granted job allocation 131828
 sh-4.1$ echo $SLURM_TASKS_PER_NODE 
 1
 sh-4.1$ rpm -q slurm
 slurm-2.6.5-1.el6.x86_64
 sh-4.1$ echo $SLURM_NNODES 
 1
 sh-4.1$ echo $SLURM_JOB_NODELIST 
 [107-108,176]
 sh-4.1$ echo $SLURM_JOB_CPUS_PER_NODE 
 8(x3)
 sh-4.1$ echo $SLURM_NODELIST 
 [107-108,176]
 sh-4.1$ echo $SLURM_NPROCS  
 1
 sh-4.1$ echo $SLURM_NTASKS 
 1
 sh-4.1$ echo $SLURM_TASKS_PER_NODE 
 1
 
 The information in *_NODELIST seems to make sense, but all the other
 variables (PROCS, TASKS, NODES) report '1', which seems wrong.
>>> 
>>> Indeed - and that's the problem. Slurm 2.6.5 is the most recent release, 
>>> and my guess is that SchedMD once again has changed the @$!#%#@ meaning of 
>>> their envars. Frankly, it is nearly impossible to track all the variants 
>>> they have created over the years.
>>> 
>>> Please check to see if someone did a little customizing on your end as 
>>> sometimes people do that to Slurm. Could also be they did something in the 
>>> Slurm config file that is causing the changed behavior.
>>> 
>>> Meantime, I'll try to ponder a potential solution in case this really is 
>>> the "latest" Slurm screwup.
>>> 
>>> 
 
 
 On Wed, Feb 12, 2014 at 07:19:54AM -0800, Ralph Castain wrote:
> ...and your version of Slurm?
> 
> On Feb 12, 2014, at 7:19 AM, Ralph Castain  wrote:
> 
>> What is your SLURM_TASKS_PER_NODE?
>> 
>> On Feb 12, 2014, at 6:58 AM, Adrian Reber  wrote:
>> 
>>> No, the system has only a few MOAB_* variables and many SLURM_*
>>> variables:
>>> 
>>> $BASH  $IFS  $SECONDS  $SLURM_PTY_PORT
>>> $BASHOPTS  $LINENO  $SHELL  $SLURM_PTY_WIN_COL
>>> $BASHPID  $LINES  $SHELLOPTS  $SLURM_PTY_WIN_ROW
>>> $BASH_ALIASES  $MACHTYPE  $SHLVL  $SLURM_SRUN_COMM_HOST
>>> $BASH_ARGC  $MAILCHECK  $SLURMD_NODENAME  $SLURM_SRUN_COMM_PORT
>>> $BASH_ARGV  $MOAB_CLASS  $SLURM_CHECKPOINT_IMAGE_DIR  $SLURM_STEPID
>>> $BASH_CMDS  $MOAB_GROUP  $SLURM_CONF  $SLURM_STEP_ID
>>> $BASH_COMMAND  $MOAB_JOBID  $SLURM_CPUS_ON_NODE  $SLURM_STEP_LAUNCHER_PORT
>>> $BASH_LINENO  $MOAB_NODECOUNT  $SLURM_DISTRIBUTION  $SLURM_STEP_NODELIST
>>> $BASH_SOURCE  $MOAB_PARTITION  $SLURM_GTIDS  $SLURM_STEP_NUM_NODES
>>> $BASH_SUBSHELL  $MOAB_PROCCOUNT  $SLURM_JOBID  $SLURM_STEP_NUM_TASKS
>>> $BASH_VERSINFO  $MOAB_SUBMITDIR  $SLURM_JOB_CPUS_PER_NODE  $SLURM_STEP_TASKS_PER_NODE
>>> $BASH_VERSION  $MOAB_USER  $SLURM_JOB_ID  $SLURM_SUBMIT_DIR
>>> $COLUMNS  $OPTERR  $SLURM_JOB_NODELIST  $SLURM_SUBMIT_HOST
>>> $COMP_WORDBREAKS  $OPTIND  $SLURM_JOB_NUM_NODES  $SLURM_TASKS_PER_NODE
>>> $DIRSTACK  $OSTYPE  $SLURM_LAUNCH_NODE_IPADDR  $SLURM_TASK_PID
>>> $EUID  $PATH  $SLURM_LOCALID  $SLURM_TOPOLOGY_ADDR
>>> $GROUPS

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Adrian Reber
It seems this is indeed a Moab bug for interactive jobs; at least, a bug
has been opened against Moab. With non-interactive jobs the variables have
the correct values, and mpirun has no problem detecting the correct
number of cores.
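
For anyone who wants to reproduce that comparison, a minimal batch submission
along the following lines should show the sane values that the interactive shell
does not (the script name and the variable selection are illustrative, not taken
from the original report; the msub options mirror the interactive request):

$ cat > check-slurm-env.sh <<'EOF'
#!/bin/bash
# Print the Slurm variables mpirun consults when counting available slots.
env | grep -E '^SLURM_(NNODES|NPROCS|NTASKS|TASKS_PER_NODE|JOB_CPUS_PER_NODE|JOB_NODELIST)=' | sort
EOF
$ msub -l nodes=3:ppn=8 check-slurm-env.sh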

On Wed, Feb 12, 2014 at 07:50:40AM -0800, Ralph Castain wrote:
> Another possibility to check - it is entirely possible that Moab is 
> miscommunicating the values to Slurm. You might need to check it - I'll 
> install a copy of 2.6.5 on my machines and see if I get similar issues when 
> Slurm does the allocation itself.
> 
> On Feb 12, 2014, at 7:47 AM, Ralph Castain  wrote:
> 
> > 
> > On Feb 12, 2014, at 7:32 AM, Adrian Reber  wrote:
> > 
> >> 
> >> $ msub -I -l nodes=3:ppn=8
> >> salloc: Job is in held state, pending scheduler release
> >> salloc: Pending job allocation 131828
> >> salloc: job 131828 queued and waiting for resources
> >> salloc: job 131828 has been allocated resources
> >> salloc: Granted job allocation 131828
> >> sh-4.1$ echo $SLURM_TASKS_PER_NODE 
> >> 1
> >> sh-4.1$ rpm -q slurm
> >> slurm-2.6.5-1.el6.x86_64
> >> sh-4.1$ echo $SLURM_NNODES 
> >> 1
> >> sh-4.1$ echo $SLURM_JOB_NODELIST 
> >> [107-108,176]
> >> sh-4.1$ echo $SLURM_JOB_CPUS_PER_NODE 
> >> 8(x3)
> >> sh-4.1$ echo $SLURM_NODELIST 
> >> [107-108,176]
> >> sh-4.1$ echo $SLURM_NPROCS  
> >> 1
> >> sh-4.1$ echo $SLURM_NTASKS 
> >> 1
> >> sh-4.1$ echo $SLURM_TASKS_PER_NODE 
> >> 1
> >> 
> >> The information in *_NODELIST seems to make sense, but all the other
> >> variables (PROCS, TASKS, NODES) report '1', which seems wrong.
> > 
> > Indeed - and that's the problem. Slurm 2.6.5 is the most recent release, 
> > and my guess is that SchedMD once again has changed the @$!#%#@ meaning of 
> > their envars. Frankly, it is nearly impossible to track all the variants 
> > they have created over the years.
> > 
> > Please check to see if someone did a little customizing on your end as 
> > sometimes people do that to Slurm. Could also be they did something in the 
> > Slurm config file that is causing the changed behavior.
> > 
> > Meantime, I'll try to ponder a potential solution in case this really is 
> > the "latest" Slurm screwup.
> > 
> > 
> >> 
> >> 
> >> On Wed, Feb 12, 2014 at 07:19:54AM -0800, Ralph Castain wrote:
> >>> ...and your version of Slurm?
> >>> 
> >>> On Feb 12, 2014, at 7:19 AM, Ralph Castain  wrote:
> >>> 
>  What is your SLURM_TASKS_PER_NODE?
>  
>  On Feb 12, 2014, at 6:58 AM, Adrian Reber  wrote:
>  
> > No, the system has only a few MOAB_* variables and many SLURM_*
> > variables:
> > 
> > $BASH  $IFS  $SECONDS  $SLURM_PTY_PORT
> > $BASHOPTS  $LINENO  $SHELL  $SLURM_PTY_WIN_COL
> > $BASHPID  $LINES  $SHELLOPTS  $SLURM_PTY_WIN_ROW
> > $BASH_ALIASES  $MACHTYPE  $SHLVL  $SLURM_SRUN_COMM_HOST
> > $BASH_ARGC  $MAILCHECK  $SLURMD_NODENAME  $SLURM_SRUN_COMM_PORT
> > $BASH_ARGV  $MOAB_CLASS  $SLURM_CHECKPOINT_IMAGE_DIR  $SLURM_STEPID
> > $BASH_CMDS  $MOAB_GROUP  $SLURM_CONF  $SLURM_STEP_ID
> > $BASH_COMMAND  $MOAB_JOBID  $SLURM_CPUS_ON_NODE  $SLURM_STEP_LAUNCHER_PORT
> > $BASH_LINENO  $MOAB_NODECOUNT  $SLURM_DISTRIBUTION  $SLURM_STEP_NODELIST
> > $BASH_SOURCE  $MOAB_PARTITION  $SLURM_GTIDS  $SLURM_STEP_NUM_NODES
> > $BASH_SUBSHELL  $MOAB_PROCCOUNT  $SLURM_JOBID  $SLURM_STEP_NUM_TASKS
> > $BASH_VERSINFO  $MOAB_SUBMITDIR  $SLURM_JOB_CPUS_PER_NODE  $SLURM_STEP_TASKS_PER_NODE
> > $BASH_VERSION  $MOAB_USER  $SLURM_JOB_ID  $SLURM_SUBMIT_DIR
> > $COLUMNS  $OPTERR  $SLURM_JOB_NODELIST  $SLURM_SUBMIT_HOST
> > $COMP_WORDBREAKS  $OPTIND  $SLURM_JOB_NUM_NODES  $SLURM_TASKS_PER_NODE
> > $DIRSTACK  $OSTYPE  $SLURM_LAUNCH_NODE_IPADDR  $SLURM_TASK_PID
> > $EUID  $PATH  $SLURM_LOCALID  $SLURM_TOPOLOGY_ADDR
> > $GROUPS  $POSIXLY_CORRECT  $SLURM_NNODES  $SLURM_TOPOLOGY_ADDR_PATTERN
> > $HISTCMD

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Ralph Castain
Another possibility to check: it is entirely possible that Moab is
miscommunicating the values to Slurm, so you might want to check that handoff on
your end. I'll install a copy of 2.6.5 on my machines and see whether I get
similar issues when Slurm does the allocation itself.
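
One way to sketch that cross-check, using only standard Slurm commands (the
salloc options below are simply the native equivalent of the msub request, shown
here for illustration), is to take the same allocation directly from Slurm and
print the variables it hands out:

$ salloc -N 3 --ntasks-per-node=8 \
    bash -c 'env | grep -E "^SLURM_(NNODES|NTASKS|TASKS_PER_NODE|JOB_CPUS_PER_NODE)=" | sort'

If those values come out right, the problem sits in the Moab-to-Slurm handoff
rather than in Slurm itself.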

On Feb 12, 2014, at 7:47 AM, Ralph Castain  wrote:

> 
> On Feb 12, 2014, at 7:32 AM, Adrian Reber  wrote:
> 
>> 
>> $ msub -I -l nodes=3:ppn=8
>> salloc: Job is in held state, pending scheduler release
>> salloc: Pending job allocation 131828
>> salloc: job 131828 queued and waiting for resources
>> salloc: job 131828 has been allocated resources
>> salloc: Granted job allocation 131828
>> sh-4.1$ echo $SLURM_TASKS_PER_NODE 
>> 1
>> sh-4.1$ rpm -q slurm
>> slurm-2.6.5-1.el6.x86_64
>> sh-4.1$ echo $SLURM_NNODES 
>> 1
>> sh-4.1$ echo $SLURM_JOB_NODELIST 
>> [107-108,176]
>> sh-4.1$ echo $SLURM_JOB_CPUS_PER_NODE 
>> 8(x3)
>> sh-4.1$ echo $SLURM_NODELIST 
>> [107-108,176]
>> sh-4.1$ echo $SLURM_NPROCS  
>> 1
>> sh-4.1$ echo $SLURM_NTASKS 
>> 1
>> sh-4.1$ echo $SLURM_TASKS_PER_NODE 
>> 1
>> 
>> The information in *_NODELIST seems to make sense, but all the other
>> variables (PROCS, TASKS, NODES) report '1', which seems wrong.
> 
> Indeed - and that's the problem. Slurm 2.6.5 is the most recent release, and 
> my guess is that SchedMD once again has changed the @$!#%#@ meaning of their 
> envars. Frankly, it is nearly impossible to track all the variants they have 
> created over the years.
> 
> Please check to see if someone did a little customizing on your end as 
> sometimes people do that to Slurm. Could also be they did something in the 
> Slurm config file that is causing the changed behavior.
> 
> Meantime, I'll try to ponder a potential solution in case this really is the 
> "latest" Slurm screwup.
> 
> 
>> 
>> 
>> On Wed, Feb 12, 2014 at 07:19:54AM -0800, Ralph Castain wrote:
>>> ...and your version of Slurm?
>>> 
>>> On Feb 12, 2014, at 7:19 AM, Ralph Castain  wrote:
>>> 
 What is your SLURM_TASKS_PER_NODE?
 
 On Feb 12, 2014, at 6:58 AM, Adrian Reber  wrote:
 
> No, the system has only a few MOAB_* variables and many SLURM_*
> variables:
> 
> $BASH  $IFS  $SECONDS  $SLURM_PTY_PORT
> $BASHOPTS  $LINENO  $SHELL  $SLURM_PTY_WIN_COL
> $BASHPID  $LINES  $SHELLOPTS  $SLURM_PTY_WIN_ROW
> $BASH_ALIASES  $MACHTYPE  $SHLVL  $SLURM_SRUN_COMM_HOST
> $BASH_ARGC  $MAILCHECK  $SLURMD_NODENAME  $SLURM_SRUN_COMM_PORT
> $BASH_ARGV  $MOAB_CLASS  $SLURM_CHECKPOINT_IMAGE_DIR  $SLURM_STEPID
> $BASH_CMDS  $MOAB_GROUP  $SLURM_CONF  $SLURM_STEP_ID
> $BASH_COMMAND  $MOAB_JOBID  $SLURM_CPUS_ON_NODE  $SLURM_STEP_LAUNCHER_PORT
> $BASH_LINENO  $MOAB_NODECOUNT  $SLURM_DISTRIBUTION  $SLURM_STEP_NODELIST
> $BASH_SOURCE  $MOAB_PARTITION  $SLURM_GTIDS  $SLURM_STEP_NUM_NODES
> $BASH_SUBSHELL  $MOAB_PROCCOUNT  $SLURM_JOBID  $SLURM_STEP_NUM_TASKS
> $BASH_VERSINFO  $MOAB_SUBMITDIR  $SLURM_JOB_CPUS_PER_NODE  $SLURM_STEP_TASKS_PER_NODE
> $BASH_VERSION  $MOAB_USER  $SLURM_JOB_ID  $SLURM_SUBMIT_DIR
> $COLUMNS  $OPTERR  $SLURM_JOB_NODELIST  $SLURM_SUBMIT_HOST
> $COMP_WORDBREAKS  $OPTIND  $SLURM_JOB_NUM_NODES  $SLURM_TASKS_PER_NODE
> $DIRSTACK  $OSTYPE  $SLURM_LAUNCH_NODE_IPADDR  $SLURM_TASK_PID
> $EUID  $PATH  $SLURM_LOCALID  $SLURM_TOPOLOGY_ADDR
> $GROUPS  $POSIXLY_CORRECT  $SLURM_NNODES  $SLURM_TOPOLOGY_ADDR_PATTERN
> $HISTCMD  $PPID  $SLURM_NODEID  $SRUN_DEBUG
> $HISTFILE  $PS1  $SLURM_NODELIST  $TERM
> $HISTFILESIZE  $PS2  $SLURM_NPROCS  $TMPDIR
> $HISTSIZE  $PS4  $SLURM_NTASKS  $UID
> $HOSTNAME  $PWD
> 

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Ralph Castain

On Feb 12, 2014, at 7:32 AM, Adrian Reber  wrote:

> 
> $ msub -I -l nodes=3:ppn=8
> salloc: Job is in held state, pending scheduler release
> salloc: Pending job allocation 131828
> salloc: job 131828 queued and waiting for resources
> salloc: job 131828 has been allocated resources
> salloc: Granted job allocation 131828
> sh-4.1$ echo $SLURM_TASKS_PER_NODE 
> 1
> sh-4.1$ rpm -q slurm
> slurm-2.6.5-1.el6.x86_64
> sh-4.1$ echo $SLURM_NNODES 
> 1
> sh-4.1$ echo $SLURM_JOB_NODELIST 
> [107-108,176]
> sh-4.1$ echo $SLURM_JOB_CPUS_PER_NODE 
> 8(x3)
> sh-4.1$ echo $SLURM_NODELIST 
> [107-108,176]
> sh-4.1$ echo $SLURM_NPROCS  
> 1
> sh-4.1$ echo $SLURM_NTASKS 
> 1
> sh-4.1$ echo $SLURM_TASKS_PER_NODE 
> 1
> 
> The information in *_NODELIST seems to make sense, but all the other
> variables (PROCS, TASKS, NODES) report '1', which seems wrong.

Indeed - and that's the problem. Slurm 2.6.5 is the most recent release, and my 
guess is that SchedMD once again has changed the @$!#%#@ meaning of their 
envars. Frankly, it is nearly impossible to track all the variants they have 
created over the years.

Please check whether someone did a little customizing on your end, as people 
sometimes do that to Slurm. It could also be that something in the Slurm config 
file is causing the changed behavior.
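
A quick way to rule out local customization, sketched here with stock Slurm
commands (nothing site-specific assumed, and field names can differ slightly
between Slurm releases), is to compare what the controller thinks the job has
with what the interactive shell was given:

$ scontrol show job $SLURM_JOB_ID | grep -Eo 'NumNodes=[^ ]+|NumCPUs=[^ ]+'
$ env | grep -E '^SLURM_(NNODES|NTASKS|TASKS_PER_NODE)='
$ scontrol show config | grep -iE 'prolog|taskplugin'

If scontrol reports 3 nodes and 24 CPUs while the environment says 1, the
allocation itself is fine and only the exported variables are wrong.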

Meantime, I'll try to ponder a potential solution in case this really is the 
"latest" Slurm screwup.


> 
> 
> On Wed, Feb 12, 2014 at 07:19:54AM -0800, Ralph Castain wrote:
>> ...and your version of Slurm?
>> 
>> On Feb 12, 2014, at 7:19 AM, Ralph Castain  wrote:
>> 
>>> What is your SLURM_TASKS_PER_NODE?
>>> 
>>> On Feb 12, 2014, at 6:58 AM, Adrian Reber  wrote:
>>> 
 No, the system has only a few MOAB_* variables and many SLURM_*
 variables:
 
 $BASH  $IFS  $SECONDS  $SLURM_PTY_PORT
 $BASHOPTS  $LINENO  $SHELL  $SLURM_PTY_WIN_COL
 $BASHPID  $LINES  $SHELLOPTS  $SLURM_PTY_WIN_ROW
 $BASH_ALIASES  $MACHTYPE  $SHLVL  $SLURM_SRUN_COMM_HOST
 $BASH_ARGC  $MAILCHECK  $SLURMD_NODENAME  $SLURM_SRUN_COMM_PORT
 $BASH_ARGV  $MOAB_CLASS  $SLURM_CHECKPOINT_IMAGE_DIR  $SLURM_STEPID
 $BASH_CMDS  $MOAB_GROUP  $SLURM_CONF  $SLURM_STEP_ID
 $BASH_COMMAND  $MOAB_JOBID  $SLURM_CPUS_ON_NODE  $SLURM_STEP_LAUNCHER_PORT
 $BASH_LINENO  $MOAB_NODECOUNT  $SLURM_DISTRIBUTION  $SLURM_STEP_NODELIST
 $BASH_SOURCE  $MOAB_PARTITION  $SLURM_GTIDS  $SLURM_STEP_NUM_NODES
 $BASH_SUBSHELL  $MOAB_PROCCOUNT  $SLURM_JOBID  $SLURM_STEP_NUM_TASKS
 $BASH_VERSINFO  $MOAB_SUBMITDIR  $SLURM_JOB_CPUS_PER_NODE  $SLURM_STEP_TASKS_PER_NODE
 $BASH_VERSION  $MOAB_USER  $SLURM_JOB_ID  $SLURM_SUBMIT_DIR
 $COLUMNS  $OPTERR  $SLURM_JOB_NODELIST  $SLURM_SUBMIT_HOST
 $COMP_WORDBREAKS  $OPTIND  $SLURM_JOB_NUM_NODES  $SLURM_TASKS_PER_NODE
 $DIRSTACK  $OSTYPE  $SLURM_LAUNCH_NODE_IPADDR  $SLURM_TASK_PID
 $EUID  $PATH  $SLURM_LOCALID  $SLURM_TOPOLOGY_ADDR
 $GROUPS  $POSIXLY_CORRECT  $SLURM_NNODES  $SLURM_TOPOLOGY_ADDR_PATTERN
 $HISTCMD  $PPID  $SLURM_NODEID  $SRUN_DEBUG
 $HISTFILE  $PS1  $SLURM_NODELIST  $TERM
 $HISTFILESIZE  $PS2  $SLURM_NPROCS  $TMPDIR
 $HISTSIZE  $PS4  $SLURM_NTASKS  $UID
 $HOSTNAME  $PWD  $SLURM_PRIO_PROCESS  $_
 $HOSTTYPE  $RANDOM  $SLURM_PROCID

 
 
 
 On Wed, Feb 12, 2014 at 06:12:45AM -0800, Ralph Castain wrote:
> Seems rather odd - since this is managed by Moab, you shouldn't be seeing 
> SLURM envars at all. What you should see are PBS_* envars, including a 
> PBS_NODEFILE that actually contains the allocation.
> 
> 
> On 

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Adrian Reber

$ msub -I -l nodes=3:ppn=8
salloc: Job is in held state, pending scheduler release
salloc: Pending job allocation 131828
salloc: job 131828 queued and waiting for resources
salloc: job 131828 has been allocated resources
salloc: Granted job allocation 131828
sh-4.1$ echo $SLURM_TASKS_PER_NODE 
1
sh-4.1$ rpm -q slurm
slurm-2.6.5-1.el6.x86_64
sh-4.1$ echo $SLURM_NNODES 
1
sh-4.1$ echo $SLURM_JOB_NODELIST 
[107-108,176]
sh-4.1$ echo $SLURM_JOB_CPUS_PER_NODE 
8(x3)
sh-4.1$ echo $SLURM_NODELIST 
[107-108,176]
sh-4.1$ echo $SLURM_NPROCS  
1
sh-4.1$ echo $SLURM_NTASKS 
1
sh-4.1$ echo $SLURM_TASKS_PER_NODE 
1

The information in *_NODELIST seems to make sense, but all the other
variables (PROCS, TASKS, NODES) report '1', which seems wrong.
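
To make the mismatch concrete, here is a small sketch (not part of the original
report) that expands Slurm's compressed N(xM) notation from
SLURM_JOB_CPUS_PER_NODE and compares the total with SLURM_NTASKS; for the
session above it would print 24 CPUs against NTASKS=1:

total=0
IFS=',' read -ra entries <<< "$SLURM_JOB_CPUS_PER_NODE"
for e in "${entries[@]}"; do
    # "8(x3)" means three nodes with 8 CPUs each; a bare "8" means one node.
    if [[ $e =~ ^([0-9]+)(\(x([0-9]+)\))?$ ]]; then
        total=$(( total + BASH_REMATCH[1] * ${BASH_REMATCH[3]:-1} ))
    fi
done
echo "CPUs in allocation: $total   SLURM_NTASKS: $SLURM_NTASKS"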


On Wed, Feb 12, 2014 at 07:19:54AM -0800, Ralph Castain wrote:
> ...and your version of Slurm?
> 
> On Feb 12, 2014, at 7:19 AM, Ralph Castain  wrote:
> 
> > What is your SLURM_TASKS_PER_NODE?
> > 
> > On Feb 12, 2014, at 6:58 AM, Adrian Reber  wrote:
> > 
> >> No, the system has only a few MOAB_* variables and many SLURM_*
> >> variables:
> >> 
> >> $BASH  $IFS  $SECONDS  $SLURM_PTY_PORT
> >> $BASHOPTS  $LINENO  $SHELL  $SLURM_PTY_WIN_COL
> >> $BASHPID  $LINES  $SHELLOPTS  $SLURM_PTY_WIN_ROW
> >> $BASH_ALIASES  $MACHTYPE  $SHLVL  $SLURM_SRUN_COMM_HOST
> >> $BASH_ARGC  $MAILCHECK  $SLURMD_NODENAME  $SLURM_SRUN_COMM_PORT
> >> $BASH_ARGV  $MOAB_CLASS  $SLURM_CHECKPOINT_IMAGE_DIR  $SLURM_STEPID
> >> $BASH_CMDS  $MOAB_GROUP  $SLURM_CONF  $SLURM_STEP_ID
> >> $BASH_COMMAND  $MOAB_JOBID  $SLURM_CPUS_ON_NODE  $SLURM_STEP_LAUNCHER_PORT
> >> $BASH_LINENO  $MOAB_NODECOUNT  $SLURM_DISTRIBUTION  $SLURM_STEP_NODELIST
> >> $BASH_SOURCE  $MOAB_PARTITION  $SLURM_GTIDS  $SLURM_STEP_NUM_NODES
> >> $BASH_SUBSHELL  $MOAB_PROCCOUNT  $SLURM_JOBID  $SLURM_STEP_NUM_TASKS
> >> $BASH_VERSINFO  $MOAB_SUBMITDIR  $SLURM_JOB_CPUS_PER_NODE  $SLURM_STEP_TASKS_PER_NODE
> >> $BASH_VERSION  $MOAB_USER  $SLURM_JOB_ID  $SLURM_SUBMIT_DIR
> >> $COLUMNS  $OPTERR  $SLURM_JOB_NODELIST  $SLURM_SUBMIT_HOST
> >> $COMP_WORDBREAKS  $OPTIND  $SLURM_JOB_NUM_NODES  $SLURM_TASKS_PER_NODE
> >> $DIRSTACK  $OSTYPE  $SLURM_LAUNCH_NODE_IPADDR  $SLURM_TASK_PID
> >> $EUID  $PATH  $SLURM_LOCALID  $SLURM_TOPOLOGY_ADDR
> >> $GROUPS  $POSIXLY_CORRECT  $SLURM_NNODES  $SLURM_TOPOLOGY_ADDR_PATTERN
> >> $HISTCMD  $PPID  $SLURM_NODEID  $SRUN_DEBUG
> >> $HISTFILE  $PS1  $SLURM_NODELIST  $TERM
> >> $HISTFILESIZE  $PS2  $SLURM_NPROCS  $TMPDIR
> >> $HISTSIZE  $PS4  $SLURM_NTASKS  $UID
> >> $HOSTNAME  $PWD  $SLURM_PRIO_PROCESS  $_
> >> $HOSTTYPE  $RANDOM  $SLURM_PROCID
> >>
> >> 
> >> 
> >> 
> >> On Wed, Feb 12, 2014 at 06:12:45AM -0800, Ralph Castain wrote:
> >>> Seems rather odd - since this is managed by Moab, you shouldn't be seeing 
> >>> SLURM envars at all. What you should see are PBS_* envars, including a 
> >>> PBS_NODEFILE that actually contains the allocation.
> >>> 
> >>> 
> >>> On Feb 12, 2014, at 4:42 AM, Adrian Reber  wrote:
> >>> 
>  I tried the nightly snapshot (openmpi-1.7.5a1r30692.tar.gz) on a system
>  with slurm and moab. I requested an interactive session using:
>  
>  msub -I -l nodes=3:ppn=8
>  
>  and started a simple test case which fails:
>  
>  $ mpirun -np 2 ./mpi-test 1
>  --
>  There are not enough slots available in the system to satisfy the 2 
>  slots 
>  that were requested by the application:
>  ./mpi-test
>  
>  Either request fewer slots for your application, or make more slots 
>  available
>  

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Ralph Castain
What is your SLURM_TASKS_PER_NODE?

On Feb 12, 2014, at 6:58 AM, Adrian Reber  wrote:

> No, the system has only a few MOAB_* variables and many SLURM_*
> variables:
> 
> $BASH  $IFS  $SECONDS  $SLURM_PTY_PORT
> $BASHOPTS  $LINENO  $SHELL  $SLURM_PTY_WIN_COL
> $BASHPID  $LINES  $SHELLOPTS  $SLURM_PTY_WIN_ROW
> $BASH_ALIASES  $MACHTYPE  $SHLVL  $SLURM_SRUN_COMM_HOST
> $BASH_ARGC  $MAILCHECK  $SLURMD_NODENAME  $SLURM_SRUN_COMM_PORT
> $BASH_ARGV  $MOAB_CLASS  $SLURM_CHECKPOINT_IMAGE_DIR  $SLURM_STEPID
> $BASH_CMDS  $MOAB_GROUP  $SLURM_CONF  $SLURM_STEP_ID
> $BASH_COMMAND  $MOAB_JOBID  $SLURM_CPUS_ON_NODE  $SLURM_STEP_LAUNCHER_PORT
> $BASH_LINENO  $MOAB_NODECOUNT  $SLURM_DISTRIBUTION  $SLURM_STEP_NODELIST
> $BASH_SOURCE  $MOAB_PARTITION  $SLURM_GTIDS  $SLURM_STEP_NUM_NODES
> $BASH_SUBSHELL  $MOAB_PROCCOUNT  $SLURM_JOBID  $SLURM_STEP_NUM_TASKS
> $BASH_VERSINFO  $MOAB_SUBMITDIR  $SLURM_JOB_CPUS_PER_NODE  $SLURM_STEP_TASKS_PER_NODE
> $BASH_VERSION  $MOAB_USER  $SLURM_JOB_ID  $SLURM_SUBMIT_DIR
> $COLUMNS  $OPTERR  $SLURM_JOB_NODELIST  $SLURM_SUBMIT_HOST
> $COMP_WORDBREAKS  $OPTIND  $SLURM_JOB_NUM_NODES  $SLURM_TASKS_PER_NODE
> $DIRSTACK  $OSTYPE  $SLURM_LAUNCH_NODE_IPADDR  $SLURM_TASK_PID
> $EUID  $PATH  $SLURM_LOCALID  $SLURM_TOPOLOGY_ADDR
> $GROUPS  $POSIXLY_CORRECT  $SLURM_NNODES  $SLURM_TOPOLOGY_ADDR_PATTERN
> $HISTCMD  $PPID  $SLURM_NODEID  $SRUN_DEBUG
> $HISTFILE  $PS1  $SLURM_NODELIST  $TERM
> $HISTFILESIZE  $PS2  $SLURM_NPROCS  $TMPDIR
> $HISTSIZE  $PS4  $SLURM_NTASKS  $UID
> $HOSTNAME  $PWD  $SLURM_PRIO_PROCESS  $_
> $HOSTTYPE  $RANDOM  $SLURM_PROCID
> 
> 
> 
> 
> On Wed, Feb 12, 2014 at 06:12:45AM -0800, Ralph Castain wrote:
>> Seems rather odd - since this is managed by Moab, you shouldn't be seeing 
>> SLURM envars at all. What you should see are PBS_* envars, including a 
>> PBS_NODEFILE that actually contains the allocation.
>> 
>> 
>> On Feb 12, 2014, at 4:42 AM, Adrian Reber  wrote:
>> 
>>> I tried the nightly snapshot (openmpi-1.7.5a1r30692.tar.gz) on a system
>>> with slurm and moab. I requested an interactive session using:
>>> 
>>> msub -I -l nodes=3:ppn=8
>>> 
>>> and started a simple test case which fails:
>>> 
>>> $ mpirun -np 2 ./mpi-test 1
>>> --
>>> There are not enough slots available in the system to satisfy the 2 slots 
>>> that were requested by the application:
>>> ./mpi-test
>>> 
>>> Either request fewer slots for your application, or make more slots 
>>> available
>>> for use.
>>> --
>>> srun: error: 108: task 1: Exited with exit code 1
>>> srun: Terminating job step 131823.4
>>> srun: error: 107: task 0: Exited with exit code 1
>>> srun: Job step aborted
>>> slurmd[108]: *** STEP 131823.4 KILLED AT 2014-02-12T13:30:32 WITH 
>>> SIGNAL 9 ***
>>> 
>>> 
>>> requesting only one core works:
>>> 
>>> $ mpirun  ./mpi-test 1
>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on 106 out of 1: 0.00
>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on 106 out of 1: 0.00
>>> 
>>> 
>>> using openmpi-1.6.5 works with multiple cores:
>>> 
>>> $ mpirun -np 24 ./mpi-test 2
>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on 106 out of 24: 0.00
>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 12 on 106 out of 24: 12.00
>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 11 on 108 out of 24: 11.00
>>> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 18 on 106 out of 24: 18.00
>>> 
>>> $ echo $SLURM_JOB_CPUS_PER_NODE 
>>> 8(x3)
>>> 
>>> I never used slurm before so this could also be a user error on my side.
>>> But as 1.6.5 works it seems 

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Adrian Reber
No, the system has only a few MOAB_* variables and many SLURM_*
variables:

$BASH  $IFS  $SECONDS  $SLURM_PTY_PORT
$BASHOPTS  $LINENO  $SHELL  $SLURM_PTY_WIN_COL
$BASHPID  $LINES  $SHELLOPTS  $SLURM_PTY_WIN_ROW
$BASH_ALIASES  $MACHTYPE  $SHLVL  $SLURM_SRUN_COMM_HOST
$BASH_ARGC  $MAILCHECK  $SLURMD_NODENAME  $SLURM_SRUN_COMM_PORT
$BASH_ARGV  $MOAB_CLASS  $SLURM_CHECKPOINT_IMAGE_DIR  $SLURM_STEPID
$BASH_CMDS  $MOAB_GROUP  $SLURM_CONF  $SLURM_STEP_ID
$BASH_COMMAND  $MOAB_JOBID  $SLURM_CPUS_ON_NODE  $SLURM_STEP_LAUNCHER_PORT
$BASH_LINENO  $MOAB_NODECOUNT  $SLURM_DISTRIBUTION  $SLURM_STEP_NODELIST
$BASH_SOURCE  $MOAB_PARTITION  $SLURM_GTIDS  $SLURM_STEP_NUM_NODES
$BASH_SUBSHELL  $MOAB_PROCCOUNT  $SLURM_JOBID  $SLURM_STEP_NUM_TASKS
$BASH_VERSINFO  $MOAB_SUBMITDIR  $SLURM_JOB_CPUS_PER_NODE  $SLURM_STEP_TASKS_PER_NODE
$BASH_VERSION  $MOAB_USER  $SLURM_JOB_ID  $SLURM_SUBMIT_DIR
$COLUMNS  $OPTERR  $SLURM_JOB_NODELIST  $SLURM_SUBMIT_HOST
$COMP_WORDBREAKS  $OPTIND  $SLURM_JOB_NUM_NODES  $SLURM_TASKS_PER_NODE
$DIRSTACK  $OSTYPE  $SLURM_LAUNCH_NODE_IPADDR  $SLURM_TASK_PID
$EUID  $PATH  $SLURM_LOCALID  $SLURM_TOPOLOGY_ADDR
$GROUPS  $POSIXLY_CORRECT  $SLURM_NNODES  $SLURM_TOPOLOGY_ADDR_PATTERN
$HISTCMD  $PPID  $SLURM_NODEID  $SRUN_DEBUG
$HISTFILE  $PS1  $SLURM_NODELIST  $TERM
$HISTFILESIZE  $PS2  $SLURM_NPROCS  $TMPDIR
$HISTSIZE  $PS4  $SLURM_NTASKS  $UID
$HOSTNAME  $PWD  $SLURM_PRIO_PROCESS  $_
$HOSTTYPE  $RANDOM  $SLURM_PROCID
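
As a purely illustrative side check, the Moab counters can be printed next to
the Slurm ones; if Moab's own view of the job (MOAB_NODECOUNT/MOAB_PROCCOUNT)
matches the request while the SLURM_* values say 1, that points squarely at the
translation between the two systems:

$ echo "Moab:  nodes=$MOAB_NODECOUNT  procs=$MOAB_PROCCOUNT"
$ echo "Slurm: nodes=$SLURM_NNODES  tasks=$SLURM_NTASKS  per-node=$SLURM_TASKS_PER_NODE"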
  



On Wed, Feb 12, 2014 at 06:12:45AM -0800, Ralph Castain wrote:
> Seems rather odd - since this is managed by Moab, you shouldn't be seeing 
> SLURM envars at all. What you should see are PBS_* envars, including a 
> PBS_NODEFILE that actually contains the allocation.
> 
> 
> On Feb 12, 2014, at 4:42 AM, Adrian Reber  wrote:
> 
> > I tried the nightly snapshot (openmpi-1.7.5a1r30692.tar.gz) on a system
> > with slurm and moab. I requested an interactive session using:
> > 
> > msub -I -l nodes=3:ppn=8
> > 
> > and started a simple test case which fails:
> > 
> > $ mpirun -np 2 ./mpi-test 1
> > --
> > There are not enough slots available in the system to satisfy the 2 slots 
> > that were requested by the application:
> >  ./mpi-test
> > 
> > Either request fewer slots for your application, or make more slots 
> > available
> > for use.
> > --
> > srun: error: 108: task 1: Exited with exit code 1
> > srun: Terminating job step 131823.4
> > srun: error: 107: task 0: Exited with exit code 1
> > srun: Job step aborted
> > slurmd[108]: *** STEP 131823.4 KILLED AT 2014-02-12T13:30:32 WITH 
> > SIGNAL 9 ***
> > 
> > 
> > requesting only one core works:
> > 
> > $ mpirun  ./mpi-test 1
> > 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on 106 out of 1: 0.00
> > 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on 106 out of 1: 0.00
> > 
> > 
> > using openmpi-1.6.5 works with multiple cores:
> > 
> > $ mpirun -np 24 ./mpi-test 2
> > 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on 106 out of 24: 0.00
> > 4.4.7 20120313 (Red Hat 4.4.7-4):Process 12 on 106 out of 24: 12.00
> > 4.4.7 20120313 (Red Hat 4.4.7-4):Process 11 on 108 out of 24: 11.00
> > 4.4.7 20120313 (Red Hat 4.4.7-4):Process 18 on 106 out of 24: 18.00
> > 
> > $ echo $SLURM_JOB_CPUS_PER_NODE 
> > 8(x3)
> > 
> > I never used slurm before so this could also be a user error on my side.
> > But as 1.6.5 works it seems something has changed and wanted to let
> > you know in case it was not intentionally.
> > 
> > Adrian

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Ralph Castain
Seems rather odd - since this is managed by Moab, you shouldn't be seeing SLURM 
envars at all. What you should see are PBS_* envars, including a PBS_NODEFILE 
that actually contains the allocation.
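
A quick inventory along these lines (a generic sketch, not from the original
thread) shows which resource-manager interface the shell actually received;
under a Torque/PBS-style integration the second command would list the allocated
nodes, one line per slot, whereas in the environment reported here PBS_NODEFILE
is apparently not set at all:

$ env | grep -E '^(PBS|SLURM|MOAB)_' | cut -d= -f1 | sort
$ cat "$PBS_NODEFILE"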


On Feb 12, 2014, at 4:42 AM, Adrian Reber  wrote:

> I tried the nightly snapshot (openmpi-1.7.5a1r30692.tar.gz) on a system
> with slurm and moab. I requested an interactive session using:
> 
> msub -I -l nodes=3:ppn=8
> 
> and started a simple test case which fails:
> 
> $ mpirun -np 2 ./mpi-test 1
> --
> There are not enough slots available in the system to satisfy the 2 slots 
> that were requested by the application:
>  ./mpi-test
> 
> Either request fewer slots for your application, or make more slots available
> for use.
> --
> srun: error: 108: task 1: Exited with exit code 1
> srun: Terminating job step 131823.4
> srun: error: 107: task 0: Exited with exit code 1
> srun: Job step aborted
> slurmd[108]: *** STEP 131823.4 KILLED AT 2014-02-12T13:30:32 WITH SIGNAL 
> 9 ***
> 
> 
> requesting only one core works:
> 
> $ mpirun  ./mpi-test 1
> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on 106 out of 1: 0.00
> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on 106 out of 1: 0.00
> 
> 
> using openmpi-1.6.5 works with multiple cores:
> 
> $ mpirun -np 24 ./mpi-test 2
> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on 106 out of 24: 0.00
> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 12 on 106 out of 24: 12.00
> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 11 on 108 out of 24: 11.00
> 4.4.7 20120313 (Red Hat 4.4.7-4):Process 18 on 106 out of 24: 18.00
> 
> $ echo $SLURM_JOB_CPUS_PER_NODE 
> 8(x3)
> 
> I never used slurm before so this could also be a user error on my side.
> But as 1.6.5 works it seems something has changed and wanted to let
> you know in case it was not intentionally.
> 
>   Adrian



[OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Adrian Reber
I tried the nightly snapshot (openmpi-1.7.5a1r30692.tar.gz) on a system
with Slurm and Moab. I requested an interactive session using:

msub -I -l nodes=3:ppn=8

and started a simple test case which fails:

$ mpirun -np 2 ./mpi-test 1
--
There are not enough slots available in the system to satisfy the 2 slots 
that were requested by the application:
  ./mpi-test

Either request fewer slots for your application, or make more slots available
for use.
--
srun: error: 108: task 1: Exited with exit code 1
srun: Terminating job step 131823.4
srun: error: 107: task 0: Exited with exit code 1
srun: Job step aborted
slurmd[108]: *** STEP 131823.4 KILLED AT 2014-02-12T13:30:32 WITH SIGNAL 9 
***
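
As a stopgap while this is being sorted out, one can hand mpirun an explicit
hostfile built from the allocation instead of relying on the SLURM_* variables.
This is only a sketch: the file name is made up, slots=8 hard-codes the ppn=8
request, and whether mpirun accepts the hostfile depends on how it reconciles it
with the allocation it detects:

$ scontrol show hostnames "$SLURM_JOB_NODELIST" | sed 's/$/ slots=8/' > my_hostfile
$ mpirun -np 24 --hostfile my_hostfile ./mpi-test 2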


requesting only one core works:

$ mpirun  ./mpi-test 1
4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on 106 out of 1: 0.00
4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on 106 out of 1: 0.00


using openmpi-1.6.5 works with multiple cores:

$ mpirun -np 24 ./mpi-test 2
4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on 106 out of 24: 0.00
4.4.7 20120313 (Red Hat 4.4.7-4):Process 12 on 106 out of 24: 12.00
4.4.7 20120313 (Red Hat 4.4.7-4):Process 11 on 108 out of 24: 11.00
4.4.7 20120313 (Red Hat 4.4.7-4):Process 18 on 106 out of 24: 18.00

$ echo $SLURM_JOB_CPUS_PER_NODE 
8(x3)

I have never used Slurm before, so this could also be a user error on my side.
But as 1.6.5 works, it seems something has changed, and I wanted to let
you know in case the change was not intentional.

Adrian