Hi Folks,

So, I've got a testbed Cray system with no batch scheduler; it just uses native
alps both as the resource manager and as the job launcher for the orte daemons.

What I'm noticing is that mpirun's --host option, or any other mpirun-level
way of specifying which nodes to run the app on, is ignored.

In this sort of environment, ORTE is going to need to figure out how to load up
the aprun -L list_of_nids argument, but apparently doesn't do that.

Is this intended behavior?

Example:

crayadm@buffy:~/hpp> mpirun -np 2 -N 1 --debug-daemons --host nid00022,nid00021 ./my_script.sh
plm:alps aprun -n 2 -N 1 -cc none orted -mca orte_debug_daemons 1 -mca orte_ess_jobid 337444864 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca orte_hnp_uri 337444864.0;tcp://10.128.0.3:39190
Daemon [[5149,0],1] checking in as pid 7398 on host 20
Daemon [[5149,0],1] checking in as pid 6540 on host 21

What's happening is that alps is just doing its default thing of grabbing the
first set of nodes it can, which on my little machine starts at nid00020.

What I'd prefer to see is that ORTE always builds the explicit -L list_of_nids
argument for alps, so that a user can control where the orteds are launched.
That is what one can do when using aprun directly, both in an environment with
no batch scheduler and when a batch scheduler is managing things and telling
alps where to launch the job.
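
As a rough illustration of the mapping being asked for (a Python sketch, not
ORTE code), the --host names could be translated into the numeric nid list
that aprun -L accepts. The function names below are hypothetical, and the
sketch assumes hostnames of the form nidNNNNN whose numeric part is the nid,
as the daemon output above suggests:

```python
import re


def hosts_to_nid_list(hosts):
    """Turn a --host style string such as 'nid00022,nid00021' into '22,21'.

    Assumption: every hostname is 'nid' followed by digits, and the digits
    (leading zeros dropped) are the nid that aprun -L expects.
    """
    nids = []
    for name in hosts.split(","):
        match = re.fullmatch(r"nid(\d+)", name.strip())
        if match is None:
            raise ValueError(f"not a nidNNNNN hostname: {name!r}")
        nids.append(str(int(match.group(1))))
    return ",".join(nids)


def build_aprun_argv(hosts, nprocs, per_node):
    """Build an aprun command line like the one plm:alps printed, plus -L.

    The orted -mca arguments are left out to keep the sketch short.
    """
    return [
        "aprun", "-n", str(nprocs), "-N", str(per_node), "-cc", "none",
        "-L", hosts_to_nid_list(hosts),
        "orted",
    ]


if __name__ == "__main__":
    print(" ".join(build_aprun_argv("nid00022,nid00021", 2, 1)))
    # prints: aprun -n 2 -N 1 -cc none -L 22,21 orted
```

With the example above, the launch would then carry -L 22,21 instead of
letting alps fall back to its default placement starting at nid00020.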

I had to use this -L feature a lot when debugging large customer system problems.

Can I assume LANL owns the alps plm component?

Howard


-------------------------------------------------
Howard Pritchard
HPC-5
Los Alamos National Laboratory

