Hi Folks,

I've got a testbed Cray system with no batch scheduler; it just uses native ALPS both as the resource manager and as the job launcher for the ORTE daemons.

What I'm noticing is that the mpirun -host option, or any other mpirun mechanism for specifying which nodes to run the app on, is ignored. In this sort of environment, ORTE needs to figure out how to fill in the aprun -L list_of_nids argument, but apparently it doesn't do that. Is this intended behavior?

Example:

crayadm@buffy:~/hpp> mpirun -np 2 -N 1 --debug-daemons --host nid00022,nid00021 ./my_script.sh
plm:alps aprun -n 2 -N 1 -cc none orted -mca orte_debug_daemons 1 -mca orte_ess_jobid 337444864 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca orte_hnp_uri 337444864.0;tcp://10.128.0.3:39190
Daemon [[5149,0],1] checking in as pid 7398 on host 20
Daemon [[5149,0],1] checking in as pid 6540 on host 21

What's happening is that ALPS is just doing its default thing of grabbing the first set of nodes it can, which on my little machine starts at nid00020, so the daemons land on nodes 20 and 21 rather than the requested 22 and 21.

What I'd prefer to see with ORTE and ALPS is that ORTE always builds the explicit -L list_of_nids, so that a user can control where the ORTE daemons are launched. That is what one can do when using aprun directly, both in a non-batch, ALPS-managed environment and when a batch scheduler is managing things and telling ALPS where to launch the job. I had to use this -L feature a lot when debugging large customer system problems.

Can I assume LANL owns the alps plm component?

Howard

-------------------------------------------------
Howard Pritchard
HPC-5
Los Alamos National Laboratory
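
P.S. To make the request concrete, here is a rough Python sketch of the host-to-NID mapping I have in mind. It is not ORTE code (the real change would be in C, in the alps plm component), and the function names hosts_to_nid_list and build_aprun_argv are made up for illustration. It assumes hosts are named nidNNNNN and that aprun -L is given a comma-separated list of NIDs.

```python
import re


def hosts_to_nid_list(hosts):
    """Convert a --host string such as 'nid00022,nid00021' into '22,21'.

    Hypothetical helper: assumes every host follows the nidNNNNN naming scheme.
    """
    nids = []
    for host in hosts.split(","):
        match = re.fullmatch(r"nid(\d+)", host.strip())
        if match is None:
            raise ValueError(f"not a nidNNNNN hostname: {host!r}")
        # int() drops the zero padding: 'nid00022' becomes 22.
        nids.append(str(int(match.group(1))))
    return ",".join(nids)


def build_aprun_argv(hosts, nprocs, per_node, daemon_argv):
    """Build an aprun command line with an explicit -L node list.

    Hypothetical helper: mirrors the aprun options seen in the debug output
    above and adds -L; daemon_argv is the orted command and its arguments.
    """
    return [
        "aprun",
        "-n", str(nprocs),
        "-N", str(per_node),
        "-cc", "none",
        "-L", hosts_to_nid_list(hosts),
    ] + list(daemon_argv)


if __name__ == "__main__":
    print(" ".join(build_aprun_argv("nid00022,nid00021", 2, 1, ["orted"])))
```

For the example above this prints "aprun -n 2 -N 1 -cc none -L 22,21 orted", which would pin the daemons to the requested nodes instead of whatever ALPS picks first.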