Odd - I'm pretty sure it does indeed build the -L argument...and indeed, it does:
    for (nnode=0; nnode < map->nodes->size; nnode++) {
        if (NULL == (node = (orte_node_t*)opal_pointer_array_get_item(map->nodes, nnode))) {
            continue;
        }
        /* if the daemon already exists on this node, then
         * don't include it */
        if (ORTE_FLAG_TEST(node, ORTE_NODE_FLAG_DAEMON_LAUNCHED)) {
            continue;
        }
        /* otherwise, add it to the list of nodes upon which
         * we need to launch a daemon */
        opal_argv_append(&nodelist_argc, &nodelist_argv, node->name);
    }
    if (0 == opal_argv_count(nodelist_argv)) {
        orte_show_help("help-plm-alps.txt", "no-hosts-in-list", true);
        rc = ORTE_ERR_FAILED_TO_START;
        goto cleanup;
    }
    nodelist_flat = opal_argv_join(nodelist_argv, ',');
    opal_argv_free(nodelist_argv);
    /* if we are using all allocated nodes, then alps
     * doesn't need a nodelist */
    if (map->num_new_daemons < orte_num_allocated_nodes) {
        opal_argv_append(&argc, &argv, "-L");
        opal_argv_append(&argc, &argv, nodelist_flat);
    }

So maybe the --host option isn't working right for this environment? You could look at the setup_virtual_machine function in orte/mca/plm/base/plm_base_launch_support.c.

Set "-mca plm_base_verbose 100 -mca ras_base_verbose 100" and it should tell you something about how it processed the allocation to define the VM.

There is also some oddball stuff Nathan inserted to redefine node location - maybe that is getting confused when running on partial allocations? It's in the same file, in the orte_plm_base_daemon_callback routine. It could be that the daemons actually are running on the nodes you specified, but think they are somewhere else.

On Sep 12, 2014, at 11:13 AM, Pritchard Jr., Howard <howa...@lanl.gov> wrote:

> Hi Folks,
>
> So, I've got a testbed Cray system with no batch scheduler; we just use the
> native alps both as the resource manager and as the job launcher for the
> orte daemons.
>
> What I'm noticing is that the mpirun command's --host option (or any other
> mpirun mechanism for specifying the nodes to run the app on) is ignored.
>
> In this sort of environment, ORTE is going to need to figure out how to
> load up the aprun -L list_of_nids argument, but apparently it doesn't do
> that.
>
> Is this intended behavior?
>
> Example:
>
> crayadm@buffy:~/hpp> mpirun -np 2 -N 1 --debug-daemons --host nid00022,nid00021 ./my_script.sh
> plm:alps aprun -n 2 -N 1 -cc none orted -mca orte_debug_daemons 1 -mca orte_ess_jobid 337444864 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca orte_hnp_uri 337444864.0;tcp://10.128.0.3:39190
> Daemon [[5149,0],1] checking in as pid 7398 on host 20
> Daemon [[5149,0],1] checking in as pid 6540 on host 21
>
> What's happening is that alps is just doing its default thing of grabbing
> the first set of nodes it can, which on my little machine starts at
> nid00020.
>
> What I'd prefer to see with ORTE and alps is that ORTE always builds the
> explicit -L list_of_nids, so that a user can control how the orteds are
> launched, just as one can when using aprun directly - either in a
> non-alps-managed batch environment, or when a batch scheduler is managing
> things and telling alps where to launch the job.
>
> I had to use this -L feature a lot when debugging large customer system
> problems.
>
> Can I assume LANL owns the alps plm component?
>
> Howard
>
>
> -------------------------------------------------
> Howard Pritchard
> HPC-5
> Los Alamos National Laboratory
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/09/15820.php