Odd - I'm pretty sure it does indeed build the -L argument...and indeed, it 
does:

    for (nnode=0; nnode < map->nodes->size; nnode++) {
        if (NULL == (node = (orte_node_t*)opal_pointer_array_get_item(map->nodes, nnode))) {
            continue;
        }

        /* if the daemon already exists on this node, then
         * don't include it
         */
        if (ORTE_FLAG_TEST(node, ORTE_NODE_FLAG_DAEMON_LAUNCHED)) {
            continue;
        }
        
        /* otherwise, add it to the list of nodes upon which
         * we need to launch a daemon
         */
        opal_argv_append(&nodelist_argc, &nodelist_argv, node->name);
    }
    if (0 == opal_argv_count(nodelist_argv)) {
        orte_show_help("help-plm-alps.txt", "no-hosts-in-list", true);
        rc = ORTE_ERR_FAILED_TO_START;
        goto cleanup;
    }
    nodelist_flat = opal_argv_join(nodelist_argv, ',');
    opal_argv_free(nodelist_argv);

    /* if we are using all allocated nodes, then alps
     * doesn't need a nodelist
     */
    if (map->num_new_daemons < orte_num_allocated_nodes) {
        opal_argv_append(&argc, &argv, "-L");
        opal_argv_append(&argc, &argv, nodelist_flat);
    }
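
For illustration, here is a minimal standalone sketch of what that loop ends up
handing to aprun for a two-node map. It's plain C, not ORTE code -
join_nodelist() is a made-up stand-in for opal_argv_join(), just to show the
flattening:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical stand-in for opal_argv_join(): joins a NULL-terminated
     * argv-style list with the given delimiter, the way the ALPS PLM
     * flattens the nodelist before appending it after "-L". */
    static char *join_nodelist(char **argv, char delim)
    {
        if (NULL == argv || NULL == argv[0]) {
            return NULL;
        }
        size_t len = 0;
        int i;
        for (i = 0; NULL != argv[i]; i++) {
            len += strlen(argv[i]) + 1;  /* name + delimiter (or NUL) */
        }
        char *flat = malloc(len);
        if (NULL == flat) {
            return NULL;
        }
        flat[0] = '\0';
        for (i = 0; NULL != argv[i]; i++) {
            strcat(flat, argv[i]);
            if (NULL != argv[i + 1]) {
                size_t n = strlen(flat);
                flat[n] = delim;
                flat[n + 1] = '\0';
            }
        }
        return flat;
    }

    int main(void)
    {
        /* nodes that survived the DAEMON_LAUNCHED filter above */
        char *nodelist[] = { "nid00021", "nid00022", NULL };
        char *flat = join_nodelist(nodelist, ',');
        if (NULL == flat) {
            return 1;
        }
        /* what the ALPS PLM would append to the aprun command line */
        printf("aprun -L %s ...\n", flat);  /* aprun -L nid00021,nid00022 ... */
        free(flat);
        return 0;
    }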


So maybe the --host option isn't working right in this environment? You could 
look at the setup_virtual_machine function in 
orte/mca/plm/base/plm_base_launch_support.c.

Set "-mca plm_base_verbose 100 -mca ras_base_verbose 100" and it should tell 
you something about how it processed the allocation to define the VM.
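
For example, tacking those onto the command from your mail below:

    mpirun -np 2 -N 1 --host nid00022,nid00021 \
        -mca plm_base_verbose 100 -mca ras_base_verbose 100 \
        --debug-daemons ./my_script.sh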

There is also some oddball stuff Nathan inserted to redefine node location - 
maybe that is getting confused when running on partial allocations? It's in the 
same file, in the orte_plm_base_daemon_callback routine. Could be that the 
daemons actually are running on the nodes you specified, but think they are 
somewhere else.
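
If you want to sanity-check that last possibility, here is a minimal sketch -
again plain C rather than anything in ORTE, and node_name_matches/reported_name
are made-up names - comparing what a daemon claimed at callback time against
what the kernel says the host is actually called:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>  /* gethostname() */

    /* Minimal sketch: compare the node name a daemon reported at
     * callback time ("host 20" in the output below) against the actual
     * hostname of the machine it is running on. "reported_name" is a
     * placeholder, not an ORTE symbol. */
    static int node_name_matches(const char *reported_name)
    {
        char actual[256];
        if (0 != gethostname(actual, sizeof(actual))) {
            return 0;
        }
        actual[sizeof(actual) - 1] = '\0';
        return 0 == strcmp(reported_name, actual);
    }

    int main(void)
    {
        /* the daemon below claimed "20", but on a Cray compute node
         * gethostname() typically returns something like "nid00020" */
        printf("match: %d\n", node_name_matches("20"));
        return 0;
    }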


On Sep 12, 2014, at 11:13 AM, Pritchard Jr., Howard <howa...@lanl.gov> wrote:

> Hi Folks,
>  
> So, I've got a testbed cray system with no batch scheduler - we just use the
> native alps both as the resource manager and as the job launcher for the
> orte daemons.
>  
> What I'm noticing is that specifying the nodes to run the app on - whether via
> mpirun's --host option or any other mpirun mechanism - is just ignored.
>  
> In this sort of environment, ORTE is going to need to figure out how to load
> up the aprun -L list_of_nids argument, but apparently it doesn't do that.
>  
> Is this intended behavior?
>  
> Example:
>  
> crayadm@buffy:~/hpp> mpirun -np 2 -N 1  --debug-daemons --host 
> nid00022,nid00021 ./my_script.sh
> plm:alps aprun -n 2 -N 1 -cc none orted -mca orte_debug_daemons 1 -mca 
> orte_ess_jobid 337444864 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca 
> orte_hnp_uri 337444864.0;tcp://10.128.0.3:39190
> Daemon [[5149,0],1] checking in as pid 7398 on host 20
> Daemon [[5149,0],1] checking in as pid 6540 on host 21
>  
> What’s happening is that alps is just doing its default thing of grabbing the 
> first set of nodes it can, which on
> my little machine starts at nid00020.
>  
> What I'd prefer to see with ORTE and alps is that ORTE always build the
> explicit -L list_of_nids, so that a user can control where the orteds are
> launched - just as one can when using aprun directly, both in a non
> alps-managed batch environment and when a batch scheduler is managing things
> and telling alps where to launch the job.
>  
> I had to use this -L feature a lot when debugging problems on large customer
> systems.
>  
> Can I assume LANL owns the alps plm component?
>  
> Howard
>  
>  
> -------------------------------------------------
> Howard Pritchard
> HPC-5
> Los Alamos National Laboratory
>  
>  
