Hello,

I think I am ready to return to the mpirun affinity handling discussion. I
now have a more general solution. It is only beta-tested so far (on a single
cluster running SLURM with the cgroup plugin), but it shows my main idea, and
if it is worth including in the mainstream I am ready to polish and improve it.

The code respects the SLURM CPU allocation exposed through
SLURM_JOB_CPUS_PER_NODE and handles cases like the following correctly:
SLURM_JOB_CPUS_PER_NODE='12(x3),7'
SLURM_JOB_NODELIST=node[0-3]
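
To make the parsing explicit, here is a small standalone sketch (not part of
the patch, just an illustration) of how the compressed "count(xreps)" notation
can be expanded into per-node CPU counts; the patch does essentially the same
thing in process_task_or_cpus() and then builds the groups in
plm_slurm_group_nodes():

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* example value from above; node names would come from SLURM_JOB_NODELIST */
    const char *spec = "12(x3),7";
    int cpus[64];              /* per-node CPU counts: cpus[i] belongs to node i */
    int i, n = 0;
    const char *p = spec;
    char *end;

    while (*p != '\0' && n < 64) {
        long count = strtol(p, &end, 10);
        long reps  = 1;
        if (end[0] == '(' && end[1] == 'x') {    /* compressed "count(xreps)" form */
            reps = strtol(end + 2, &end, 10);
            if (*end == ')')
                end++;
        }
        while (reps-- > 0 && n < 64)
            cpus[n++] = (int)count;              /* expand the repetition */
        if (*end != ',')
            break;                               /* end of the list */
        p = end + 1;
    }

    for (i = 0; i < n; i++)
        printf("node%d: %d cpus\n", i, cpus[i]);
    return 0;
}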

It splits the node list into groups of nodes that have the same number of
CPUs. In the example above we get two groups:
1) node0, node1, node2 with 12 CPUs each
2) node3 with 7 CPUs.

It then launches a separate srun for each group.
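
For the example allocation above, the generated launch commands would look
roughly like this (orted arguments omitted; the exact flags are built by the
patch):

srun --nodes=3 --ntasks=3 --cpus-per-task=12 --kill-on-bad-exit --nodelist=node0,node1,node2 orted ...
srun --nodes=1 --ntasks=1 --cpus-per-task=7 --kill-on-bad-exit --nodelist=node3 orted ...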

The weakness of this patch is that we now have to deal with several sruns,
and I am not sure that cleanup behaves correctly in that case. I plan to test
this case additionally.


2014-02-12 17:42 GMT+07:00 Artem Polyakov <artpo...@gmail.com>:

> Hello
>
> I found that SLURM installations that use the cgroup plugin and
> have TaskAffinity=yes in cgroup.conf have problems with Open MPI: all
> processes on non-launch nodes are bound to a single core. This leads to quite
> poor performance.
> The problem only appears when mpirun is used to start the parallel application
> from a batch script. For example: *mpirun ./mympi*
> When srun with PMI is used instead, affinity is set properly: *srun ./mympi*.
>
> A closer look shows that the reason lies in the way Open MPI uses srun to
> launch the ORTE daemons. Here is an example of the command line:
> *srun* *--nodes=1* *--ntasks=1* --kill-on-bad-exit --nodelist=node02
> *orted* -mca ess slurm -mca orte_ess_jobid 3799121920 -mca orte_ess_vpid
>
> Passing *--nodes=1* *--ntasks=1* to SLURM means that you want to start one
> task, and (with TaskAffinity=yes) it will be bound to one core. orted then
> uses this affinity as the base for all the processes it spawns. If I understand
> correctly, the problem with using srun is that if you say *srun*
> *--nodes=1* *--ntasks=4*, then SLURM will spawn 4 independent orted
> processes bound to different cores, which is not what we really need.
>
> I found that disabling CPU binding works well as a quick hack for the cgroup
> plugin. Since the job runs inside a cgroup that restricts core access, the
> processes spawned by orted are left to the node's scheduler and can run on all
> allocated cores. The command line looks like this:
> srun *--cpu_bind=none* --nodes=1 --ntasks=1 --kill-on-bad-exit
> --nodelist=node02 orted -mca ess slurm -mca orte_ess_jobid 3799121920 -mca
> orte_ess_vpid
>
> This solution will probably not work with the SLURM task/affinity plugin,
> and it may be a bad idea when strong affinity is desirable.
>
> My patch against the stable Open MPI version (1.6.5) is attached to this
> e-mail. I will try to produce a more robust solution, but I need more time and
> would first like to know the opinion of the Open MPI developers.
>
> --
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
>



-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov
diff -Naur openmpi-1.6.5/orte/mca/plm/slurm/plm_slurm_module.c openmpi-1.6.5_new/orte/mca/plm/slurm/plm_slurm_module.c
--- openmpi-1.6.5/orte/mca/plm/slurm/plm_slurm_module.c	2012-04-03 10:30:29.000000000 -0400
+++ openmpi-1.6.5_new/orte/mca/plm/slurm/plm_slurm_module.c	2014-03-07 08:54:12.878010513 -0500
@@ -11,7 +11,7 @@
  *                         All rights reserved.
  * Copyright (c) 2006-2007 Cisco Systems, Inc.  All rights reserved.
  * Copyright (c) 2007      Los Alamos National Security, LLC.  All rights
- *                         reserved. 
+ *                         reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -107,12 +107,26 @@
 /*
  * Local variables
  */
-static pid_t primary_srun_pid = 0;
-static bool primary_pid_set = false;
+static pid_t *primary_srun_pids = NULL;
+static int primary_srun_pids_cnt = 0;
+static bool primary_pids_set = false;
 static orte_jobid_t active_job = ORTE_JOBID_INVALID;
 static bool launching_daemons;
 static bool local_launch_available = false;
 
+struct pml_slurm_node_group_t {
+    char **nodelist;
+    unsigned int nodelist_size;
+    unsigned int cpus_per_node;
+};
+
+static int plm_slurm_group_nodes(orte_job_map_t *map,
+                                 struct pml_slurm_node_group_t **grpout,
+                                 int *grpcnt);
+static void plm_slurm_groups_free(struct pml_slurm_node_group_t **groups,
+                                  int grpcnt);
+
+
 /**
 * Init the module
  */
@@ -152,7 +166,7 @@
     char *tmp;
     char** env = NULL;
     char* var;
-    char *nodelist_flat;
+    char *nodelist_flat, *nodelist_full;
     char **nodelist_argv;
     char *name_string;
     char **custom_strings;
@@ -163,11 +177,14 @@
     orte_jobid_t failed_job;
     bool failed_launch=true;
     bool using_regexp=false;
+    struct pml_slurm_node_group_t *groups = NULL;
+    int grpcnt, grpnum;
+    int vpid_offset = 0;
 
     if (NULL == jdata) {
-	/* just launching debugger daemons */
-	active_job = ORTE_JOBID_INVALID;
-	goto launch_apps;
+        /* just launching debugger daemons */
+        active_job = ORTE_JOBID_INVALID;
+        goto launch_apps;
     }
 
     if (jdata->controls & ORTE_JOB_CONTROL_DEBUGGER_DAEMON) {
@@ -205,7 +222,7 @@
             opal_output(0, "plm_slurm: could not obtain job start time");
             launchstart.tv_sec = 0;
             launchstart.tv_usec = 0;
-        }        
+        }
     }
     
     /* indicate the state of the launch */
@@ -223,7 +240,7 @@
                          ORTE_JOBID_PRINT(jdata->jobid)));
     
     /* set the active jobid */
-     active_job = jdata->jobid;
+    active_job = jdata->jobid;
     
     /* Get the map for this job */
     if (NULL == (map = orte_rmaps.get_job_map(active_job))) {
@@ -233,7 +250,7 @@
     }
     apps = (orte_app_context_t**)jdata->apps->addr;
     nodes = (orte_node_t**)map->nodes->addr;
-        
+
     if (0 == map->num_new_daemons) {
         /* no new daemons required - just launch apps */
         OPAL_OUTPUT_VERBOSE((1, orte_plm_globals.output,
@@ -242,44 +259,12 @@
         goto launch_apps;
     }
 
+    
     /* need integer value for command line parameter */
     asprintf(&jobid_string, "%lu", (unsigned long) jdata->jobid);
-
-    /*
-     * start building argv array
-     */
-    argv = NULL;
-    argc = 0;
-
-    /*
-     * SLURM srun OPTIONS
-     */
-
-    /* add the srun command */
-    opal_argv_append(&argc, &argv, "srun");
-
-    /* Append user defined arguments to srun */
-    if ( NULL != mca_plm_slurm_component.custom_args ) {
-        custom_strings = opal_argv_split(mca_plm_slurm_component.custom_args, ' ');
-        num_args       = opal_argv_count(custom_strings);
-        for (i = 0; i < num_args; ++i) {
-            opal_argv_append(&argc, &argv, custom_strings[i]);
-        }
-        opal_argv_free(custom_strings);
-    }
-
-    asprintf(&tmp, "--nodes=%lu", (unsigned long) map->num_new_daemons);
-    opal_argv_append(&argc, &argv, tmp);
-    free(tmp);
-
-    asprintf(&tmp, "--ntasks=%lu", (unsigned long) map->num_new_daemons);
-    opal_argv_append(&argc, &argv, tmp);
-    free(tmp);
-
-    /* alert us if any orteds die during startup */
-    opal_argv_append(&argc, &argv, "--kill-on-bad-exit");
-
-    /* create nodelist */
+    
+    
+    // prepare full node list
     nodelist_argv = NULL;
 
     for (n=0; n < map->num_nodes; n++ ) {
@@ -300,117 +285,192 @@
         rc = ORTE_ERR_FAILED_TO_START;
         goto cleanup;
     }
-    nodelist_flat = opal_argv_join(nodelist_argv, ',');
+    nodelist_full = opal_argv_join(nodelist_argv, ',');
     opal_argv_free(nodelist_argv);
-    asprintf(&tmp, "--nodelist=%s", nodelist_flat);
-    opal_argv_append(&argc, &argv, tmp);
-    free(tmp);
-
-    OPAL_OUTPUT_VERBOSE((2, orte_plm_globals.output,
-                         "%s plm:slurm: launching on nodes %s",
-                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), nodelist_flat));
-    
+
+
     /*
+     * start building argv array
+     */
+    if( 0 != (rc = plm_slurm_group_nodes(map, &groups, &grpcnt)) ){
+        goto cleanup;
+    }
+
+    argv = NULL;
+    argc = 0;
+    primary_srun_pids = malloc(sizeof(pid_t)*grpcnt);
+    if( primary_srun_pids == NULL ){
+        ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
+        rc = ORTE_ERR_OUT_OF_RESOURCE;
+        goto cleanup;
+    }
+
+    grpnum = 0;
+    vpid_offset = 0;
+    while( grpnum < grpcnt ){
+        /*
+         * SLURM srun OPTIONS
+         */
+
+        /* add the srun command */
+        opal_argv_append(&argc, &argv, "srun");
+
+        /* Append user defined arguments to srun */
+        if ( NULL != mca_plm_slurm_component.custom_args ) {
+            custom_strings = opal_argv_split(mca_plm_slurm_component.custom_args, ' ');
+            num_args       = opal_argv_count(custom_strings);
+            for (i = 0; i < num_args; ++i) {
+                opal_argv_append(&argc, &argv, custom_strings[i]);
+            }
+            opal_argv_free(custom_strings);
+        }
+
+        asprintf(&tmp, "--nodes=%lu", (unsigned long) groups[grpnum].nodelist_size);
+        opal_argv_append(&argc, &argv, tmp);
+        free(tmp);
+
+        asprintf(&tmp, "--ntasks=%lu", (unsigned long) groups[grpnum].nodelist_size);
+        opal_argv_append(&argc, &argv, tmp);
+        free(tmp);
+
+        if( groups[grpnum].cpus_per_node > 1 ){
+            asprintf(&tmp, "--cpus-per-task=%lu", (unsigned long) groups[grpnum].cpus_per_node);
+            opal_argv_append(&argc, &argv, tmp);
+            free(tmp);
+        }
+
+        /* alert us if any orteds die during startup */
+        opal_argv_append(&argc, &argv, "--kill-on-bad-exit");
+
+        /* create nodelist */
+        if (0 == groups[grpnum].nodelist_size ) {
+            orte_show_help("help-plm-slurm.txt", "no-hosts-in-list", true);
+            rc = ORTE_ERR_FAILED_TO_START;
+            goto cleanup;
+        }
+        nodelist_flat = opal_argv_join(groups[grpnum].nodelist, ',');
+        asprintf(&tmp, "--nodelist=%s", nodelist_flat);
+        opal_argv_append(&argc, &argv, tmp);
+        free(tmp);
+
+        OPAL_OUTPUT_VERBOSE((2, orte_plm_globals.output,
+                             "%s plm:slurm: launching on nodes %s",
+                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), nodelist_flat));
+
+        /*
      * ORTED OPTIONS
      */
 
-    /* add the daemon command (as specified by user) */
-    orte_plm_base_setup_orted_cmd(&argc, &argv);
-    
-   /* Add basic orted command line options, including debug flags */
-    orte_plm_base_orted_append_basic_args(&argc, &argv,
-                                          "slurm", &proc_vpid_index,
-                                          false, nodelist_flat);
-    free(nodelist_flat);
+        /* add the daemon command (as specified by user) */
+        orte_plm_base_setup_orted_cmd(&argc, &argv);
 
-    /* tell the new daemons the base of the name list so they can compute
+        /* Add basic orted command line options, including debug flags */
+        orte_plm_base_orted_append_basic_args(&argc, &argv,
+                                              "slurm", &proc_vpid_index,
+                                              false, nodelist_full);
+        free(nodelist_flat);
+
+        /* tell the new daemons the base of the name list so they can compute
      * their own name on the other end
      */
-    rc = orte_util_convert_vpid_to_string(&name_string, map->daemon_vpid_start);
-    if (ORTE_SUCCESS != rc) {
-        opal_output(0, "plm_slurm: unable to get daemon vpid as string");
-        goto cleanup;
-    }
+        rc = orte_util_convert_vpid_to_string(&name_string, map->daemon_vpid_start + vpid_offset);
+        if (ORTE_SUCCESS != rc) {
+            opal_output(0, "plm_slurm: unable to get daemon vpid as string");
+            goto cleanup;
+        }
 
-    free(argv[proc_vpid_index]);
-    argv[proc_vpid_index] = strdup(name_string);
-    free(name_string);
+        free(argv[proc_vpid_index]);
+        argv[proc_vpid_index] = strdup(name_string);
+        free(name_string);
 
-    /* Copy the prefix-directory specified in the
+        /* Copy the prefix-directory specified in the
        corresponding app_context.  If there are multiple,
        different prefix's in the app context, complain (i.e., only
        allow one --prefix option for the entire slurm run -- we
        don't support different --prefix'es for different nodes in
        the SLURM plm) */
-    cur_prefix = NULL;
-    for (n=0; n < jdata->num_apps; n++) {
-        char * app_prefix_dir = apps[n]->prefix_dir;
-         /* Check for already set cur_prefix_dir -- if different,
+        cur_prefix = NULL;
+        for (n=0; n < jdata->num_apps; n++) {
+            char * app_prefix_dir = apps[n]->prefix_dir;
+            /* Check for already set cur_prefix_dir -- if different,
            complain */
-        if (NULL != app_prefix_dir) {
-            if (NULL != cur_prefix &&
-                0 != strcmp (cur_prefix, app_prefix_dir)) {
-                orte_show_help("help-plm-slurm.txt", "multiple-prefixes",
-                               true, cur_prefix, app_prefix_dir);
-                return ORTE_ERR_FATAL;
-            }
+            if (NULL != app_prefix_dir) {
+                if (NULL != cur_prefix &&
+                        0 != strcmp (cur_prefix, app_prefix_dir)) {
+                    orte_show_help("help-plm-slurm.txt", "multiple-prefixes",
+                                   true, cur_prefix, app_prefix_dir);
+                    return ORTE_ERR_FATAL;
+                }
 
-            /* If not yet set, copy it; iff set, then it's the
+                /* If not yet set, copy it; iff set, then it's the
              * same anyway
              */
-            if (NULL == cur_prefix) {
-                cur_prefix = strdup(app_prefix_dir);
-                OPAL_OUTPUT_VERBOSE((1, orte_plm_globals.output,
-                                     "%s plm:slurm: Set prefix:%s",
-                                     ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
-                                     cur_prefix));
+                if (NULL == cur_prefix) {
+                    cur_prefix = strdup(app_prefix_dir);
+                    OPAL_OUTPUT_VERBOSE((1, orte_plm_globals.output,
+                                         "%s plm:slurm: Set prefix:%s",
+                                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
+                                         cur_prefix));
+                }
             }
         }
-    }
 
-    /* setup environment */
-    env = opal_argv_copy(orte_launch_environ);
+        /* setup environment */
+        env = opal_argv_copy(orte_launch_environ);
 
-    /* enable local launch by the orteds */
-    var = mca_base_param_environ_variable("plm", NULL, NULL);
-    opal_setenv(var, "rsh", true, &env);
-    free(var);
-    
-    /* if we can do it, use the regexp to launch the apps - this
+        /* enable local launch by the orteds */
+        var = mca_base_param_environ_variable("plm", NULL, NULL);
+        opal_setenv(var, "rsh", true, &env);
+        free(var);
+
+        /* if we can do it, use the regexp to launch the apps - this
      * requires that the user requested this mode, that we were
      * provided with static ports, and that we only have one
      * app_context
      */
-    if (orte_use_regexp && orte_static_ports && jdata->num_apps < 2) {
-        char *regexp;
-        regexp = orte_regex_encode_maps(jdata);
-        opal_argv_append(&argc, &argv, "--launch");
-        opal_argv_append(&argc, &argv, regexp);
-        free(regexp);
-        using_regexp = true;
-    }
-    
-    if (0 < opal_output_get_verbosity(orte_plm_globals.output)) {
-        param = opal_argv_join(argv, ' ');
-        OPAL_OUTPUT_VERBOSE((1, orte_plm_globals.output,
-                             "%s plm:slurm: final top-level argv:\n\t%s",
-                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
-                             (NULL == param) ? "NULL" : param));
-        if (NULL != param) free(param);
-    }
-    
-    /* exec the daemon(s) */
-    if (ORTE_SUCCESS != (rc = plm_slurm_start_proc(argc, argv, env, cur_prefix))) {
-        ORTE_ERROR_LOG(rc);
-        goto cleanup;
+        if (orte_use_regexp && orte_static_ports && jdata->num_apps < 2) {
+            char *regexp;
+            regexp = orte_regex_encode_maps(jdata);
+            opal_argv_append(&argc, &argv, "--launch");
+            opal_argv_append(&argc, &argv, regexp);
+            free(regexp);
+            using_regexp = true;
+        }
+
+        if (0 < opal_output_get_verbosity(orte_plm_globals.output)) {
+            param = opal_argv_join(argv, ' ');
+            OPAL_OUTPUT_VERBOSE((1, orte_plm_globals.output,
+                                 "%s plm:slurm: final top-level argv:\n\t%s",
+                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
+                                 (NULL == param) ? "NULL" : param));
+            if (NULL != param) free(param);
+        }
+
+        //param = opal_argv_join(argv, ' ');
+        //printf("Final command line: %s\n",param);
+
+        /* exec the daemon(s) */
+        if (ORTE_SUCCESS != (rc = plm_slurm_start_proc(argc, argv, env, cur_prefix))) {
+            ORTE_ERROR_LOG(rc);
+            goto cleanup;
+        }
+
+        vpid_offset += groups[grpnum].nodelist_size;
+        grpnum++;
+        argc = 0;
+        opal_argv_free(argv);
+        argv = NULL;
     }
+
+    primary_pids_set = true;
     
     /* do NOT wait for srun to complete. Srun only completes when the processes
      * it starts - in this case, the orteds - complete. Instead, we'll catch
      * any srun failures and deal with them elsewhere
      */
-    
+    //plm_affinity_free(classes);
     /* wait for daemons to callback */
+
     if (ORTE_SUCCESS != (rc = orte_plm_base_daemon_callback(map->num_new_daemons))) {
         OPAL_OUTPUT_VERBOSE((1, orte_plm_globals.output,
                              "%s plm:slurm: daemon launch failed for job %s on error %s",
@@ -460,12 +520,12 @@
     
     if (orte_timing) {
         if (0 != gettimeofday(&launchstop, NULL)) {
-             opal_output(0, "plm_slurm: could not obtain stop time");
-         } else {
-             opal_output(0, "plm_slurm: total job launch time is %ld usec",
-                         (launchstop.tv_sec - launchstart.tv_sec)*1000000 + 
-                         (launchstop.tv_usec - launchstart.tv_usec));
-         }
+            opal_output(0, "plm_slurm: could not obtain stop time");
+        } else {
+            opal_output(0, "plm_slurm: total job launch time is %ld usec",
+                        (launchstop.tv_sec - launchstart.tv_sec)*1000000 +
+                        (launchstop.tv_usec - launchstart.tv_usec));
+        }
     }
 
     if (ORTE_SUCCESS != rc) {
@@ -474,6 +534,7 @@
     }
 
 cleanup:
+
     if (NULL != argv) {
         opal_argv_free(argv);
     }
@@ -485,6 +546,14 @@
         free(jobid_string);
     }
     
+    if( failed_launch && NULL != primary_srun_pids ){
+        free(primary_srun_pids);
+    }
+
+    if( groups != NULL ){
+        plm_slurm_groups_free(&groups, grpcnt);
+    }
+    
     /* check for failed launch - if so, force terminate */
     if (failed_launch) {
         orte_plm_base_launch_failed(failed_job, -1, ORTE_ERROR_DEFAULT_EXIT_CODE, ORTE_JOB_STATE_FAILED_TO_START);
@@ -514,7 +583,7 @@
      * not wait for a waitpid to fire and tell us it's okay to
      * exit. Instead, we simply trigger an exit for ourselves
      */
-    if (!primary_pid_set) {
+    if (!primary_pids_set) {
         OPAL_OUTPUT_VERBOSE((1, orte_plm_globals.output,
                              "%s plm:slurm: primary daemons complete!",
                              ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
@@ -560,6 +629,7 @@
 
 static void srun_wait_cb(pid_t pid, int status, void* cbdata){
     orte_job_t *jdata;
+    int i;
     
     /* According to the SLURM folks, srun always returns the highest exit
      code of our remote processes. Thus, a non-zero exit status doesn't
@@ -602,18 +672,20 @@
             orte_plm_base_launch_failed(ORTE_PROC_MY_NAME->jobid, -1, status, ORTE_JOB_STATE_ABORTED);
         }
         /* otherwise, check to see if this is the primary pid */
-        if (primary_srun_pid == pid) {
-            /* in this case, we just want to fire the proper trigger so
-             * mpirun can exit
-             */
-            OPAL_OUTPUT_VERBOSE((1, orte_plm_globals.output,
-                                 "%s plm:slurm: primary daemons complete!",
-                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
-            jdata = orte_get_job_data_object(ORTE_PROC_MY_NAME->jobid);
-            jdata->state = ORTE_JOB_STATE_TERMINATED;
-            /* need to set the #terminated value to avoid an incorrect error msg */
-            jdata->num_terminated = jdata->num_procs;
-            orte_trigger_event(&orteds_exit);
+        for(i=0; i<primary_srun_pids_cnt; i++){
+            if (primary_srun_pids[i] == pid) {
+                /* in this case, we just want to fire the proper trigger so
+                 * mpirun can exit
+                 */
+                OPAL_OUTPUT_VERBOSE((1, orte_plm_globals.output,
+                                     "%s plm:slurm: primary daemons complete!",
+                                     ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
+                jdata = orte_get_job_data_object(ORTE_PROC_MY_NAME->jobid);
+                jdata->state = ORTE_JOB_STATE_TERMINATED;
+                /* need to set the #terminated value to avoid an incorrect error msg */
+                jdata->num_terminated = jdata->num_procs;
+                orte_trigger_event(&orteds_exit);
+            }
         }
     }
 }
@@ -692,7 +764,7 @@
          * EXCEPT if the user has requested that we leave sessions attached
          */
         if (0 >= opal_output_get_verbosity(orte_plm_globals.output) &&
-            !orte_debug_daemons_flag && !orte_leave_session_attached) {
+                !orte_debug_daemons_flag && !orte_leave_session_attached) {
             if (fd >= 0) {
                 if (fd != 1) {
                     dup2(fd,1);
@@ -727,9 +799,8 @@
         /* if this is the primary launch - i.e., not a comm_spawn of a
          * child job - then save the pid
          */
-        if (!primary_pid_set) {
-            primary_srun_pid = srun_pid;
-            primary_pid_set = true;
+        if (!primary_pids_set) {
+            primary_srun_pids[primary_srun_pids_cnt++] = srun_pid;
         }
         
         /* setup the waitpid so we can find out if srun succeeds! */
@@ -739,3 +810,136 @@
 
     return ORTE_SUCCESS;
 }
+
+
+static int plm_slurm_group_nodes(orte_job_map_t *map,
+                                 struct pml_slurm_node_group_t **grpout,
+                                 int *grpcnt)
+{
+    int estimsize = sizeof(int)*orte_node_pool->size;
+    int *grp_cpus, *grp_ncount;
+    int group_n = 0, groups_size;
+    orte_node_t **nodes = (orte_node_t**)map->nodes->addr;
+    int n,j;
+    struct pml_slurm_node_group_t *groups;
+    int *cur_grpcnt;
+
+    grp_cpus = malloc(estimsize);
+    grp_ncount = malloc(estimsize);
+    if( grp_cpus == NULL || grp_ncount == NULL ){
+        ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
+        return ORTE_ERR_OUT_OF_RESOURCE;
+    }
+    memset(grp_ncount, 0, estimsize);
+
+    for (n=0; n < map->num_nodes; n++ ) {
+        int found = 0;
+        if (nodes[n]->daemon_launched) {
+            continue;
+        }
+        for(j=0; j<group_n; j++){
+            if( grp_cpus[j] == nodes[n]->cpus ) {
+                found = 1;
+                grp_ncount[j]++;
+                break;
+            }
+        }
+        if( found )
+            continue;
+        grp_cpus[group_n] = nodes[n]->cpus;
+        grp_ncount[group_n] = 1;
+        group_n++;
+    }
+/*
+    {
+        int i;
+        printf("\tclass_cnt = %d, class content is:\n",class_n);
+        for(i=0;i<class_n;i++){
+            printf("\t\tval = %d, count = %d\n",classval[i], classcnt[i]);
+        }
+    }
+*/
+    OPAL_OUTPUT_VERBOSE((1, orte_plm_globals.output,
+                         "%s plm:slurm: found %d node groups",
+                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), group_n));
+
+    groups_size = sizeof(struct pml_slurm_node_group_t)*group_n;
+    *grpout = groups = malloc(groups_size);
+    *grpcnt = group_n;
+    cur_grpcnt = malloc(sizeof(int)*group_n);
+    if( groups == NULL || cur_grpcnt == NULL )
+    {
+        ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
+        return ORTE_ERR_OUT_OF_RESOURCE;
+    }
+    memset(groups, 0, groups_size);
+    memset(cur_grpcnt, 0, sizeof(int)*group_n);
+
+    //printf("\tStart classes fillup\n");
+    for (n=0; n < map->num_nodes; n++ ) {
+        int found = 0;
+        if (nodes[n]->daemon_launched) {
+            //printf("\tSkip %s\n",nodes[n]->name);
+            continue;
+        }
+        for(j=0; j<group_n; j++){
+            //printf("\tCompare classval[%d] = %d with nodes[%d]->cpus = %d\n",
+            //       j, classval[j], n, nodes[n]->cpus);
+            if( grp_cpus[j] == nodes[n]->cpus ) {
+                found = 1;
+                break;
+            }
+        }
+        if( !found ){
+            printf("Something is wrong, class for node %s with %d cpus not found\n",
+                   nodes[n]->name, nodes[n]->cpus);
+            return -1;
+        }
+        //printf("\tprocess %s, class #%d\n",nodes[n]->name, j);
+        if( cur_grpcnt[j] == 0 ){
+            int i, size = sizeof(char*)*(grp_ncount[j] + 1);
+            //printf("\t\tneed to initialize: nodes = %d, cpu per node = %d\n",
+            //       classcnt[j], classval[j]);
+            groups[j].cpus_per_node = grp_cpus[j];
+            groups[j].nodelist_size = grp_ncount[j];
+            // reserve 1 element for NULL
+            groups[j].nodelist = malloc( size );
+            for(i=0; i<(grp_ncount[j] + 1); i++){
+                groups[j].nodelist[i] = NULL;
+            }
+        }
+        groups[j].nodelist[cur_grpcnt[j]] = nodes[n]->name;
+        cur_grpcnt[j]++;
+    }
+/*
+    {
+        int i, j;
+        printf("\tclasses is filled!:\n");
+        for(i=0;i<class_n;i++){
+            printf("\t\tcpus per node = %d, nodes: ", classes[i].cpus_per_node);
+            j = 0;
+            while( classes[i].nodelist[j] != NULL ){
+                printf("%s ", classes[i].nodelist[j] );
+                j++;
+            }
+            printf("\n");
+        }
+    }
+*/
+    // Free temp data structures
+    free(grp_cpus);
+    free(grp_ncount);
+    free(cur_grpcnt);
+    return 0;
+}
+
+static void plm_slurm_groups_free(struct pml_slurm_node_group_t **groups,
+                                  int grpcnt)
+{
+    int i;
+    for(i=0; i < grpcnt; i++){
+        free( (*groups)[i].nodelist );
+    }
+    free(*groups);
+    *groups = NULL;
+}
diff -Naur openmpi-1.6.5/orte/mca/ras/slurm/ras_slurm_module.c openmpi-1.6.5_new/orte/mca/ras/slurm/ras_slurm_module.c
--- openmpi-1.6.5/orte/mca/ras/slurm/ras_slurm_module.c	2012-04-03 10:30:28.000000000 -0400
+++ openmpi-1.6.5_new/orte/mca/ras/slurm/ras_slurm_module.c	2014-03-07 08:25:45.187430530 -0500
@@ -43,8 +43,8 @@
 static int orte_ras_slurm_allocate(opal_list_t *nodes);
 static int orte_ras_slurm_finalize(void);
 
-static int orte_ras_slurm_discover(char *regexp, char* tasks_per_node,
-                                   opal_list_t *nodelist);
+static int orte_ras_slurm_discover(char *regexp, char *tasks_per_node,
+                                   char *cpus_per_node, opal_list_t* nodelist);
 static int orte_ras_slurm_parse_ranges(char *base, char *ranges, char ***nodelist);
 static int orte_ras_slurm_parse_range(char *base, char *range, char ***nodelist);
 
@@ -65,9 +65,10 @@
  */
 static int orte_ras_slurm_allocate(opal_list_t *nodes)
 {
-    int ret, cpus_per_task;
+    int ret;
     char *slurm_node_str, *regexp;
-    char *tasks_per_node, *node_tasks;
+    char *tasks_per_node, *cpus_per_node;
+    char *node_tasks, *node_cpus;
     char * tmp;
     char *slurm_jobid;
     
@@ -89,34 +90,22 @@
     regexp = strdup(slurm_node_str);
     
     tasks_per_node = getenv("SLURM_TASKS_PER_NODE");
-    if (NULL == tasks_per_node) {
+    cpus_per_node = getenv("SLURM_JOB_CPUS_PER_NODE");
+    if (NULL == tasks_per_node || NULL == cpus_per_node ) {
         /* couldn't find any version - abort */
         orte_show_help("help-ras-slurm.txt", "slurm-env-var-not-found", 1,
-                       "SLURM_TASKS_PER_NODE");
+                       "SLURM_TASKS_PER_NODE or SLURM_JOB_CPUS_PER_NODE");
         return ORTE_ERR_NOT_FOUND;
     }
     node_tasks = strdup(tasks_per_node);
+    node_cpus = strdup(cpus_per_node);
 
-    if(NULL == regexp || NULL == node_tasks) {
+    if( NULL == regexp || NULL == node_tasks || NULL == node_cpus ) {
         ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
         return ORTE_ERR_OUT_OF_RESOURCE;
     }
 
-    /* get the number of CPUs per task that the user provided to slurm */
-    tmp = getenv("SLURM_CPUS_PER_TASK");
-    if(NULL != tmp) {
-        cpus_per_task = atoi(tmp);
-        if(0 >= cpus_per_task) {
-            opal_output(0, "ras:slurm:allocate: Got bad value from SLURM_CPUS_PER_TASK. "
-                        "Variable was: %s\n", tmp);
-            ORTE_ERROR_LOG(ORTE_ERROR);
-            return ORTE_ERROR;
-        }
-    } else {
-        cpus_per_task = 1;
-    }
- 
-    ret = orte_ras_slurm_discover(regexp, node_tasks, nodes);
+    ret = orte_ras_slurm_discover(regexp, node_tasks, node_cpus, nodes);
     free(regexp);
     free(node_tasks);
     if (ORTE_SUCCESS != ret) {
@@ -153,6 +142,53 @@
 }
 
 
+static int process_task_or_cpus(char *resource, int *elems, int num_nodes)
+{
+    char *begptr = resource, *endptr;
+    int reps, count;
+    int j = 0, i;
+
+    while (begptr) {
+        count = strtol(begptr, &endptr, 10);
+        if ((endptr[0] == '(') && (endptr[1] == 'x')) {
+            reps = strtol((endptr+2), &endptr, 10);
+            if (endptr[0] == ')') {
+                endptr++;
+            }
+        } else {
+            reps = 1;
+        }
+
+        /**
+     * TBP: it seems like it would be an error to have more slot
+     * descriptions than nodes. Turns out that this valid, and SLURM will
+     * return such a thing. For instance, if I did:
+     * srun -A -N 30 -w odin001
+     * I would get SLURM_NODELIST=odin001 SLURM_TASKS_PER_NODE=4(x30)
+     * That is, I am allocated 30 nodes, but since I only requested
+     * one specific node, that's what is in the nodelist.
+     * I'm not sure this is what users would expect, but I think it is
+     * more of a SLURM issue than a orte issue, since SLURM is OK with it,
+     * I'm ok with it
+     */
+        for (i = 0; i < reps && j < num_nodes; i++) {
+            elems[j++] = count;
+        }
+
+        if (*endptr == ',') {
+            begptr = endptr + 1;
+        } else if (*endptr == '\0' || j >= num_nodes) {
+            break;
+        } else {
+            orte_show_help("help-ras-slurm.txt", "slurm-env-var-bad-value", 1,
+                           resource, "SLURM_TASKS_PER_NODE or SLURM_JOB_CPUS_PER_NODE");
+            ORTE_ERROR_LOG(ORTE_ERR_BAD_PARAM);
+            return ORTE_ERR_BAD_PARAM;
+        }
+    }
+    return 0;
+}
+
 /**
  * Discover the available resources.
  * 
@@ -169,12 +205,13 @@
  *                  the found nodes in
  */
 static int orte_ras_slurm_discover(char *regexp, char *tasks_per_node,
-                                   opal_list_t* nodelist)
+                                   char *cpus_per_node, opal_list_t* nodelist)
 {
     int i, j, len, ret, count, reps, num_nodes;
     char *base, **names = NULL;
     char *begptr, *endptr, *orig;
     int *slots;
+    int *cpus;
     bool found_range = false;
     bool more_to_come = false;
     
@@ -284,57 +321,68 @@
     }
     memset(slots, 0, sizeof(int) * num_nodes);
     
-    orig = begptr = strdup(tasks_per_node);
-    if (NULL == begptr) {
+    orig = strdup(tasks_per_node);
+    if (NULL == orig) {
         ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
         free(slots);
         return ORTE_ERR_OUT_OF_RESOURCE;
     }
-    
-    j = 0;
-    while (begptr) {
-        count = strtol(begptr, &endptr, 10);
-        if ((endptr[0] == '(') && (endptr[1] == 'x')) {
-            reps = strtol((endptr+2), &endptr, 10);
-            if (endptr[0] == ')') {
-                endptr++;
-            }
-        } else {
-            reps = 1;
-        }
 
-        /**
-         * TBP: it seems like it would be an error to have more slot 
-         * descriptions than nodes. Turns out that this valid, and SLURM will
-         * return such a thing. For instance, if I did:
-         * srun -A -N 30 -w odin001
-         * I would get SLURM_NODELIST=odin001 SLURM_TASKS_PER_NODE=4(x30)
-         * That is, I am allocated 30 nodes, but since I only requested
-         * one specific node, that's what is in the nodelist.
-         * I'm not sure this is what users would expect, but I think it is
-         * more of a SLURM issue than a orte issue, since SLURM is OK with it,
-         * I'm ok with it
-         */
-        for (i = 0; i < reps && j < num_nodes; i++) {
-            slots[j++] = count;
-        }
-            
-        if (*endptr == ',') {
-            begptr = endptr + 1;
-        } else if (*endptr == '\0' || j >= num_nodes) {
-            break;
-        } else {
-            orte_show_help("help-ras-slurm.txt", "slurm-env-var-bad-value", 1,
-                           regexp, tasks_per_node, "SLURM_TASKS_PER_NODE");
-            ORTE_ERROR_LOG(ORTE_ERR_BAD_PARAM);
-            free(slots);
-            free(orig);
-            return ORTE_ERR_BAD_PARAM;
+    if( process_task_or_cpus(orig,slots,num_nodes) ){
+        free(orig);
+        free(slots);
+        orte_show_help("help-ras-slurm.txt", "slurm-env-var-bad-value", 1,
+                       slots, "SLURM_TASKS_PER_NODE");
+        ORTE_ERROR_LOG(ORTE_ERR_BAD_PARAM);
+        return ORTE_ERR_BAD_PARAM;
+    }
+    free(orig);
+/*
+    {
+        int i;
+        printf("orte_ras_slurm_discover: slots are processed:\n");
+        for(i=0;i<num_nodes;i++){
+            printf("\t%s, slots = %d\n",names[i],slots[i]);
         }
     }
+*/
+    /* Find the number of cpus per node */
+
+    cpus = malloc(sizeof(int) * num_nodes);
+    if (NULL == cpus) {
+        ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
+        return ORTE_ERR_OUT_OF_RESOURCE;
+    }
+    memset(cpus, 0, sizeof(int) * num_nodes);
 
+    orig = strdup(cpus_per_node);
+    if (NULL == orig) {
+        ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
+        free(slots);
+        return ORTE_ERR_OUT_OF_RESOURCE;
+    }
+
+    if( 0 != (ret = process_task_or_cpus(orig, cpus, num_nodes)) ){
+        free(orig);
+        free(slots);
+        free(cpus);
+        orte_show_help("help-ras-slurm.txt", "slurm-env-var-bad-value", 1,
+                       cpus, "SLURM_JOB_CPUS_PER_NODE");
+        ORTE_ERROR_LOG(ORTE_ERR_BAD_PARAM);
+        return ORTE_ERR_BAD_PARAM;
+        return ret;
+    }
     free(orig);
 
+/*
+    {
+        int i;
+        printf("orte_ras_slurm_discover: cpus are processed:\n");
+        for(i=0;i<num_nodes;i++){
+            printf("\t%s, cpus = %d\n",names[i],cpus[i]);
+        }
+    }
+*/
     /* Convert the argv of node names to a list of node_t's */
 
     for (i = 0; NULL != names && NULL != names[i]; ++i) {
@@ -356,6 +404,7 @@
         node->slots_inuse = 0;
         node->slots_max = 0;
         node->slots = slots[i];
+        node->cpus = cpus[i];
         opal_list_append(nodelist, &node->super);
     }
     free(slots);
diff -Naur openmpi-1.6.5/orte/runtime/orte_globals.h openmpi-1.6.5_new/orte/runtime/orte_globals.h
--- openmpi-1.6.5/orte/runtime/orte_globals.h	2012-09-07 11:48:03.000000000 -0400
+++ openmpi-1.6.5_new/orte/runtime/orte_globals.h	2014-03-07 08:24:51.439655951 -0500
@@ -260,6 +260,9 @@
         specified limit.  For example, if we have two processors, we
         may want to allow up to four processes but no more. */
     orte_std_cntr_t slots_max;
+    /** Number of allocated cpus on the node. Considers cpus per task
+     *  setting of srun and friends */
+    orte_std_cntr_t cpus;
     /* number of physical boards in the node - defaults to 1 */
     uint8_t boards;
     /* number of sockets on each board - defaults to 1 */

Reply via email to