We have some MTT testing running every night, and on ONE of our systems v1.5 has been dead since r25914. The system is

Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

and I'm encountering the problem with Intel (composer_xe_2011_sp1.7.256) compilers. I haven't poked around enough yet to figure out what the problematic characteristic of this configuration is.

In r25914, orte/mca/odls/base/odls_base_open.c, we get

    222     /* get the number of local sockets unless we were given a number */
    223     if (0 == orte_default_num_sockets_per_board) {
    224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
    225     }
    226     /* get the number of local processors */
    227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
    228     /* compute the base number of cores/socket, if not given */
    229     if (0 == orte_default_num_cores_per_socket) {
    230         orte_odls_globals.num_cores_per_socket = orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
    231     }

Well, we execute the branch at line 224, but num_sockets remains 0. This leads to the divide-by-0 at line 230. Digging deeper, the call at line 224 led us to opal/mca/paffinity/hwloc/paffinity_hwloc_module.c (lots of stuff left out):

static int module_get_socket_info(int *num_sockets) {
    hwloc_topology_t *t = &opal_hwloc_topology;
    *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
    return OPAL_SUCCESS;
}

Anyhow, SOCKET is somehow an unknown layer, so num_sockets comes back as 0.
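
For what it's worth, the division at line 230 could probably be guarded along the following lines while the real cause is tracked down. This is only a sketch: the helper name safe_cores_per_socket and the choice to fall back to a single socket are my own assumptions for illustration, not anything that exists in the tree.

```c
/* Hypothetical helper, not part of ORTE/OPAL: compute cores per socket
 * without dividing by zero when no socket objects were detected. */
static int safe_cores_per_socket(int num_processors, int num_sockets)
{
    if (num_sockets <= 0) {
        /* Assumption: if the topology reports no usable socket count,
         * treat the whole node as one socket. */
        num_sockets = 1;
    }
    return num_processors / num_sockets;
}
```

With num_sockets stuck at 0, this would report all processors as belonging to one socket instead of faulting. Whether that is the right fallback for binding purposes is a separate question.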

I can poke around more, but does someone want to advise?
