We run MTT tests every night, and on ONE of our systems v1.5 has been
dead since r25914. The system is
Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007
x86_64 x86_64 x86_64 GNU/Linux
and I'm encountering the problem with Intel (composer_xe_2011_sp1.7.256)
compilers. I haven't poked around enough yet to figure out what the
problematic characteristic of this configuration is.
In r25914, orte/mca/odls/base/odls_base_open.c, we get
222     /* get the number of local sockets unless we were given a number */
223     if (0 == orte_default_num_sockets_per_board) {
224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
225     }
226     /* get the number of local processors */
227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
228     /* compute the base number of cores/socket, if not given */
229     if (0 == orte_default_num_cores_per_socket) {
230         orte_odls_globals.num_cores_per_socket =
                orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
231     }
Well, we execute the branch at line 224, but num_sockets remains 0.
This leads to the divide-by-0 at line 230. Digging deeper, the call at
line 224 led us to opal/mca/paffinity/hwloc/paffinity_hwloc_module.c
(lots of stuff left out):
    static int module_get_socket_info(int *num_sockets) {
        hwloc_topology_t *t = &opal_hwloc_topology;
        *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
        return OPAL_SUCCESS;
    }
Anyhow, SOCKET is somehow an unknown layer in the topology on this
system, so num_sockets comes back as 0.
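Whatever the reason hwloc reports no sockets, the division at line 230
could at least be guarded. Here is a minimal sketch of what I mean. The
helper name cores_per_socket_guarded is hypothetical (nothing like it
exists in the tree), and falling back to a single socket when none is
reported is my assumption, not an agreed fix:

```c
/* Hypothetical helper, not part of Open MPI: compute cores per socket
 * without dividing by zero when socket detection reports no sockets. */
int cores_per_socket_guarded(int num_processors, int num_sockets)
{
    if (num_sockets <= 0) {
        /* Assumption: treat the node as one socket if the count is
         * zero or otherwise unusable (non-positive). */
        num_sockets = 1;
    }
    return num_processors / num_sockets;
}
```

With this, 4 processors and 0 reported sockets would give 4 cores per
socket instead of a crash. That only hides the symptom, though; the
missing SOCKET layer still needs explaining.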
I can poke around more, but does someone want to advise?