I believe this is now fixed with r21582 - let me know if it now works for you.
Sorry for the problem. It was indeed miscounting the number of daemons in the system, though apparently this wasn't causing problems for SLURM and Torque (still investigating why, since it should have). Unfortunately, just changing the index caused shared memory to think everyone was remote, so the fix was a tad more involved - though not particularly difficult.

Ralph

On Wed, Jul 1, 2009 at 2:06 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Hmmm...I'll take a look. It seems to be working for me under Torque and
> SLURM, though I cannot vouch for the tree launch. The problem with letting
> the index start at 0 is that it breaks other things, so I'll have to see
> about fixing the routing schemes, or find some compromise.
>
> Thanks for the heads up.
> Ralph
>
>
> On Wed, Jul 1, 2009 at 1:49 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
>> Ralph,
>>
>> This commit breaks several components in OMPI, mainly the routing schemes
>> and the tree launch. The problematic part is the reduction of the number
>> of declared daemons in the second half of the commit, where you change
>> the lower bound of the for loop from 0 to 1. As a result, the number of
>> daemons was decreased by one (I guess in order to exclude the HNP), which
>> is not something the routing implementations tolerate.
>>
>> Setting the loop boundary back to 0 seems to fix all problems. Please
>> reconsider your patch.
>>
>> george.
>>
>> On Fri, 26 Jun 2009, r...@osl.iu.edu wrote:
>>
>>> Author: rhc
>>> Date: 2009-06-26 18:07:25 EDT (Fri, 26 Jun 2009)
>>> New Revision: 21548
>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/21548
>>>
>>> Log:
>>> Cleanup some indexing bugs so that shared memory can function
>>>
>>> Text files modified:
>>>    trunk/orte/util/nidmap.c | 12 +++++++-----
>>>    1 files changed, 7 insertions(+), 5 deletions(-)
>>>
>>> Modified: trunk/orte/util/nidmap.c
>>> ==============================================================================
>>> --- trunk/orte/util/nidmap.c	(original)
>>> +++ trunk/orte/util/nidmap.c	2009-06-26 18:07:25 EDT (Fri, 26 Jun 2009)
>>> @@ -341,10 +341,10 @@
>>>
>>>      /* pack every nodename individually */
>>>      for (i=1; i < orte_node_pool->size; i++) {
>>> +        if (NULL == (node = (orte_node_t*)opal_pointer_array_get_item(orte_node_pool, i))) {
>>> +            continue;
>>> +        }
>>>          if (!orte_keep_fqdn_hostnames) {
>>> -            if (NULL == (node = (orte_node_t*)opal_pointer_array_get_item(orte_node_pool, i))) {
>>> -                continue;
>>> -            }
>>>              nodename = strdup(node->name);
>>>              if (NULL != (ptr = strchr(nodename, '.'))) {
>>>                  *ptr = '\0';
>>> @@ -553,6 +553,8 @@
>>>          ORTE_ERROR_LOG(rc);
>>>          return rc;
>>>      }
>>> +    /* set the daemon to 0 */
>>> +    node->daemon = 0;
>>>
>>>      /* loop over nodes and unpack the raw nodename */
>>>      for (i=1; i < num_nodes; i++) {
>>> @@ -570,7 +572,7 @@
>>>          }
>>>      }
>>>
>>> -    /* unpack the daemon names */
>>> +    /* unpack the daemon vpids */
>>>      vpids = (orte_vpid_t*)malloc(num_nodes * sizeof(orte_vpid_t));
>>>      n=num_nodes;
>>>      if (ORTE_SUCCESS != (rc = opal_dss.unpack(&buf, vpids, &n, ORTE_VPID))) {
>>> @@ -581,7 +583,7 @@
>>>       * daemons in the system
>>>       */
>>>      num_daemons = 0;
>>> -    for (i=0; i < num_nodes; i++) {
>>> +    for (i=1; i < num_nodes; i++) {
>>>          if (NULL != (ndptr = (orte_nid_t*)opal_pointer_array_get_item(&orte_nidmap, i))) {
>>>              ndptr->daemon = vpids[i];
>>>              if (ORTE_VPID_INVALID != vpids[i]) {
>>
>> "We must accept finite disappointment, but we must never lose infinite
>> hope."
>>   Martin Luther King