I notice that BTLs are not checking the return value from ompi_modex_recv() for
OPAL_ERR_DATA_VALUE_NOT_FOUND (indicating that the peer process didn't put that
modex key). In the BTL context, NOT_FOUND means that the peer process doesn't
have this BTL, so the local process should probably mark that peer as
unreachable in add_procs().
This is on both trunk and the v1.8 branch.
The BTLs listed above are not checking/handling ompi_modex_recv() returning
OPAL_ERR_DATA_VALUE_NOT_FOUND properly. Most of these BTLs do something like
this:
-----
module_add_procs() {
    loop over the peers {
        proc = proc_create(...)
        if (NULL == proc)
            error!
        ....
    }
}

proc_create(...) {
    if (ompi_modex_recv() != OMPI_SUCCESS)
        return NULL;
    ...
}
-----
The fix is to make proc_create() return something a bit more expressive so that
add_procs() can tell the difference between "error!" and "you can't reach this
peer".
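
For illustration, here is a rough, self-contained sketch of that idea in C. It is not the actual usnic change: every sketch_* name and SKETCH_* code below is a hypothetical stand-in (the real constants and ompi_modex_recv() live in the Open MPI tree), the modex lookup is mocked, and a plain bool array stands in for the reachability bitmap that add_procs() fills in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real OMPI/OPAL status codes. */
enum {
    SKETCH_SUCCESS = 0,
    SKETCH_ERROR = -1,
    SKETCH_ERR_UNREACH = -2,
    SKETCH_ERR_DATA_VALUE_NOT_FOUND = -3
};

/* Hypothetical per-peer state; a real BTL proc holds much more. */
typedef struct {
    int peer_rank;
} sketch_proc_t;

/* Mock of the modex lookup (assumption, not the real ompi_modex_recv()):
 * even ranks published the key, odd ranks did not, rank 7 fails hard. */
static int sketch_modex_recv(int peer_rank)
{
    if (7 == peer_rank) {
        return SKETCH_ERROR;
    }
    return (0 == peer_rank % 2) ? SKETCH_SUCCESS
                                : SKETCH_ERR_DATA_VALUE_NOT_FOUND;
}

/* Return a status code instead of NULL/non-NULL so the caller can tell
 * "peer does not have this BTL" apart from a genuine error. */
static int sketch_proc_create(int peer_rank, sketch_proc_t *proc_out)
{
    int rc = sketch_modex_recv(peer_rank);

    if (SKETCH_ERR_DATA_VALUE_NOT_FOUND == rc) {
        return SKETCH_ERR_UNREACH;   /* not an error: just unreachable */
    }
    if (SKETCH_SUCCESS != rc) {
        return rc;                   /* real failure: propagate it */
    }
    proc_out->peer_rank = peer_rank;
    return SKETCH_SUCCESS;
}

/* Mark each peer reachable or not; abort only on a genuine error. */
static int sketch_add_procs(const int *peer_ranks, size_t nprocs,
                            bool *reachable)
{
    assert(NULL != peer_ranks && NULL != reachable);

    for (size_t i = 0; i < nprocs; ++i) {
        sketch_proc_t proc;
        int rc = sketch_proc_create(peer_ranks[i], &proc);

        if (SKETCH_ERR_UNREACH == rc) {
            reachable[i] = false;    /* skip this peer, keep going */
            continue;
        }
        if (SKETCH_SUCCESS != rc) {
            return rc;               /* "error!" */
        }
        reachable[i] = true;
    }
    return SKETCH_SUCCESS;
}
```

With the mock above, calling sketch_add_procs() on ranks {0, 1, 2} succeeds and leaves rank 1 marked unreachable, while a list containing rank 7 makes the whole call fail. An out-parameter plus return code is only one way to do it; the point is just that proc_create() reports more than NULL vs. non-NULL.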
I fixed this in the usnic BTL back in late March, but forgot to bring this to
everyone's attention -- oops. See
https://svn.open-mpi.org/trac/ompi/ticket/4442
--
Jeff Squyres
[email protected]