Folks,
the dynamic/spawn test from the ibm test suite crashes if the openib btl
is detected
(the test can be run on one node with an IB port)
here is what happens :
in mca_btl_openib_proc_create, the macro

    OPAL_MODEX_RECV(rc, &mca_btl_openib_component.super.btl_version,
                    proc, &message, &msg_size);

does not find any information *but*
- rc is OPAL_SUCCESS
- msg_size is not updated (i.e. left uninitialized)
- message is not updated (i.e. left uninitialized)
then, if the uninitialized msg_size happens to hold a non-zero value and
the uninitialized message happens to hold an invalid address, a crash
will occur when message is accessed.
/* i am not debating here the fact that there is no information returned,
i am simply discussing the crash */
a simple workaround is to initialize msg_size to zero.
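to illustrate the workaround, here is a minimal standalone sketch. it does not use the real OPAL headers: mock_modex_recv, MOCK_SUCCESS and have_message are hypothetical names i made up to mimic the behaviour described above (success returned, output arguments left untouched).

```c
#include <stddef.h>

/* Hypothetical stand-in for OPAL_SUCCESS; not the real OPAL definition. */
#define MOCK_SUCCESS 0

/* Hypothetical mock of the reported behaviour: it returns success but
 * writes nothing to its output arguments when no data is found. */
static int mock_modex_recv(void **data, size_t *size)
{
    (void)data;
    (void)size;
    return MOCK_SUCCESS;
}

/* Returns 1 if a usable message was received, 0 otherwise. */
static int have_message(void)
{
    void *message = NULL;  /* defined value even if the callee writes nothing */
    size_t msg_size = 0;   /* zero means "nothing received" */

    int rc = mock_modex_recv(&message, &msg_size);
    if (MOCK_SUCCESS != rc || 0 == msg_size || NULL == message) {
        return 0;          /* no modex data: do not dereference message */
    }
    return 1;
}
```

with both variables initialized, the caller can tell "success but no data" apart from a real payload without reading indeterminate values.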
that being said, is this the correct fix ?
one possible alternate fix is to update the OPAL_MODEX_RECV_STRING macro
like this :
/* from opal/mca/pmix/pmix.h */
#define OPAL_MODEX_RECV_STRING(r, s, p, d, sz) \
    do { \
        opal_value_t *kv; \
        if (OPAL_SUCCESS == ((r) = opal_pmix.get(&(p)->proc_name, \
                                                 (s), &kv))) { \
            if (NULL != kv) { \
                *(d) = kv->data.bo.bytes; \
                *(sz) = kv->data.bo.size; \
                kv->data.bo.bytes = NULL; /* protect the data */ \
                OBJ_RELEASE(kv); \
            } else { \
                *(sz) = 0; \
                (r) = OPAL_ERR_NOT_FOUND; \
            } \
        } \
    } while(0);
/*
 * setting both *(sz) = 0; and (r) = OPAL_ERR_NOT_FOUND; can be seen as
 * redundant: setting either *(sz) *or* (r) would be enough
 */
and another alternative fix is to update the end of the native_get
function like this :
/* from opal/mca/pmix/native/pmix_native.c */
    if (found) {
        return OPAL_SUCCESS;
    }
    *kv = NULL;
    if (OPAL_SUCCESS == rc) {
        if (OPAL_SUCCESS == ret) {
            rc = OPAL_ERR_NOT_FOUND;
        } else {
            rc = ret;
        }
    }
    return rc;
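to make the intended return codes easier to check, the same decision logic can be isolated in a small standalone function. this is only a sketch: resolve_get_status, MOCK_SUCCESS and MOCK_ERR_NOT_FOUND are hypothetical names, and the numeric error value is arbitrary rather than taken from the OPAL headers.

```c
/* Hypothetical stand-ins for the OPAL status codes; values are arbitrary. */
#define MOCK_SUCCESS 0
#define MOCK_ERR_NOT_FOUND (-13)

/* Mirrors the proposed tail of native_get:
 *  - found: success
 *  - not found, rc already an error: keep rc
 *  - not found, rc success, ret an error: propagate ret
 *  - not found, rc and ret both success: report "not found" */
static int resolve_get_status(int found, int rc, int ret)
{
    if (found) {
        return MOCK_SUCCESS;
    }
    if (MOCK_SUCCESS == rc) {
        rc = (MOCK_SUCCESS == ret) ? MOCK_ERR_NOT_FOUND : ret;
    }
    return rc;
}
```

the point is that "nothing found" never comes back as success, so a caller that only checks the return code would not go on to read message or msg_size.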
Could you please advise ?
Cheers,
Gilles