Folks,

the dynamic/spawn test from the ibm test suite crashes if the openib btl is detected (the test can be run on one node with an IB port).
Here is what happens: in mca_btl_openib_proc_create, the macro

    OPAL_MODEX_RECV(rc, &mca_btl_openib_component.super.btl_version,
                    proc, &message, &msg_size);

does not find any information, *but*:

- rc is OPAL_SUCCESS
- msg_size is not updated (i.e. left uninitialized)
- message is not updated (i.e. left uninitialized)

Then, if msg_size happens to hold a non-zero value and message happens to hold an invalid address, a crash occurs when message is accessed.

/* i am not debating here the fact that there is no information returned, i am simply discussing the crash */

A simple workaround is to initialize msg_size to zero.

That being said, is this the correct fix?

One possible alternate fix is to update the OPAL_MODEX_RECV_STRING macro like this:

    /* from opal/mca/pmix/pmix.h */
    #define OPAL_MODEX_RECV_STRING(r, s, p, d, sz)                          \
        do {                                                                \
            opal_value_t *kv;                                               \
            if (OPAL_SUCCESS == ((r) = opal_pmix.get(&(p)->proc_name,       \
                                                     (s), &kv))) {          \
                if (NULL != kv) {                                           \
                    *(d) = kv->data.bo.bytes;                               \
                    *(sz) = kv->data.bo.size;                               \
                    kv->data.bo.bytes = NULL; /* protect the data */        \
                    OBJ_RELEASE(kv);                                        \
                } else {                                                    \
                    *(sz) = 0;                                              \
                    (r) = OPAL_ERR_NOT_FOUND;                               \
                }                                                           \
            }                                                               \
        } while(0);

/* *(sz) = 0; and (r) = OPAL_ERR_NOT_FOUND; can be seen as redundant: setting either *(sz) *or* (r) would be enough */

Another alternate fix is to update the end of the native_get function like this:

    /* from opal/mca/pmix/native/pmix_native.c */
    if (found) {
        return OPAL_SUCCESS;
    }
    *kv = NULL;
    if (OPAL_SUCCESS == rc) {
        if (OPAL_SUCCESS == ret) {
            rc = OPAL_ERR_NOT_FOUND;
        } else {
            rc = ret;
        }
    }
    return rc;

Could you please advise?

Cheers,

Gilles
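
PS: to make the failure mode easier to discuss, here is a minimal standalone sketch. It is only an illustration, not Open MPI code: every name in it (mock_get, mock_modex_recv, mock_proc_create, the MOCK_* codes and the "btl.openib" key) is a hypothetical stand-in. It assumes a lookup that reports success without handing back a value, and shows a receive wrapper that turns that case into "not found" with a zero size, plus a caller that initializes its outputs. Unlike the macro proposed above, the sketch also resets the data pointer to NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the Open MPI definitions. */
enum { MOCK_SUCCESS = 0, MOCK_ERR_NOT_FOUND = -1 };

typedef struct {
    char  *bytes;
    size_t size;
} mock_value_t;

/* Mimics the reported behaviour: returns success but provides no value. */
static int mock_get(const char *key, mock_value_t **kv)
{
    (void)key;
    *kv = NULL;
    return MOCK_SUCCESS;
}

/* Receive wrapper: a missing value resets both outputs and is
 * reported as "not found" instead of success. */
static int mock_modex_recv(const char *key, char **data, size_t *size)
{
    mock_value_t *kv = NULL;
    int rc;

    assert(NULL != data && NULL != size);
    rc = mock_get(key, &kv);
    if (MOCK_SUCCESS != rc) {
        return rc;
    }
    if (NULL == kv) {
        *data = NULL;   /* extra safety, not part of the macro proposal */
        *size = 0;
        return MOCK_ERR_NOT_FOUND;
    }
    *data = kv->bytes;
    *size = kv->size;
    return MOCK_SUCCESS;
}

/* Caller side with the simple workaround: both outputs are initialized,
 * so nothing is read when the lookup fails. Returns the usable size. */
static size_t mock_proc_create(void)
{
    char  *message  = NULL;
    size_t msg_size = 0;
    int rc = mock_modex_recv("btl.openib", &message, &msg_size);

    if (MOCK_SUCCESS != rc || NULL == message) {
        return 0;
    }
    return msg_size;
}
```

With either safeguard in place (initialized outputs in the caller, or a wrapper that never returns success without data), the caller no longer dereferences a garbage pointer. Doing both is redundant but harmless.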