Folks,

the dynamic/spawn test from the ibm test suite crashes if the openib btl
is detected
(the test can be run on a single node with an IB port)

here is what happens :

in mca_btl_openib_proc_create,
the macro
    OPAL_MODEX_RECV(rc, &mca_btl_openib_component.super.btl_version,
                    proc, &message, &msg_size);
does not find any information *but*
- rc is OPAL_SUCCESS
- msg_size is not updated (i.e. left uninitialized)
- message is not updated (i.e. left uninitialized)

then, if the uninitialized msg_size happens to hold a non zero value, and if
the uninitialized message happens to hold an invalid address, a crash
occurs when message is accessed.

/* i am not debating here the fact that there is no information returned,
i am simply discussing the crash */

a simple workaround is to initialize msg_size to zero.
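
to make the workaround concrete, here is a minimal standalone sketch. it does not use the real Open MPI code: wa_modex_recv is a hypothetical stand-in that mimics the behavior described above (it reports success without writing its outputs), and wa_proc_create is a hypothetical caller that initializes both outputs before the call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WA_SUCCESS 0

/* Hypothetical stand-in for the lookup: it reports success but leaves
 * *data and *size untouched, like the case described above. */
static int wa_modex_recv(uint8_t **data, size_t *size)
{
    assert(NULL != data && NULL != size);
    return WA_SUCCESS;
}

/* Hypothetical caller: returns how many bytes it would go on to parse. */
size_t wa_proc_create(void)
{
    uint8_t *message = NULL; /* defensive: never an indeterminate pointer */
    size_t msg_size = 0;     /* the workaround: start from zero */

    int rc = wa_modex_recv(&message, &msg_size);
    if (WA_SUCCESS != rc || 0 == msg_size || NULL == message) {
        return 0;            /* nothing was received, nothing to parse */
    }
    return msg_size;
}
```

with both variables initialized, the "success but nothing written" case is treated as "no data" instead of dereferencing garbage.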

that being said, is this the correct fix ?

one possible alternate fix is to update the OPAL_MODEX_RECV_STRING macro
like this :

/* from opal/mca/pmix/pmix.h */
#define OPAL_MODEX_RECV_STRING(r, s, p, d, sz)                          \
    do {                                                                \
        opal_value_t *kv;                                               \
        if (OPAL_SUCCESS == ((r) = opal_pmix.get(&(p)->proc_name,       \
                                                 (s), &kv))) {          \
            if (NULL != kv) {                                           \
                *(d) = kv->data.bo.bytes;                               \
                *(sz) = kv->data.bo.size;                               \
                kv->data.bo.bytes = NULL; /* protect the data */        \
                OBJ_RELEASE(kv);                                        \
            } else {                                                    \
                *(sz) = 0;                                              \
                (r) = OPAL_ERR_NOT_FOUND;                               \
            }                                                           \
        }                                                               \
    } while(0);

/*
setting both *(sz) = 0; and (r) = OPAL_ERR_NOT_FOUND; can be seen as
redundant, setting either *(sz) *or* (r) would be enough
*/
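
the effect of the extra else branch can be checked outside of Open MPI with a small model. everything below is hypothetical (mx_value_t, mx_get, MX_RECV, mx_lookup and the MX_* codes are stand-ins, not the real opal types or return values); mx_get simply imitates a get that returns success with a NULL value when nothing is found.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum { MX_SUCCESS = 0, MX_ERR_NOT_FOUND = -1, MX_ERR_NO_MEM = -2 };

typedef struct {
    unsigned char *bytes;
    size_t size;
} mx_value_t;

/* Hypothetical get: always "succeeds", but hands back NULL when there
 * is no data, which is the case discussed above. */
static int mx_get(int have_data, mx_value_t **kv)
{
    assert(NULL != kv);
    *kv = NULL;
    if (!have_data) {
        return MX_SUCCESS;
    }
    mx_value_t *v = malloc(sizeof(*v));
    if (NULL == v) {
        return MX_ERR_NO_MEM;
    }
    v->bytes = malloc(3);
    if (NULL == v->bytes) {
        free(v);
        return MX_ERR_NO_MEM;
    }
    memcpy(v->bytes, "abc", 3);
    v->size = 3;
    *kv = v;
    return MX_SUCCESS;
}

/* Model of the patched macro: a NULL value becomes "not found" and the
 * size is forced to zero. */
#define MX_RECV(r, have, d, sz)                                   \
    do {                                                          \
        mx_value_t *kv_ = NULL;                                   \
        if (MX_SUCCESS == ((r) = mx_get((have), &kv_))) {         \
            if (NULL != kv_) {                                    \
                *(d) = kv_->bytes; /* caller now owns the bytes */ \
                *(sz) = kv_->size;                                \
                free(kv_);                                        \
            } else {                                              \
                *(sz) = 0;                                        \
                (r) = MX_ERR_NOT_FOUND;                           \
            }                                                     \
        }                                                         \
    } while (0)

int mx_lookup(int have_data, unsigned char **d, size_t *sz)
{
    int rc;
    MX_RECV(rc, have_data, d, sz);
    return rc;
}
```

with this shape, a caller that only checks rc and a caller that only checks the size both see the missing data.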

and another alternative fix is to update the end of the native_get
function like this :

/* from opal/mca/pmix/native/pmix_native.c */

    if (found) {
        return OPAL_SUCCESS;
    }
    *kv = NULL;
    if (OPAL_SUCCESS == rc) {
        if (OPAL_SUCCESS == ret) {
            rc = OPAL_ERR_NOT_FOUND;
        } else {
            rc = ret;
        }
    }
    return rc;
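
the status handling above can be isolated into a small function to check each combination. this is only a sketch: ng_final_status and the NG_* codes are hypothetical names, and found, rc and ret simply mirror the variables of the snippet (where they come from in native_get is not shown here).

```c
#include <assert.h>
#include <stddef.h>

enum { NG_SUCCESS = 0, NG_ERR_NOT_FOUND = -1, NG_ERR_OTHER = -2 };

/* Mirrors the proposed tail of the function: success only if the key
 * was found, otherwise the first non-success status wins, and
 * "not found" is reported when both statuses are success. */
int ng_final_status(int found, int rc, int ret, void **kv)
{
    assert(NULL != kv);
    if (found) {
        return NG_SUCCESS;
    }
    *kv = NULL;
    if (NG_SUCCESS == rc) {
        rc = (NG_SUCCESS == ret) ? NG_ERR_NOT_FOUND : ret;
    }
    return rc;
}
```

the point of this variant is that the caller can no longer get a success status together with a NULL value.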

Could you please advise ?

Cheers,

Gilles
