Ralph,
i noticed MPI_Comm_spawn is broken on master when running on RHEL7.
for some reason i cannot yet explain, it works just fine on RHEL6 (!)
the issue can be reproduced with the intercomm_create test from the ibm test suite:
    mpirun -np 1 ./dynamic/intercomm_create
i dug a bit and found that OPAL_ERR_DEBUGGER_RELEASE is fired in mpirun,
and the tasks then receive a PMIX_ERR_DEBUGGER_RELEASE notification.
it seems no event handler is registered for it, so the default handler
kills the task.
for the time being, a trivial workaround is not to fire
OPAL_ERR_DEBUGGER_RELEASE in the first place: the patch below compiles out
the body of _send_notification with #if 0.
could you please have a look?
i am not sure whether clients should not be notified at all, or whether
they should register a dummy handler.
fwiw, in _event_hdlr, cd->nondefault is true on RHEL6 but false on RHEL7,
which might indicate a race condition.
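
to make the dummy handler option concrete, here is a toy model in C of the behavior described above. it is only a sketch and does not use the real PMIx API: every toy_* name is made up for illustration, and the "default handler kills the task" rule simply encodes what was observed, not how PMIx is actually implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status code standing in for PMIX_ERR_DEBUGGER_RELEASE. */
enum { TOY_ERR_DEBUGGER_RELEASE = -1 };

typedef void (*toy_handler_fn)(int status);

/* Toy client: at most one registered handler, plus an "alive" flag. */
typedef struct {
    int            code;     /* status code the handler was registered for */
    toy_handler_fn handler;  /* NULL means no handler is registered */
    bool           alive;    /* cleared when the default action fires */
} toy_client_t;

static void toy_client_init(toy_client_t *c)
{
    c->code = 0;
    c->handler = NULL;
    c->alive = true;
}

/* Dummy handler: accept the notification and do nothing. */
static void toy_dummy_handler(int status)
{
    (void)status;
}

static void toy_register(toy_client_t *c, int code, toy_handler_fn fn)
{
    assert(c != NULL && fn != NULL);
    c->code = code;
    c->handler = fn;
}

/* Deliver a notification. Returns true if a registered (non-default)
 * handler consumed it; otherwise the assumed default action applies
 * and the task is marked as killed. */
static bool toy_notify(toy_client_t *c, int status)
{
    if (c->handler != NULL && c->code == status) {
        c->handler(status);
        return true;
    }
    c->alive = false;
    return false;
}
```

in this model, calling toy_register with toy_dummy_handler before toy_notify leaves the client alive, while a notification that arrives before registration marks it dead. that ordering dependence is the kind of behavior a race between registration and notification would produce.
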
Cheers,
Gilles
diff --git a/orte/orted/orted_submit.c b/orte/orted/orted_submit.c
index b9d571c..0de0e79 100644
--- a/orte/orted/orted_submit.c
+++ b/orte/orted/orted_submit.c
@@ -2155,6 +2155,7 @@ static bool mpir_breakpoint_fired = false;
 
 static void _send_notification(void)
 {
+#if 0
     opal_buffer_t buf;
     int status = OPAL_ERR_DEBUGGER_RELEASE;
     orte_grpcomm_signature_t sig;
@@ -2209,6 +2210,7 @@ static void _send_notification(void)
     }
     OBJ_DESTRUCT(&sig);
     OBJ_DESTRUCT(&buf);
+#endif
 }
 
 static void orte_debugger_dump(void)