Ralph,

i noticed MPI_Comm_spawn is broken on master and on RHEL7

for some reason i cannot yet explain, it works just fine on RHEL6 (!)


mpirun -np 1 ./dynamic/intercomm_create

from the ibm test suite can be used to reproduce the issue.



i dug a bit and found that OPAL_ERR_DEBUGGER_RELEASE is fired in mpirun, and the tasks then receive
a PMIX_ERR_DEBUGGER_RELEASE notification. it seems no event handler is registered, so the default handler
kills the task.


for the time being, a trivial workaround is not to fire OPAL_ERR_DEBUGGER_RELEASE in the first place
(see patch below)


could you please have a look ?

i am not sure whether clients should not be notified at all, or whether they should register a dummy handler.

fwiw, in _event_hdlr, cd->nondefault is true on RHEL6, but false on RHEL7, and that might indicate a race condition
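
to make the "dummy handler" option concrete, here is a toy model of the suspected behavior. it is only a sketch, not PMIx or Open MPI code: every toy_* name is made up for illustration, and it assumes the default path treats an unhandled notification as fatal, which is what the symptoms suggest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for PMIX_ERR_DEBUGGER_RELEASE (not the real value). */
#define TOY_ERR_DEBUGGER_RELEASE (-1)

typedef void (*toy_handler_fn)(int status);

/* Toy task: one optional event handler plus a liveness flag. */
typedef struct {
    toy_handler_fn handler;   /* NULL means no handler was registered */
    bool alive;
} toy_task_t;

/* Counts how many notifications the dummy handler swallowed. */
static int toy_noop_calls = 0;

/* Dummy handler: accepts the notification and does nothing else. */
void toy_noop_handler(int status)
{
    (void)status;
    toy_noop_calls++;
}

void toy_task_init(toy_task_t *t)
{
    assert(NULL != t);
    t->handler = NULL;
    t->alive = true;
}

void toy_register(toy_task_t *t, toy_handler_fn fn)
{
    assert(NULL != t);
    t->handler = fn;
}

/* Deliver a notification. With a registered handler the task survives;
 * otherwise the assumed default behavior kills it. */
void toy_notify(toy_task_t *t, int status)
{
    assert(NULL != t);
    if (NULL != t->handler) {
        t->handler(status);
        return;
    }
    t->alive = false;   /* assumed default: unhandled event is fatal */
}
```

in this model a task that calls toy_register() before toy_notify() arrives survives, and one that has not registered yet is killed. if registration and notification can happen in either order, that would fit the RHEL6 vs RHEL7 difference.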


Cheers,


Gilles

diff --git a/orte/orted/orted_submit.c b/orte/orted/orted_submit.c
index b9d571c..0de0e79 100644
--- a/orte/orted/orted_submit.c
+++ b/orte/orted/orted_submit.c
@@ -2155,6 +2155,7 @@ static bool mpir_breakpoint_fired = false;

 static void _send_notification(void)
 {
+#if 0
     opal_buffer_t buf;
     int status = OPAL_ERR_DEBUGGER_RELEASE;
     orte_grpcomm_signature_t sig;
@@ -2209,6 +2210,7 @@ static void _send_notification(void)
     }
     OBJ_DESTRUCT(&sig);
     OBJ_DESTRUCT(&buf);
+#endif
 }

 static void orte_debugger_dump(void)


