I can fix the initialization. What puzzles me is that no debugger_release message should be sent unless a debugger is attached - in which case, the event should be registered.
So why is it being sent? Is it the child job that is receiving it? Or is it the parent?

> On Jul 16, 2016, at 7:19 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
> I found some time to investigate this.
> tscon should initialize nondefault to false in both pmix2x.c and pmix_ext20.c
>
> A better workaround is to update ompi_errhandler_callback, so it does not
> invoke ompi_mpi_abort if status is OPAL_ERR_DEBUGGER_RELEASE
>
> That still seems counterintuitive to me ...
> Does ERR stand for error ? I did not find any error here ...
> Should it be EVT for event instead ? Should ERR not be fired in the first place ?
> Should Open MPI register a handler for this event (so nondefault is true and
> ompi_errhandler_callback is not invoked here) ?
>
> Cheers,
>
> Gilles
>
> On Friday, July 15, 2016, Ralph Castain <r...@open-mpi.org> wrote:
> Okay, I’ll take a look - thanks!
>
>> On Jul 15, 2016, at 7:08 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>
>> Yep,
>>
>> The constructor of pmix2x_threadshift_t (tscon) does not initialize
>> nondefault to false.
>> I won't be able to investigate this until Monday, but so far, my guess is
>> that if the constructor is fixed, then RHEL6 will fail like RHEL7 ...
>>
>> fwiw, the intercomm_create used to fail in Cisco mtt because of too many
>> tasks and no over subscription, now it fails because of this bug.
>>
>> Cheers,
>>
>> Gilles
>>
>> On Friday, July 15, 2016, Ralph Castain <r...@open-mpi.org> wrote:
>> That would break debugger attach. Sounds to me like it’s just an
>> uninitialized variable for in_event_hdlr?
>>
>> > On Jul 15, 2016, at 1:20 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>> >
>> > Ralph,
>> >
>> > i noticed MPI_Comm_spawn is broken on master and on RHEL7
>> >
>> > for some reason i cannot yet explain, it works just fine on RHEL6 (!)
>> >
>> > mpirun -np 1 ./dynamic/intercomm_create
>> >
>> > from the ibm test suite can be used to reproduce the issue.
>> >
>> > i dug a bit and i found OPAL_ERR_DEBUGGER_RELEASE is fired in mpirun,
>> > then the tasks received a PMIX_ERR_DEBUGGER_RELEASE notification.
>> > it seems no event handler is registered, so the default handler
>> > kills the task.
>> >
>> > for the time being, a trivial workaround is not to fire
>> > OPAL_ERR_DEBUGGER_RELEASE in the first place
>> > (see patch below)
>> >
>> > could you please have a look ?
>> >
>> > i am not sure whether clients should not be notified at all, or whether
>> > they should register a dummy handler.
>> >
>> > fwiw, in _event_hdlr, cd->nondefault is true on RHEL6, but false on RHEL7,
>> > and that might indicate a race condition
>> >
>> > Cheers,
>> >
>> > Gilles
>> >
>> > diff --git a/orte/orted/orted_submit.c b/orte/orted/orted_submit.c
>> > index b9d571c..0de0e79 100644
>> > --- a/orte/orted/orted_submit.c
>> > +++ b/orte/orted/orted_submit.c
>> > @@ -2155,6 +2155,7 @@ static bool mpir_breakpoint_fired = false;
>> >
>> >  static void _send_notification(void)
>> >  {
>> > +#if 0
>> >      opal_buffer_t buf;
>> >      int status = OPAL_ERR_DEBUGGER_RELEASE;
>> >      orte_grpcomm_signature_t sig;
>> > @@ -2209,6 +2210,7 @@ static void _send_notification(void)
>> >      }
>> >      OBJ_DESTRUCT(&sig);
>> >      OBJ_DESTRUCT(&buf);
>> > +#endif
>> >  }
>> >
>> >  static void orte_debugger_dump(void)
>> >
>> > _______________________________________________
>> > devel mailing list
>> > de...@open-mpi.org
>> > Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>> > Link to this post:
>> > http://www.open-mpi.org/community/lists/devel/2016/07/19214.php
>>
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2016/07/19215.php
>>
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2016/07/19216.php
>
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/07/19220.php
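
As a minimal sketch of the two fixes discussed in this thread (initializing nondefault in the constructor, and keeping the default handler from aborting on a debugger-release notification), the logic might look like the C below. This is not the Open MPI or PMIx source: every sketch_* name, the struct layout, and the status values are hypothetical stand-ins. Only the field name nondefault and the overall behavior are taken from the messages above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical status codes; the real OPAL/PMIx values are not used here. */
enum {
    SKETCH_SUCCESS = 0,
    SKETCH_ERR_DEBUGGER_RELEASE = -1,
    SKETCH_ERR_OTHER = -2
};

/* Hypothetical stand-in for the threadshift object discussed in the thread. */
typedef struct {
    int status;
    bool nondefault; /* true when a registered (non-default) handler applies */
} sketch_threadshift_t;

/* Constructor: set every field explicitly so nondefault never holds
 * leftover memory contents, which could differ between platforms. */
static void sketch_tscon(sketch_threadshift_t *p)
{
    p->status = SKETCH_SUCCESS;
    p->nondefault = false;
}

/* Default handler policy: a debugger release is treated as an event rather
 * than a failure, so it does not abort. Other non-success codes still do. */
static bool sketch_default_handler_aborts(int status)
{
    return status != SKETCH_SUCCESS && status != SKETCH_ERR_DEBUGGER_RELEASE;
}

/* Dispatch: returns true if the process would abort for this notification. */
static bool sketch_dispatch_aborts(const sketch_threadshift_t *cd)
{
    if (cd->nondefault) {
        return false; /* a registered handler takes over; default not invoked */
    }
    return sketch_default_handler_aborts(cd->status);
}

/* Small self-check that exercises the constructor and the dispatch paths. */
static void sketch_self_check(void)
{
    sketch_threadshift_t cd;

    sketch_tscon(&cd);
    assert(!cd.nondefault);

    cd.status = SKETCH_ERR_DEBUGGER_RELEASE;
    assert(!sketch_dispatch_aborts(&cd)); /* release alone must not abort */

    cd.status = SKETCH_ERR_OTHER;
    assert(sketch_dispatch_aborts(&cd));  /* other errors still abort */
}
```

Calling sketch_self_check() from a test driver exercises both cases. Whether clients should instead register their own handler for the event is the open question in the thread and is not settled by this sketch.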