I finally got it :-) In _send_notification() from orted_submit.c, the info key is OPAL_PMIX_EVENT_NON_DEFAULT, but pmix2x.c and pmix_ext20.c test for PMIX_EVENT_NON_DEFAULT, so the flag is never picked up on the receiving side. If I use OPAL_PMIX_EVENT_NON_DEFAULT in pmix*, that fixes the issue.
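To illustrate the mismatch, here is a minimal standalone sketch, not the actual Open MPI code. The demo_* names and both key strings are made up for this example (I am not quoting the real values of the two macros), and the lookup is only a simplified stand-in for what the pmix components do. It shows that when the sender stores the flag under one key and the receiver looks for another, the receiver ends up with nondefault == false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical key strings, for illustration only: these are NOT the real
 * values of OPAL_PMIX_EVENT_NON_DEFAULT and PMIX_EVENT_NON_DEFAULT. */
#define DEMO_OPAL_EVENT_NON_DEFAULT "demo.opal.evnondef"
#define DEMO_PMIX_EVENT_NON_DEFAULT "demo.pmix.evnondef"

/* Simplified stand-in for a key/value info entry attached to a notification. */
typedef struct {
    const char *key;
    bool flag;
} demo_info_t;

/* Return the flag stored under wanted_key, or false if the key is absent.
 * The result is initialized to false, so a missing key never yields garbage. */
static bool demo_lookup_nondefault(const demo_info_t *info, size_t ninfo,
                                   const char *wanted_key)
{
    bool nondefault = false;

    assert(info != NULL || ninfo == 0);
    for (size_t i = 0; i < ninfo; i++) {
        if (0 == strcmp(info[i].key, wanted_key)) {
            nondefault = info[i].flag;
            break;
        }
    }
    return nondefault;
}

/* The sender stores the flag under the OPAL-style key; the receiver then
 * looks it up with receiver_key. */
static bool demo_receiver_sees_nondefault(const char *receiver_key)
{
    const demo_info_t info[] = {
        { DEMO_OPAL_EVENT_NON_DEFAULT, true },
    };
    return demo_lookup_nondefault(info, 1, receiver_key);
}
```

Calling demo_receiver_sees_nondefault() with the PMIx-style key returns false, which in the real code would correspond to falling through to the default handler; calling it with the OPAL-style key returns true.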
Cheers,

Gilles

On Sunday, July 17, 2016, Ralph Castain <r...@open-mpi.org> wrote:

> Okay, I’ll investigate why that is happening - thanks!
>
> On Jul 16, 2016, at 7:45 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
> The parent job (e.g. the task that calls MPI_Comm_spawn) receives it.
> I cannot tell whether the child (e.g. the spawned task) receives it too or not
>
> Cheers,
>
> Gilles
>
> On Saturday, July 16, 2016, Ralph Castain <r...@open-mpi.org> wrote:
>
>> I can fix the initialization. What puzzles me is that no debugger_release
>> message should be sent unless a debugger is attached - in which case, the
>> event should be registered.
>>
>> So why is it being sent? Is it the child job that is receiving it? Or is
>> it the parent?
>>
>>
>> On Jul 16, 2016, at 7:19 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>
>> I found some time to investigate this.
>> tscon should initialize nondefault to false in both pmix2x.c and
>> pmix_ext20.c
>>
>> A better workaround is to update ompi_errhandler_callback, so it does not
>> invoke ompi_mpi_abort if status is OPAL_ERR_DEBUGGER_RELEASE
>>
>> That still seems counterintuitive to me ...
>> Does ERR stand for error ? I did not find any error here ...
>> Should it be EVT for event instead ? Should ERR not be fired in the first
>> place ?
>> Should Open MPI register a handler for this event (so nondefault is true
>> and ompi_errhandler_callback is not invoked here) ?
>>
>> Cheers,
>>
>> Gilles
>>
>> On Friday, July 15, 2016, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Okay, I’ll take a look - thanks!
>>>
>>> On Jul 15, 2016, at 7:08 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>
>>>
>>> Yep,
>>>
>>> The constructor of pmix2x_threadshift_t (tscon) does not initialize
>>> nondefault to false.
>>> I won't be able to investigate this until Monday, but so far, my guess
>>> is that if the constructor is fixed, then RHEL6 will fail like RHEL7 ...
>>>
>>> fwiw, the intercomm_create used to fail in Cisco mtt because of too many
>>> tasks and no over subscription, now it fails because of this bug.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Friday, July 15, 2016, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> That would break debugger attach. Sounds to me like it’s just an
>>>> uninitialized variable for in_event_hdlr?
>>>>
>>>> > On Jul 15, 2016, at 1:20 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>> >
>>>> > Ralph,
>>>> >
>>>> > i noticed MPI_Comm_spawn is broken on master and on RHEL7
>>>> >
>>>> > for some reason i cannot yet explain, it works just fine on RHEL6 (!)
>>>> >
>>>> >
>>>> > mpirun -np 1 ./dynamic/intercomm_create
>>>> >
>>>> > from the ibm test suite can be used to reproduce the issue.
>>>> >
>>>> >
>>>> >
>>>> > i dug a bit and i found OPAL_ERR_DEBUGGER_RELEASE is fired in mpirun, then the tasks received
>>>> >
>>>> > a PMIX_ERR_DEBUGGER_RELEASE notification. it seems no event handler is registered, so the default handler
>>>> >
>>>> > kills the task.
>>>> >
>>>> >
>>>> > for the time being, a trivial workaround is not to fire OPAL_ERR_DEBUGGER_RELEASE in the first place
>>>> >
>>>> > (see patch below)
>>>> >
>>>> >
>>>> > could you please have a look ?
>>>> >
>>>> > i am not sure whether client should not be notified at all, or whether they should register a dummy handler.
>>>> >
>>>> > fwiw, in _event_hdlr, cd->nondefault is true on RHEL6, but false on RHEL7, and that might indicate a race condition
>>>> >
>>>> >
>>>> > Cheers,
>>>> >
>>>> >
>>>> > Gilles
>>>> >
>>>> > diff --git a/orte/orted/orted_submit.c b/orte/orted/orted_submit.c
>>>> > index b9d571c..0de0e79 100644
>>>> > --- a/orte/orted/orted_submit.c
>>>> > +++ b/orte/orted/orted_submit.c
>>>> > @@ -2155,6 +2155,7 @@ static bool mpir_breakpoint_fired = false;
>>>> >
>>>> >  static void _send_notification(void)
>>>> >  {
>>>> > +#if 0
>>>> >      opal_buffer_t buf;
>>>> >      int status = OPAL_ERR_DEBUGGER_RELEASE;
>>>> >      orte_grpcomm_signature_t sig;
>>>> > @@ -2209,6 +2210,7 @@ static void _send_notification(void)
>>>> >      }
>>>> >      OBJ_DESTRUCT(&sig);
>>>> >      OBJ_DESTRUCT(&buf);
>>>> > +#endif
>>>> >  }
>>>> >
>>>> >  static void orte_debugger_dump(void)
>>>> >
>>>> >
>>>> >
>>>> > _______________________________________________
>>>> > devel mailing list
>>>> > de...@open-mpi.org
>>>> > Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> > Link to this post: http://www.open-mpi.org/community/lists/devel/2016/07/19214.php
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/07/19215.php
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/07/19216.php
>>>
>>>
>>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/07/19220.php
>>
>>
>> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/07/19222.php