I found some time to investigate this.
tscon should initialize nondefault to false in both pmix2x.c and
pmix_ext20.c
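
For illustration only, here is a minimal sketch of what I mean by the constructor fix. The struct and function names are hypothetical stand-ins, not the real pmix2x_threadshift_t layout (which has more fields), and the wiring of the constructor into the OPAL class system is left out:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for pmix2x_threadshift_t. */
typedef struct {
    int status;       /* status carried by the notification */
    bool nondefault;  /* true only when a non-default handler is registered */
    void *cbdata;     /* opaque callback data */
} threadshift_sketch_t;

/* Sketch of the constructor: every field gets a defined value, so
 * nondefault can no longer hold whatever happened to be in memory. */
static void tscon_sketch(threadshift_sketch_t *p)
{
    p->status = 0;
    p->nondefault = false;
    p->cbdata = NULL;
}
```

With something like this in place, nondefault no longer depends on leftover memory contents, which would be consistent with RHEL6 and RHEL7 behaving differently today.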

A better workaround is to update ompi_errhandler_callback, so it does not
invoke ompi_mpi_abort if status is OPAL_ERR_DEBUGGER_RELEASE
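
A rough sketch of that workaround, under stated assumptions: the callback signature below is simplified and hypothetical (it is not the real ompi_errhandler_callback prototype), the constant is a placeholder rather than the real OPAL value, and abort_fn merely stands in for ompi_mpi_abort:

```c
#include <stdbool.h>

/* Placeholder value, NOT the real OPAL_ERR_DEBUGGER_RELEASE constant. */
#define SKETCH_ERR_DEBUGGER_RELEASE (-1000)

/* Returns true when the default handler should abort the job. */
static bool sketch_should_abort(int status)
{
    /* A debugger release is a notification, not a failure: ignore it. */
    if (SKETCH_ERR_DEBUGGER_RELEASE == status) {
        return false;
    }
    return true;
}

/* Hypothetical default callback; abort_fn stands in for ompi_mpi_abort. */
static void sketch_errhandler_callback(int status, void (*abort_fn)(int))
{
    if (!sketch_should_abort(status)) {
        return;
    }
    abort_fn(status);
}

/* Counting stub, used only to observe whether the abort path was taken. */
static int sketch_abort_calls = 0;

static void sketch_count_abort(int status)
{
    (void)status;
    sketch_abort_calls++;
}
```

Calling sketch_errhandler_callback(SKETCH_ERR_DEBUGGER_RELEASE, sketch_count_abort) leaves the counter untouched, while any other status increments it.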

That still seems counterintuitive to me ...
Does ERR stand for error? I did not find any error here ...
Should it be EVT for event instead? Should ERR not be fired in the first
place?
Should Open MPI register a handler for this event (so nondefault is true
and ompi_errhandler_callback is not invoked here)?

Cheers,

Gilles

On Friday, July 15, 2016, Ralph Castain <r...@open-mpi.org> wrote:

> Okay, I’ll take a look - thanks!
>
> On Jul 15, 2016, at 7:08 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
>
> Yep,
>
> The constructor of pmix2x_threadshift_t (tscon) does not initialize
> nondefault to false.
> I won't be able to investigate this until Monday, but so far, my guess is
> that if the constructor is fixed, then RHEL6 will fail like RHEL7 ...
>
> fwiw, the intercomm_create test used to fail in Cisco mtt because of too many
> tasks and no oversubscription; now it fails because of this bug.
>
> Cheers,
>
> Gilles
>
> On Friday, July 15, 2016, Ralph Castain <r...@open-mpi.org> wrote:
>
>> That would break debugger attach. Sounds to me like it’s just an
>> uninitialized variable for in_event_hdlr?
>>
>> > On Jul 15, 2016, at 1:20 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>> >
>> > Ralph,
>> >
>> > i noticed MPI_Comm_spawn is broken on master and on RHEL7
>> >
>> > for some reason i cannot yet explain, it works just fine on RHEL6 (!)
>> >
>> >
>> > mpirun -np 1 ./dynamic/intercomm_create
>> >
>> > from the ibm test suite can be used to reproduce the issue.
>> >
>> >
>> >
>> > i dug a bit and found that OPAL_ERR_DEBUGGER_RELEASE is fired in mpirun,
>> > then the tasks received a PMIX_ERR_DEBUGGER_RELEASE notification. it seems
>> > no event handler is registered, so the default handler kills the task.
>> >
>> >
>> > for the time being, a trivial workaround is not to fire
>> > OPAL_ERR_DEBUGGER_RELEASE in the first place (see patch below)
>> >
>> >
>> > could you please have a look ?
>> >
>> > i am not sure whether clients should not be notified at all, or whether
>> > they should register a dummy handler.
>> >
>> > fwiw, in _event_hdlr, cd->nondefault is true on RHEL6, but false on
>> > RHEL7, and that might indicate a race condition
>> >
>> >
>> > Cheers,
>> >
>> >
>> > Gilles
>> >
>> > diff --git a/orte/orted/orted_submit.c b/orte/orted/orted_submit.c
>> > index b9d571c..0de0e79 100644
>> > --- a/orte/orted/orted_submit.c
>> > +++ b/orte/orted/orted_submit.c
>> > @@ -2155,6 +2155,7 @@ static bool mpir_breakpoint_fired = false;
>> >
>> > static void _send_notification(void)
>> > {
>> > +#if 0
>> >     opal_buffer_t buf;
>> >     int status = OPAL_ERR_DEBUGGER_RELEASE;
>> >     orte_grpcomm_signature_t sig;
>> > @@ -2209,6 +2210,7 @@ static void _send_notification(void)
>> >     }
>> >     OBJ_DESTRUCT(&sig);
>> >     OBJ_DESTRUCT(&buf);
>> > +#endif
>> > }
>> >
>> > static void orte_debugger_dump(void)
>> >
>> >
>> >
>> > _______________________________________________
>> > devel mailing list
>> > de...@open-mpi.org
>> > Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>> > Link to this post:
>> > http://www.open-mpi.org/community/lists/devel/2016/07/19214.php
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2016/07/19215.php
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/07/19216.php
>
>
>
