I finally got it :-)

In _send_notification() from orted_submit.c, the info key is
OPAL_PMIX_EVENT_NON_DEFAULT, but pmix2x.c and pmix_ext20.c
test for PMIX_EVENT_NON_DEFAULT.
If I test for OPAL_PMIX_EVENT_NON_DEFAULT in pmix* instead, that fixes the issue.
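
For illustration, here is a minimal sketch in C of why the mismatch matters. It is not the Open MPI code: demo_info_t, demo_is_nondefault() and both DEMO_* key strings are made-up stand-ins, and I am assuming the receiving side matches info keys by string comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical key strings, for illustration only. They are NOT the real
 * values of OPAL_PMIX_EVENT_NON_DEFAULT or PMIX_EVENT_NON_DEFAULT. */
#define DEMO_OPAL_EVENT_NON_DEFAULT "demo.opal.evnondef"
#define DEMO_PMIX_EVENT_NON_DEFAULT "demo.pmix.evnondef"

/* Minimal stand-in for an info entry: a string key plus a boolean value. */
typedef struct {
    const char *key;
    bool flag;
} demo_info_t;

/* Scan the info array for expected_key and return the flag attached to it.
 * The result starts at false, so a key that never matches leaves the event
 * looking like a "default" one instead of reading an uninitialized value. */
static bool demo_is_nondefault(const demo_info_t *info, size_t ninfo,
                               const char *expected_key)
{
    bool nondefault = false;

    assert(info != NULL || ninfo == 0);
    for (size_t i = 0; i < ninfo; i++) {
        if (0 == strcmp(info[i].key, expected_key)) {
            nondefault = info[i].flag;
        }
    }
    return nondefault;
}
```

With an info array tagged with the DEMO_OPAL_* key, calling demo_is_nondefault() with the DEMO_PMIX_* key returns false, while calling it with the DEMO_OPAL_* key returns true.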

Cheers,

Gilles

On Sunday, July 17, 2016, Ralph Castain <r...@open-mpi.org> wrote:

> Okay, I’ll investigate why that is happening - thanks!
>
> On Jul 16, 2016, at 7:45 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
> The parent job (i.e. the task that calls MPI_Comm_spawn) receives it.
> I cannot tell whether the child (i.e. the spawned task) receives it too.
>
> Cheers,
>
> Gilles
>
> On Saturday, July 16, 2016, Ralph Castain <r...@open-mpi.org> wrote:
>
>> I can fix the initialization. What puzzles me is that no debugger_release
>> message should be sent unless a debugger is attached - in which case, the
>> event should be registered.
>>
>> So why is it being sent? Is it the child job that is receiving it? Or is
>> it the parent?
>>
>>
>> On Jul 16, 2016, at 7:19 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@gmail.com> wrote:
>>
>> I found some time to investigate this.
>> tscon should initialize nondefault to false in both pmix2x.c and
>> pmix_ext20.c
>>
>> A better workaround is to update ompi_errhandler_callback, so it does not
>> invoke ompi_mpi_abort if status is OPAL_ERR_DEBUGGER_RELEASE
>>
>> That still seems counterintuitive to me ...
>> Does ERR stand for error? I did not find any error here ...
>> Should it be EVT for event instead? Should ERR not be fired in the first
>> place?
>> Should Open MPI register a handler for this event (so nondefault is true
>> and ompi_errhandler_callback is not invoked here)?
>>
>> Cheers,
>>
>> Gilles
>>
>> On Friday, July 15, 2016, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Okay, I’ll take a look - thanks!
>>>
>>> On Jul 15, 2016, at 7:08 AM, Gilles Gouaillardet <
>>> gilles.gouaillar...@gmail.com> wrote:
>>>
>>>
>>> Yep,
>>>
>>> The constructor of pmix2x_threadshift_t (tscon) does not initialize
>>> nondefault to false.
>>> I won't be able to investigate this until Monday, but so far, my guess
>>> is that if the constructor is fixed, then RHEL6 will fail like RHEL7 ...
>>>
>>> fwiw, the intercomm_create used to fail in Cisco mtt because of too many
>>> tasks and no over subscription, now it fails because of this bug.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Friday, July 15, 2016, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> That would break debugger attach. Sounds to me like it’s just an
>>>> uninitialized variable for in_event_hdlr?
>>>>
>>>> > On Jul 15, 2016, at 1:20 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>> >
>>>> > Ralph,
>>>> >
>>>> > i noticed MPI_Comm_spawn is broken on master and on RHEL7
>>>> >
>>>> > for some reason i cannot yet explain, it works just fine on RHEL6 (!)
>>>> >
>>>> >
>>>> > mpirun -np 1 ./dynamic/intercomm_create
>>>> >
>>>> > from the ibm test suite can be used to reproduce the issue.
>>>> >
>>>> >
>>>> >
>>>> > I dug a bit and found that OPAL_ERR_DEBUGGER_RELEASE is fired in
>>>> > mpirun, and the tasks then receive a PMIX_ERR_DEBUGGER_RELEASE
>>>> > notification. It seems no event handler is registered, so the
>>>> > default handler kills the task.
>>>> >
>>>> >
>>>> > For the time being, a trivial workaround is not to fire
>>>> > OPAL_ERR_DEBUGGER_RELEASE in the first place
>>>> > (see patch below)
>>>> >
>>>> >
>>>> > could you please have a look ?
>>>> >
>>>> > I am not sure whether clients should not be notified at all, or
>>>> > whether they should register a dummy handler.
>>>> >
>>>> > fwiw, in _event_hdlr, cd->nondefault is true on RHEL6 but false on
>>>> > RHEL7, which might indicate a race condition
>>>> >
>>>> >
>>>> > Cheers,
>>>> >
>>>> >
>>>> > Gilles
>>>> >
>>>> > diff --git a/orte/orted/orted_submit.c b/orte/orted/orted_submit.c
>>>> > index b9d571c..0de0e79 100644
>>>> > --- a/orte/orted/orted_submit.c
>>>> > +++ b/orte/orted/orted_submit.c
>>>> > @@ -2155,6 +2155,7 @@ static bool mpir_breakpoint_fired = false;
>>>> >
>>>> > static void _send_notification(void)
>>>> > {
>>>> > +#if 0
>>>> >     opal_buffer_t buf;
>>>> >     int status = OPAL_ERR_DEBUGGER_RELEASE;
>>>> >     orte_grpcomm_signature_t sig;
>>>> > @@ -2209,6 +2210,7 @@ static void _send_notification(void)
>>>> >     }
>>>> >     OBJ_DESTRUCT(&sig);
>>>> >     OBJ_DESTRUCT(&buf);
>>>> > +#endif
>>>> > }
>>>> >
>>>> > static void orte_debugger_dump(void)
>>>> >
>>>> >
>>>> >
>>>> > _______________________________________________
>>>> > devel mailing list
>>>> > de...@open-mpi.org
>>>> > Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> > Link to this post:
>>>> > http://www.open-mpi.org/community/lists/devel/2016/07/19214.php
