The design intent is that each node knows precisely how many daemons are involved in each collective, and who is going to talk to it. The signature contains the info required to ensure the receiver knows which collective a message relates to, and it also happens to let the receiver look up the number of daemons involved (the base function takes care of that for them).
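(As a rough illustration only - hypothetical structures and names, not the actual grpcomm base code - that signature-keyed lookup could look something like this:)

/* Rough sketch, hypothetical names: track each collective by its signature
 * and create the tracker the first time that signature is seen. */
#include <stdlib.h>
#include <string.h>

typedef struct {
    int    *jobids;        /* jobs participating in this collective      */
    size_t  n_jobids;
} coll_sig_t;

typedef struct coll_tracker {
    struct coll_tracker *next;
    coll_sig_t sig;        /* identifies the collective                  */
    size_t ndaemons;       /* recovered from the signature by the "base" */
    size_t nreported;      /* daemon contributions received so far       */
    int    local_done;     /* have our local procs contributed yet?      */
} coll_tracker_t;

static coll_tracker_t *ongoing = NULL;   /* one list of ongoing collectives */

static int sig_equal(const coll_sig_t *a, const coll_sig_t *b)
{
    return a->n_jobids == b->n_jobids &&
           0 == memcmp(a->jobids, b->jobids, a->n_jobids * sizeof(int));
}

/* Return the tracker for this signature, creating it on first sight. */
static coll_tracker_t *get_tracker(const coll_sig_t *sig, size_t ndaemons)
{
    coll_tracker_t *c;

    for (c = ongoing; NULL != c; c = c->next) {
        if (sig_equal(&c->sig, sig)) {
            return c;
        }
    }
    c = calloc(1, sizeof(*c));
    c->sig = *sig;           /* shallow copy, fine for a sketch */
    c->ndaemons = ndaemons;
    c->next = ongoing;
    ongoing = c;
    return c;
}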
So there is no need for a "pending" list - if you receive a message about a collective you don't yet know about, you just put it on the ongoing collective list. You should only receive it if you are going to be involved - i.e., you have local procs that are going to participate. So you wait until your local procs participate, and then pass your collected bucket along. I suspect the link to the local procs isn't being handled correctly, otherwise you wouldn't be hanging. Or the rcd isn't correctly passing incoming messages to the base functions to register the collective. I'll look at it over the weekend and can resolve it then.

On Sep 11, 2014, at 5:23 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:

> Ralph,
>
> you are right, this was definitely not the right fix (at least with 4
> nodes or more).
>
> I finally understood what is going wrong here.
> To keep it simple, the allgather recursive-doubling algo is not
> implemented with MPI_Recv(..., peer, ...)-like functions but with
> MPI_Recv(..., MPI_ANY_SOURCE, ...)-like functions,
> and that makes things slightly more complicated.
> Right now:
> - with two nodes: if node 1 is late, it gets stuck in the allgather
> - with four nodes: if node 0 is first, then nodes 2 and 3, while node 1
> is still late, then node 0 will likely leave the allgather even though
> it did not receive anything from node 1
> - and so on
>
> I think I can fix that now.
>
> Cheers,
>
> Gilles
>
> On 2014/09/11 23:47, Ralph Castain wrote:
>> Yeah, that's not the right fix, I'm afraid. I've made the direct component
>> the default again until I have time to dig into this deeper.
>>
>> On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>
>>> Ralph,
>>>
>>> The root cause is that when the second orted/mpirun runs rcd_finalize_coll,
>>> it does not invoke pmix_server_release,
>>> because allgather_stub was not previously invoked since the fence
>>> had not yet been entered.
>>> /* in rcd_finalize_coll, coll->cbfunc is NULL */
>>>
>>> The attached patch is likely not the right fix, and it was only lightly
>>> tested, but so far it works for me ...
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
>>>> Ralph,
>>>>
>>>> things got worse indeed :-(
>>>>
>>>> Now a simple hello world involving two hosts hangs in mpi_init.
>>>> There is still a race condition: if task a calls the fence long after
>>>> task b, then task b will never leave the fence.
>>>>
>>>> I'll try to debug this ...
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 2014/09/11 2:36, Ralph Castain wrote:
>>>>> I think I now have this fixed - let me know what you see.
>>>>>
>>>>>
>>>>> On Sep 9, 2014, at 6:15 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>>> Yeah, that's not the correct fix. The right way to fix it is for all
>>>>>> three components to have their own RML tag, and for each of them to
>>>>>> establish a persistent receive. They can then use the signature to tell
>>>>>> which collective the incoming message belongs to.
>>>>>>
>>>>>> I'll fix it, but it won't be until tomorrow I'm afraid, as today is shot.
>>>>>>
>>>>>>
>>>>>> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet
>>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>>
>>>>>>> Folks,
>>>>>>>
>>>>>>> Since r32672 (trunk), grpcomm/rcd is the default module.
>>>>>>> The attached spawn.c test program is a trimmed version of the
>>>>>>> spawn_with_env_vars.c test case from the ibm test suite.
>>>>>>>
>>>>>>> When invoked on two nodes:
>>>>>>> - the program hangs with -np 2
>>>>>>> - the program can crash with -np > 2; the error message is
>>>>>>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
>>>>>>> AND TAG -33 - ABORTING
>>>>>>>
>>>>>>> Here is my full command line (from node0):
>>>>>>>
>>>>>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self --mca
>>>>>>> coll ^ml ./spawn
>>>>>>>
>>>>>>> A simple workaround is to add the following extra parameter to the
>>>>>>> mpirun command line:
>>>>>>> --mca grpcomm_rcd_priority 0
>>>>>>>
>>>>>>> My understanding is that the race condition occurs when all the
>>>>>>> processes call MPI_Finalize().
>>>>>>> Internally, the pmix module will have mpirun/orted issue two ALLGATHERs
>>>>>>> involving mpirun and orted
>>>>>>> (one for job 1, aka the parent, and one for job 2, aka the spawned tasks).
>>>>>>> The error message is very explicit: this is not (currently) supported.
>>>>>>>
>>>>>>> I wrote the attached rml.patch, which is really a workaround and not a
>>>>>>> fix:
>>>>>>> in this case, each job will invoke an ALLGATHER but with a different tag
>>>>>>> /* that works for a limited number of jobs only */
>>>>>>>
>>>>>>> I did not commit this patch since this is not a fix; could someone
>>>>>>> (Ralph?) please review the issue and comment?
>>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Gilles
>>>>>>>
>>>>>>> <spawn.c><rml.patch>
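For readers following the recursive-doubling discussion above, the sketch below is a minimal MPI-flavored analogy (hypothetical demo code, not the grpcomm/rcd implementation) of the exchange pattern Gilles describes: each round's receive is posted for one specific peer (rank XOR distance). The failure mode he reports corresponds to posting that receive with MPI_ANY_SOURCE and merely counting arrivals, which lets a rank complete the allgather without ever hearing from a late peer. The sketch assumes a power-of-two number of ranks and one int contribution per rank.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *bucket = calloc((size_t)size, sizeof(int));
    bucket[rank] = rank + 1;               /* this rank's (nonzero) contribution */

    for (int dist = 1; dist < size; dist <<= 1) {
        int peer = rank ^ dist;            /* partner for this round */
        int *recvbuf = calloc((size_t)size, sizeof(int));

        /* The receive names the exact peer; using an any-source receive here
         * is what allows a late rank's data to be missing from the result. */
        MPI_Sendrecv(bucket, size, MPI_INT, peer, 0,
                     recvbuf, size, MPI_INT, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Merge the peer's accumulated bucket into ours. */
        for (int i = 0; i < size; i++) {
            if (recvbuf[i] != 0) bucket[i] = recvbuf[i];
        }
        free(recvbuf);
    }

    if (rank == 0) {
        for (int i = 0; i < size; i++) printf("bucket[%d] = %d\n", i, bucket[i]);
    }
    free(bucket);
    MPI_Finalize();
    return 0;
}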