On Sep 12, 2014, at 5:45 AM, Gilles Gouaillardet 
<gilles.gouaillar...@gmail.com> wrote:

> Ralph,
> 
> On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain <r...@open-mpi.org> wrote:
> The design is supposed to be that each node knows precisely how many daemons 
> are involved in each collective, and who is going to talk to them.
> 
> ok, but the design does not ensure that things will happen in the right
> order:
> - enter the allgather
> - receive data from the daemon at distance 1
> - receive data from the daemon at distance 2
> - and so on
> 
> with the current implementation, when 2 daemons are involved, if a daemon
> enters the allgather after it has already received data from its peer, then
> the MPI processes local to this daemon will hang
> 
> with 4 nodes, it gets trickier:
> 0 enters the allgather and sends a message to 1
> 1 receives the message and sends to 2, but with data from 0 only
> /* 1 did not enter the allgather yet, so its own data cannot be sent to 2 */
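> 
> to make the pattern explicit, here is a minimal sketch of the recursive
> doubling exchange (hypothetical helper names, not the actual rcd code):
> 
> /* minimal sketch of the recursive doubling exchange among daemons;
>  * the helpers are placeholders, this is not the grpcomm/rcd code */
> void rcd_allgather_sketch(int my_id, int num_daemons,
>                           void (*send_bucket_to)(int peer),
>                           void (*wait_bucket_from)(int peer))
> {
>     for (int distance = 1; distance < num_daemons; distance <<= 1) {
>         int peer = my_id ^ distance;   /* partner for this round */
>         send_bucket_to(peer);          /* send everything collected so far */
>         wait_bucket_from(peer);        /* must be this round's data */
>         /* if the peer has not entered the allgather yet, whatever it
>          * forwards is missing its own contribution (the 4-node case above) */
>     }
> }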

It's just a bug in the rcd logic, Gilles. I'll take a look and get it fixed - 
just don't have time right now

> 
> this issue did not occur before the persistent receive :
> no receive was posted if the daemon did not enter the allgather 
> 
> 
> The signature contains the info required to ensure the receiver knows which
> collective this message relates to, and also happens to allow them to
> look up the number of daemons involved (the base function takes care of that
> for them).
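> 
> as a rough illustration of such a signature (hypothetical fields, not the
> actual ORTE structure):
> 
> #include <stdint.h>
> #include <stddef.h>
> 
> /* sketch of a collective signature -- the real structure differs,
>  * this only illustrates the idea */
> typedef struct {
>     uint32_t *jobids;   /* jobs taking part in the collective */
>     size_t    njobids;
>     uint32_t  seq_num;  /* distinguishes successive collectives */
> } coll_signature_sketch_t;
> 
> /* the receiver matches an incoming message to an ongoing collective by
>  * comparing signatures, and the base function can derive the number of
>  * participating daemons from the jobs listed in the signature */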
> 
>  
> ok too, this issue was solved with the persistent receive
> 
> So there is no need for a "pending" list - if you receive a message about a 
> collective you don't yet know about, you just put it on the ongoing 
> collective list. You should only receive it if you are going to be involved - 
> i.e., you have local procs that are going to participate. So you wait until 
> your local procs participate, and then pass your collected bucket along.
> 
> ok, i did something similar
> (e.g. pass all the available data)
> some data might be passed twice, but that might not be an issue
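> 
> roughly along these lines (an illustrative sketch of the idea, with made-up
> type and helper names, not the actual code):
> 
> #include <stdlib.h>
> 
> /* an incoming message for an unknown signature simply creates the
>  * collective and lands on the ongoing list -- no separate pending list */
> typedef struct coll {
>     struct coll *next;
>     int          sig;        /* stands in for the real signature */
>     int          have_local; /* set once local procs have participated */
> } coll_sketch_t;
> 
> static coll_sketch_t *ongoing = NULL;
> 
> static coll_sketch_t *lookup_or_create(int sig)
> {
>     for (coll_sketch_t *c = ongoing; NULL != c; c = c->next) {
>         if (c->sig == sig) return c;
>     }
>     coll_sketch_t *c = calloc(1, sizeof(*c));
>     c->sig = sig;
>     c->next = ongoing;
>     ongoing = c;
>     return c;
> }
> 
> /* the collected bucket is only forwarded once have_local is set */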
>  
> I suspect the link to the local procs isn't being handled correctly, else 
> you wouldn't be hanging. Or the rcd isn't correctly passing incoming messages 
> to the base functions to register the collective.
> 
> I'll look at it over the weekend and can resolve it then.
> 
> 
>  the attached patch is an illustration of what i was trying to explain.
> coll->nreported is used by rcd as a bitmask of the received messages
> (bit 0 is for the local daemon, bit n for the daemon at distance n)
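> 
> as a sketch of how such a bitmask check could look (illustrative names,
> not the exact patch code):
> 
> #include <stdint.h>
> #include <stdbool.h>
> 
> /* bit 0 = the local daemon entered the allgather,
>  * bit n = data received from the daemon at distance n */
> static bool rcd_round_can_proceed(uint32_t nreported, int distance)
> {
>     uint32_t needed = 1u;                 /* bit 0: local contribution */
>     for (int d = 1; d <= distance; d++) {
>         needed |= 1u << d;                /* every peer up to this distance */
>     }
>     return (nreported & needed) == needed;
> }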
> 
> i was still debugging a race condition:
> if daemons 2 and 3 enter the allgather at the same time, they will send a
> message to each other simultaneously and rml fails to establish the
> connection.  i could not figure out whether this is linked to my changes...
> 
> Cheers,
> 
> Gilles
> 
> On Sep 11, 2014, at 5:23 PM, Gilles Gouaillardet 
> <gilles.gouaillar...@iferc.org> wrote:
> 
> > Ralph,
> >
> > you are right, this was definitely not the right fix (at least with 4
> > nodes or more)
> >
> > i finally understood what is going wrong here:
> > to put it simply, the allgather recursive doubling algo is not
> > implemented with MPI_Recv(...,peer,...)-like functions but with
> > MPI_Recv(...,MPI_ANY_SOURCE,...)-like functions,
> > and that makes things slightly more complicated
> > (a sketch contrasting the two receive styles follows the list below).
> > right now:
> > - with two nodes: if node 1 is late, it gets stuck in the allgather
> > - with four nodes: if node 0 is first, then nodes 2 and 3, while node 1
> > is still late, then node 0 will likely leave the allgather even though
> > it did not receive anything from node 1
> > - and so on
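> >
> > to make the analogy concrete (the real code uses the rml, not MPI; this
> > only contrasts the two receive styles):
> >
> > #include <mpi.h>
> >
> > void recv_from_peer(int peer, int *buf)
> > {
> >     /* per-peer receive: only the partner of the current round matches,
> >      * so a message belonging to a later round cannot be consumed early */
> >     MPI_Recv(buf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
> > }
> >
> > void recv_from_anyone(int *buf)
> > {
> >     /* ANY_SOURCE receive: whichever message arrives first is matched,
> >      * even if it belongs to a later round -- hence the extra bookkeeping */
> >     MPI_Recv(buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
> >              MPI_STATUS_IGNORE);
> > }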
> >
> > i think i can fix that now
> >
> > Cheers,
> >
> > Gilles
> >
> > On 2014/09/11 23:47, Ralph Castain wrote:
> >> Yeah, that's not the right fix, I'm afraid. I've made the direct component 
> >> the default again until I have time to dig into this deeper.
> >>
> >> On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet 
> >> <gilles.gouaillar...@iferc.org> wrote:
> >>
> >>> Ralph,
> >>>
> >>> the root cause is that when the second orted/mpirun runs rcd_finalize_coll,
> >>> it does not invoke pmix_server_release,
> >>> because allgather_stub was not previously invoked since the fence
> >>> was not yet entered.
> >>> /* in rcd_finalize_coll, coll->cbfunc is NULL */
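> >>>
> >>> roughly, the symptom is (a simplified sketch, not the actual
> >>> rcd_finalize_coll code):
> >>>
> >>> #include <stddef.h>
> >>>
> >>> typedef void (*release_cbfunc_t)(int status, void *cbdata);
> >>>
> >>> struct coll_state_sketch {
> >>>     release_cbfunc_t cbfunc;  /* set by allgather_stub when the fence is entered */
> >>>     void            *cbdata;
> >>> };
> >>>
> >>> static void finalize_coll_sketch(struct coll_state_sketch *coll, int status)
> >>> {
> >>>     if (NULL != coll->cbfunc) {
> >>>         /* normally this is the pmix release, letting local procs out of
> >>>          * the fence */
> >>>         coll->cbfunc(status, coll->cbdata);
> >>>     }
> >>>     /* if the local fence was not yet entered, allgather_stub never ran,
> >>>      * cbfunc is still NULL, and the local procs are never released */
> >>> }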
> >>>
> >>> the attached patch is likely not the right fix, it was very lightly
> >>> tested, but so far, it works for me ...
> >>>
> >>> Cheers,
> >>>
> >>> Gilles
> >>>
> >>> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
> >>>> Ralph,
> >>>>
> >>>> things got worse indeed :-(
> >>>>
> >>>> now a simple hello world involving two hosts hangs in mpi_init.
> >>>> there is still a race condition: if task a calls fence long after
> >>>> task b,
> >>>> then task b will never leave the fence
> >>>>
> >>>> i ll try to debug this ...
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Gilles
> >>>>
> >>>> On 2014/09/11 2:36, Ralph Castain wrote:
> >>>>> I think I now have this fixed - let me know what you see.
> >>>>>
> >>>>>
> >>>>> On Sep 9, 2014, at 6:15 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>>
> >>>>>> Yeah, that's not the correct fix. The right way to fix it is for all 
> >>>>>> three components to have their own RML tag, and for each of them to 
> >>>>>> establish a persistent receive. They then can use the signature to 
> >>>>>> tell which collective the incoming message belongs to.
> >>>>>>
> >>>>>> I'll fix it, but it won't be until tomorrow I'm afraid as today is 
> >>>>>> shot.
> >>>>>>
> >>>>>>
> >>>>>> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet 
> >>>>>> <gilles.gouaillar...@iferc.org> wrote:
> >>>>>>
> >>>>>>> Folks,
> >>>>>>>
> >>>>>>> Since r32672 (trunk), grpcomm/rcd is the default module.
> >>>>>>> the attached spawn.c test program is a trimmed version of the
> >>>>>>> spawn_with_env_vars.c test case
> >>>>>>> from the ibm test suite.
> >>>>>>>
> >>>>>>> when invoked on two nodes:
> >>>>>>> - the program hangs with -np 2
> >>>>>>> - the program can crash with np > 2
> >>>>>>> the error message is:
> >>>>>>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
> >>>>>>> AND TAG -33 - ABORTING
> >>>>>>>
> >>>>>>> here is my full command line (from node0) :
> >>>>>>>
> >>>>>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self 
> >>>>>>> --mca
> >>>>>>> coll ^ml ./spawn
> >>>>>>>
> >>>>>>> a simple workaround is to add the following extra parameter to the
> >>>>>>> mpirun command line:
> >>>>>>> --mca grpcomm_rcd_priority 0
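> >>>>>>>
> >>>>>>> i.e. the full command line becomes something like:
> >>>>>>>
> >>>>>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self --mca
> >>>>>>> coll ^ml --mca grpcomm_rcd_priority 0 ./spawn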
> >>>>>>>
> >>>>>>> my understanding is that the race condition occurs when all the
> >>>>>>> processes call MPI_Finalize():
> >>>>>>> internally, the pmix module will have mpirun/orted issue two ALLGATHERs
> >>>>>>> involving mpirun and orted
> >>>>>>> (one for job 1, aka the parent, and one for job 2, aka the spawned tasks)
> >>>>>>> the error message is very explicit: this is not (currently) supported
> >>>>>>>
> >>>>>>> i wrote the attached rml.patch, which is really a workaround and not a
> >>>>>>> fix:
> >>>>>>> in this case, each job will invoke an ALLGATHER but with a different
> >>>>>>> tag
> >>>>>>> /* that works for a limited number of jobs only */
> >>>>>>>
> >>>>>>> i did not commit this patch since this is not a fix, could someone
> >>>>>>> (Ralph ?) please review the issue and comment ?
> >>>>>>>
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>>
> >>>>>>> Gilles
> >>>>>>>
> >>>>>>> <spawn.c><rml.patch>
> >>> <rml2.patch>
> >
> 
> 
> <rml3.patch>
