Let me know if Nadia can help here, Ralph.

Josh
On Fri, Sep 12, 2014 at 9:31 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> On Sep 12, 2014, at 5:45 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
> Ralph,
>
> On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> The design is supposed to be that each node knows precisely how many
>> daemons are involved in each collective, and who is going to talk to them.
>
> ok, but the design does not ensure that things will happen in the right order:
> - enter the allgather
> - receive data from the daemon at distance 1
> - receive data from the daemon at distance 2
> - and so on
>
> with the current implementation, when 2 daemons are involved, if a daemon
> enters the allgather after it has received data from its peer, then the MPI
> processes local to this daemon will hang
>
> with 4 nodes, it gets trickier:
> 0 enters the allgather and sends a message to 1
> 1 receives the message and sends to 2, but with data from 0 only
> /* 1 did not enter the allgather, so its data cannot be sent to 2 */
>
> It's just a bug in the rcd logic, Gilles. I'll take a look and get it
> fixed - just don't have time right now
>
> this issue did not occur before the persistent receive:
> no receive was posted if the daemon had not entered the allgather
>
>> The signature contains the info required to ensure the receiver knows
>> which collective this message relates to, and just happens to also allow
>> them to look up the number of daemons involved (the base function takes
>> care of that for them).
>
> ok too, this issue was solved with the persistent receive
>
>> So there is no need for a "pending" list - if you receive a message about
>> a collective you don't yet know about, you just put it on the ongoing
>> collective list. You should only receive it if you are going to be involved
>> - i.e., you have local procs that are going to participate. So you wait
>> until your local procs participate, and then pass your collected bucket
>> along.
>
> ok, i did something similar (i.e. pass all the available data)
> some data might be passed twice, but that might not be an issue
>
>> I suspect the link to the local procs isn't being correctly dealt with,
>> else you couldn't be hanging. Or the rcd isn't correctly passing incoming
>> messages to the base functions to register the collective.
>>
>> I'll look at it over the weekend and can resolve it then.
>
> the attached patch is an illustration of what i was trying to explain.
> coll->nreported is used by rcd as a bitmask of the received messages
> (bit 0 is for the local daemon, bit n for the daemon at distance n)
>
> i am still debugging a race condition:
> if daemons 2 and 3 enter the allgather at the same time, they will send a
> message to each other at the same time and the rml fails to establish the
> connection. i could not figure out whether this is linked to my changes...
>
> Cheers,
>
> Gilles
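A minimal, self-contained C sketch of the recursive-doubling schedule and the
per-distance bitmask bookkeeping described above. All names here are
hypothetical illustrations; this is not the actual grpcomm/rcd code, and the
RML/signature machinery is ignored entirely.

/* Recursive-doubling schedule plus a bitmask that records which
 * contributions have arrived so far (bit 0: the local daemon entered the
 * allgather; bit r+1: the contribution exchanged at round r, i.e. with the
 * partner at distance 2^r). Hypothetical names, for illustration only. */
#include <stdint.h>
#include <stdio.h>

/* partner contacted at a given round: distance doubles each round */
static int rcd_peer(int me, int round)
{
    return me ^ (1 << round);
}

static void mark_reported(uint32_t *reported, int bit)
{
    *reported |= (uint32_t)1 << bit;
}

/* complete once the local daemon and every round's partner have reported */
static int all_reported(uint32_t reported, int nrounds)
{
    uint32_t expected = ((uint32_t)1 << (nrounds + 1)) - 1; /* bits 0..nrounds */
    return reported == expected;
}

int main(void)
{
    int me = 0, nrounds = 2;          /* 4 daemons -> log2(4) = 2 rounds */
    uint32_t reported = 0;

    mark_reported(&reported, 0);      /* bit 0: this daemon entered the allgather */
    for (int r = 0; r < nrounds; r++) {
        int peer = rcd_peer(me, r);   /* partner at distance 2^r */
        printf("round %d: daemon %d exchanges with daemon %d\n", r, me, peer);
        /* the contribution for round r may arrive before we reach this round;
         * recording it in a bitmask lets us accept it whenever it shows up */
        mark_reported(&reported, r + 1);
    }
    printf("allgather complete: %s\n",
           all_reported(reported, nrounds) ? "yes" : "no");
    return 0;
}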
>>
>> On Sep 11, 2014, at 5:23 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>>
>> > Ralph,
>> >
>> > you are right, this was definitely not the right fix (at least with 4
>> > nodes or more)
>> >
>> > i finally understood what is going wrong here:
>> > to make it simple, the allgather recursive doubling algo is not
>> > implemented with MPI_Recv(...,peer,...)-like functions but with
>> > MPI_Recv(...,MPI_ANY_SOURCE,...)-like functions,
>> > and that makes things slightly more complicated.
>> > right now:
>> > - with two nodes: if node 1 is late, it gets stuck in the allgather
>> > - with four nodes: if node 0 is first, then nodes 2 and 3, while node 1
>> > is still late, then node 0 will likely leave the allgather even though
>> > it did not receive anything from node 1
>> > - and so on
>> >
>> > i think i can fix that now
>> >
>> > Cheers,
>> >
>> > Gilles
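Because the exchange is driven by ANY_SOURCE-style receives, a contribution
can reach a daemon before that daemon has entered the collective, which is the
hang described above. The sketch below illustrates the "park early messages on
an ongoing-collective tracker keyed by the signature" idea the thread converges
on. It is a hypothetical, self-contained illustration only; it is not the ORTE
grpcomm code, and every name in it is invented.

/* Park contributions for collectives this daemon has not entered yet,
 * and release the local procs only once it has entered and every expected
 * contribution has arrived. Hypothetical names, illustration only. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_COLL 8

struct coll_tracker {
    int      sig;             /* collective signature (simplified to an int) */
    bool     locally_entered; /* has this daemon entered the allgather?      */
    unsigned reported;        /* bitmask of contributions received so far    */
    unsigned expected;        /* bitmask value that means "complete"         */
    bool     in_use;
};

static struct coll_tracker colls[MAX_COLL];

static struct coll_tracker *lookup_or_create(int sig, unsigned expected)
{
    for (int i = 0; i < MAX_COLL; i++)
        if (colls[i].in_use && colls[i].sig == sig)
            return &colls[i];
    for (int i = 0; i < MAX_COLL; i++)
        if (!colls[i].in_use) {
            colls[i] = (struct coll_tracker){ .sig = sig, .expected = expected,
                                              .in_use = true };
            return &colls[i];
        }
    return NULL;
}

static void maybe_release(struct coll_tracker *c)
{
    /* release the local procs only once we entered AND everyone reported */
    if (c->locally_entered && c->reported == c->expected)
        printf("collective %d complete - releasing local procs\n", c->sig);
}

/* called when a remote contribution arrives, possibly "too early" */
void on_remote_contribution(int sig, unsigned dist_bit, unsigned expected)
{
    struct coll_tracker *c = lookup_or_create(sig, expected);
    if (!c) return;
    c->reported |= dist_bit;  /* park the data on the tracker, do not drop it */
    maybe_release(c);
}

/* called when the local procs finally enter the fence/allgather */
void on_local_entry(int sig, unsigned expected)
{
    struct coll_tracker *c = lookup_or_create(sig, expected);
    if (!c) return;
    c->locally_entered = true;
    c->reported |= 1u;        /* bit 0: local contribution */
    maybe_release(c);
}

int main(void)
{
    /* 2-daemon example: the remote contribution (bit 1) arrives first,
     * then the local procs enter the fence - nothing is lost or dropped. */
    on_remote_contribution(/*sig=*/42, /*dist_bit=*/1u << 1, /*expected=*/0x3);
    on_local_entry(42, 0x3);
    return 0;
}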
>> > On 2014/09/11 23:47, Ralph Castain wrote:
>> >> Yeah, that's not the right fix, I'm afraid. I've made the direct
>> >> component the default again until I have time to dig into this deeper.
>> >>
>> >> On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>> >>
>> >>> Ralph,
>> >>>
>> >>> the root cause is that when the second orted/mpirun runs rcd_finalize_coll,
>> >>> it does not invoke pmix_server_release,
>> >>> because allgather_stub was not previously invoked, since the fence
>> >>> had not yet been entered.
>> >>> /* in rcd_finalize_coll, coll->cbfunc is NULL */
>> >>>
>> >>> the attached patch is likely not the right fix, it was very lightly
>> >>> tested, but so far, it works for me ...
>> >>>
>> >>> Cheers,
>> >>>
>> >>> Gilles
>> >>>
>> >>> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
>> >>>> Ralph,
>> >>>>
>> >>>> things got worse indeed :-(
>> >>>>
>> >>>> now a simple hello world involving two hosts hangs in MPI_Init.
>> >>>> there is still a race condition: if task a calls fence long after task b,
>> >>>> then task b will never leave the fence
>> >>>>
>> >>>> i'll try to debug this ...
>> >>>>
>> >>>> Cheers,
>> >>>>
>> >>>> Gilles
>> >>>>
>> >>>> On 2014/09/11 2:36, Ralph Castain wrote:
>> >>>>> I think I now have this fixed - let me know what you see.
>> >>>>>
>> >>>>> On Sep 9, 2014, at 6:15 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> >>>>>
>> >>>>>> Yeah, that's not the correct fix. The right way to fix it is for
>> >>>>>> all three components to have their own RML tag, and for each of them
>> >>>>>> to establish a persistent receive. They then can use the signature to
>> >>>>>> tell which collective the incoming message belongs to.
>> >>>>>>
>> >>>>>> I'll fix it, but it won't be until tomorrow I'm afraid as today is shot.
>> >>>>>>
>> >>>>>> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>> >>>>>>
>> >>>>>>> Folks,
>> >>>>>>>
>> >>>>>>> since r32672 (trunk), grpcomm/rcd is the default module.
>> >>>>>>> the attached spawn.c test program is a trimmed version of the
>> >>>>>>> spawn_with_env_vars.c test case from the ibm test suite.
>> >>>>>>>
>> >>>>>>> when invoked on two nodes:
>> >>>>>>> - the program hangs with -np 2
>> >>>>>>> - the program can crash with np > 2
>> >>>>>>> the error message is
>> >>>>>>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
>> >>>>>>> AND TAG -33 - ABORTING
>> >>>>>>>
>> >>>>>>> here is my full command line (from node0):
>> >>>>>>>
>> >>>>>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self
>> >>>>>>> --mca coll ^ml ./spawn
>> >>>>>>>
>> >>>>>>> a simple workaround is to add the following extra parameter to the
>> >>>>>>> mpirun command line:
>> >>>>>>> --mca grpcomm_rcd_priority 0
>> >>>>>>>
>> >>>>>>> my understanding is that the race condition occurs when all the
>> >>>>>>> processes call MPI_Finalize():
>> >>>>>>> internally, the pmix module will have mpirun/orted issue two ALLGATHERs
>> >>>>>>> involving mpirun and orted
>> >>>>>>> (one for job 1, aka the parent, and one for job 2, aka the spawned tasks)
>> >>>>>>> the error message is very explicit: this is not (currently) supported
>> >>>>>>>
>> >>>>>>> i wrote the attached rml.patch, which is really a workaround and not a fix:
>> >>>>>>> in this case, each job will invoke an ALLGATHER, but with a different tag
>> >>>>>>> /* that works for a limited number of jobs only */
>> >>>>>>>
>> >>>>>>> i did not commit this patch since this is not a fix; could someone
>> >>>>>>> (Ralph ?) please review the issue and comment ?
>> >>>>>>>
>> >>>>>>> Cheers,
>> >>>>>>>
>> >>>>>>> Gilles
>> >>>>>>>
>> >>>>>>> <spawn.c><rml.patch>
>> >>> <rml2.patch>
> <rml3.patch>
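To make the limitation of the rml.patch workaround concrete, here is a
hypothetical sketch of the "different tag per job" idea and why it "works for
a limited number of jobs only". The names and constants are invented; this is
neither the actual patch nor the ORTE RML API. The fix Ralph describes instead
is a single persistent receive per component plus matching on the collective
signature, not per-job tags.

/* Per-job collective tag workaround: spread jobs over a small pool of tags
 * so the parent job's ALLGATHER and the spawned job's ALLGATHER no longer
 * collide on one tag. Hypothetical names, illustration only. */
#include <stdint.h>
#include <stdio.h>

#define COLL_TAG_BASE  33   /* stand-in for the shared collective tag       */
#define COLL_TAG_SLOTS 8    /* only this many jobs can collect concurrently */

static int coll_tag_for_job(uint32_t jobid)
{
    return COLL_TAG_BASE + (int)(jobid % COLL_TAG_SLOTS);
}

int main(void)
{
    uint32_t parent_job = 1, spawned_job = 2;
    printf("job %u -> tag %d\n", parent_job,  coll_tag_for_job(parent_job));
    printf("job %u -> tag %d\n", spawned_job, coll_tag_for_job(spawned_job));
    /* jobs whose ids differ by a multiple of COLL_TAG_SLOTS would collide
     * again, which is why this only works for a limited number of jobs */
    return 0;
}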