On Sep 12, 2014, at 5:45 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> Ralph,
>
> On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain <r...@open-mpi.org> wrote:
> The design is supposed to be that each node knows precisely how many
> daemons are involved in each collective, and who is going to talk to them.
>
> ok, but the design does not ensure that things will happen in the right
> order :
> - enter the allgather
> - receive data from the daemon at distance 1
> - receive data from the daemon at distance 2
> - and so on
>
> with the current implementation, when 2 daemons are involved, if a daemon
> enters the allgather after it received data from its peer, then the mpi
> processes local to this daemon will hang
>
> with 4 nodes, it gets trickier :
> 0 enters the allgather and sends a message to 1
> 1 receives the message and sends to 2, but with data from 0 only
> /* 1 did not enter the allgather, so its data cannot be sent to 2 */

It's just a bug in the rcd logic, Gilles. I'll take a look and get it fixed - just don't have time right now

> this issue did not occur before the persistent receive :
> no receive was posted if the daemon did not enter the allgather
>
> The signature contains the info required to ensure the receiver knows which
> collective this message relates to, and just happens to also allow them to
> look up the number of daemons involved (the base function takes care of
> that for them).
>
> ok too, this issue was solved with the persistent receive
>
> So there is no need for a "pending" list - if you receive a message about a
> collective you don't yet know about, you just put it on the ongoing
> collective list. You should only receive it if you are going to be
> involved - i.e., you have local procs that are going to participate. So you
> wait until your local procs participate, and then pass your collected
> bucket along.
>
> ok, i did something similar
> (e.g. pass all the available data)
> some data might be passed twice, but that might not be an issue
>
> I suspect the link to the local procs isn't being correctly dealt with,
> else you couldn't be hanging. Or the rcd isn't correctly passing incoming
> messages to the base functions to register the collective.
>
> I'll look at it over the weekend and can resolve it then.
>
> the attached patch is an illustration of what i was trying to explain.
> coll->nreported is used by rcd as a bitmask of the received messages
> (bit 0 is for the local daemon, bit n for the daemon at distance n)
>
> i was still debugging a race condition :
> if daemons 2 and 3 enter the allgather at the same time, they will send a
> message to each other at the same time and rml fails establishing the
> connection. i could not find whether this is linked to my changes...
>
> Cheers,
>
> Gilles
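A minimal, standalone sketch of the bookkeeping described above (this is not the ORTE rcd code; every name below is illustrative, and `reported` only stands in for the coll->nreported bitmask Gilles mentions, under the reading that bit 0 is the local daemon and one bit is used per exchange round): with an any-source style receive, contributions can arrive in any order, so each arrival just sets its bit and completion is detected once all bits are set.

/* Standalone sketch only -- not the ORTE rcd implementation. */
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

/* peer of `rank` in round `round`: the distance doubles every round */
static int rcd_peer(int rank, int round)
{
    return rank ^ (1 << round);
}

/* complete once the local daemon (bit 0) and every round's peer
 * (bits 1..nrounds) have contributed */
static bool rcd_complete(uint32_t reported, int nrounds)
{
    uint32_t all = (1u << (nrounds + 1)) - 1;
    return (reported & all) == all;
}

int main(void)
{
    int rank = 1, nrounds = 2;          /* 4 daemons -> log2(4) = 2 rounds */
    uint32_t reported = 0;

    reported |= 1u << 0;                /* bit 0: the local daemon entered */
    for (int round = 0; round < nrounds; round++) {
        printf("round %d: daemon %d exchanges with daemon %d\n",
               round, rank, rcd_peer(rank, round));
        reported |= 1u << (round + 1);  /* one bit per round's peer */
    }
    printf("allgather complete: %s\n",
           rcd_complete(reported, nrounds) ? "yes" : "no");
    return 0;
}

With such a mask, a daemon that receives a peer's contribution before it has entered the allgather itself can simply park the data and defer forwarding until its own bit is set, which is exactly the ordering problem in the 2- and 4-node scenarios described above.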
> On Sep 11, 2014, at 5:23 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
> > Ralph,
> >
> > you are right, this was definitely not the right fix (at least with 4
> > nodes or more)
> >
> > i finally understood what is going wrong here :
> > to make it simple, the allgather recursive doubling algo is not
> > implemented with MPI_Recv(..., peer, ...)-like functions but with
> > MPI_Recv(..., MPI_ANY_SOURCE, ...)-like functions,
> > and that makes things slightly more complicated :
> > right now :
> > - with two nodes : if node 1 is late, it gets stuck in the allgather
> > - with four nodes : if node 0 is first, then nodes 2 and 3, while node 1
> > is still late, then node 0 will likely leave the allgather even though
> > it did not receive anything from node 1
> > - and so on
> >
> > i think i can fix that from now
> >
> > Cheers,
> >
> > Gilles
> >
> > On 2014/09/11 23:47, Ralph Castain wrote:
> >> Yeah, that's not the right fix, I'm afraid. I've made the direct
> >> component the default again until I have time to dig into this deeper.
> >>
> >> On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet
> >> <gilles.gouaillar...@iferc.org> wrote:
> >>
> >>> Ralph,
> >>>
> >>> the root cause is that when the second orted/mpirun runs
> >>> rcd_finalize_coll, it does not invoke pmix_server_release,
> >>> because allgather_stub was not previously invoked since the fence
> >>> was not yet entered.
> >>> /* in rcd_finalize_coll, coll->cbfunc is NULL */
> >>>
> >>> the attached patch is likely not the right fix, it was very lightly
> >>> tested, but so far, it works for me ...
> >>>
> >>> Cheers,
> >>>
> >>> Gilles
> >>>
> >>> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
> >>>> Ralph,
> >>>>
> >>>> things got worse indeed :-(
> >>>>
> >>>> now a simple hello world involving two hosts hangs in mpi_init.
> >>>> there is still a race condition : if task a calls the fence long
> >>>> after task b, then task b will never leave the fence
> >>>>
> >>>> i'll try to debug this ...
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Gilles
> >>>>
> >>>> On 2014/09/11 2:36, Ralph Castain wrote:
> >>>>> I think I now have this fixed - let me know what you see.
> >>>>>
> >>>>> On Sep 9, 2014, at 6:15 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>>
> >>>>>> Yeah, that's not the correct fix. The right way to fix it is for all
> >>>>>> three components to have their own RML tag, and for each of them to
> >>>>>> establish a persistent receive. They then can use the signature to
> >>>>>> tell which collective the incoming message belongs to.
> >>>>>>
> >>>>>> I'll fix it, but it won't be until tomorrow I'm afraid, as today is
> >>>>>> shot.
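As an illustration of the two points just quoted -- one persistent receive whose incoming messages are matched to a collective by signature, and Gilles's observation above that rcd_finalize_coll cannot call pmix_server_release while coll->cbfunc is still NULL -- here is a standalone sketch under invented names (coll_tracker_t, find_or_create, recv_fragment, and local_fence_entered are not ORTE APIs): fragments that arrive before the local procs enter the fence are parked on the tracker, and the release callback only fires once the local contribution is in and everyone has reported.

/* Standalone sketch only -- all names are illustrative, not ORTE code. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef void (*release_fn_t)(const char *sig);

typedef struct coll_tracker {
    char sig[64];          /* collective signature (e.g. jobids + seq number) */
    int  expected;         /* contributions expected */
    int  reported;         /* contributions received so far */
    release_fn_t cbfunc;   /* NULL until the local fence is entered */
    struct coll_tracker *next;
} coll_tracker_t;

static coll_tracker_t *trackers = NULL;

static coll_tracker_t *find_or_create(const char *sig, int expected)
{
    for (coll_tracker_t *t = trackers; t != NULL; t = t->next)
        if (0 == strcmp(t->sig, sig))
            return t;
    coll_tracker_t *t = calloc(1, sizeof(*t));
    snprintf(t->sig, sizeof(t->sig), "%s", sig);
    t->expected = expected;
    t->next = trackers;
    trackers = t;
    return t;
}

static void maybe_release(coll_tracker_t *t)
{
    /* release only when the local fence was entered AND everyone reported */
    if (NULL != t->cbfunc && t->reported >= t->expected)
        t->cbfunc(t->sig);
}

/* called from the persistent receive for every incoming fragment */
static void recv_fragment(const char *sig, int expected)
{
    coll_tracker_t *t = find_or_create(sig, expected);
    t->reported++;
    maybe_release(t);
}

/* called when the local procs finally enter the fence */
static void local_fence_entered(const char *sig, int expected, release_fn_t cb)
{
    coll_tracker_t *t = find_or_create(sig, expected);
    t->cbfunc = cb;
    t->reported++;          /* the local contribution */
    maybe_release(t);
}

static void release(const char *sig) { printf("released %s\n", sig); }

int main(void)
{
    /* remote fragment arrives before the local procs enter the fence:
     * nothing is released yet, the data just sits on the tracker */
    recv_fragment("job1:seq0", 2);
    /* local fence entered afterwards: now the collective can complete */
    local_fence_entered("job1:seq0", 2, release);
    return 0;
}

The ordering in main() mirrors the hang scenario discussed above: the remote contribution shows up first, nothing is released, and the collective completes only when the local fence is finally entered.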
> >>>>>> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet
> >>>>>> <gilles.gouaillar...@iferc.org> wrote:
> >>>>>>
> >>>>>>> Folks,
> >>>>>>>
> >>>>>>> Since r32672 (trunk), grpcomm/rcd is the default module.
> >>>>>>> the attached spawn.c test program is a trimmed version of the
> >>>>>>> spawn_with_env_vars.c test case from the ibm test suite.
> >>>>>>>
> >>>>>>> when invoked on two nodes :
> >>>>>>> - the program hangs with -np 2
> >>>>>>> - the program can crash with -np > 2
> >>>>>>> the error message is
> >>>>>>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
> >>>>>>> AND TAG -33 - ABORTING
> >>>>>>>
> >>>>>>> here is my full command line (from node0) :
> >>>>>>>
> >>>>>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self
> >>>>>>> --mca coll ^ml ./spawn
> >>>>>>>
> >>>>>>> a simple workaround is to add the following extra parameter to the
> >>>>>>> mpirun command line :
> >>>>>>> --mca grpcomm_rcd_priority 0
> >>>>>>>
> >>>>>>> my understanding is that the race condition occurs when all the
> >>>>>>> processes call MPI_Finalize() :
> >>>>>>> internally, the pmix module will have mpirun/orted issue two
> >>>>>>> ALLGATHERs involving mpirun and orted
> >>>>>>> (one for job 1, aka the parent, and one for job 2, aka the spawned
> >>>>>>> tasks)
> >>>>>>> the error message is very explicit : this is not (currently)
> >>>>>>> supported
> >>>>>>>
> >>>>>>> i wrote the attached rml.patch, which is really a workaround and not
> >>>>>>> a fix :
> >>>>>>> in this case, each job will invoke an ALLGATHER, but with a different
> >>>>>>> tag
> >>>>>>> /* that works for a limited number of jobs only */
> >>>>>>>
> >>>>>>> i did not commit this patch since this is not a fix, could someone
> >>>>>>> (Ralph ?) please review the issue and comment ?
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>>
> >>>>>>> Gilles
> >>>>>>>
> >>>>>>> <spawn.c><rml.patch>
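The spawn.c attachment itself is not reproduced here; purely as a guess at the general shape of such a trimmed test, the following minimal self-spawning MPI program exercises the same path: the parent spawns copies of itself, both sides disconnect, and every process then calls MPI_Finalize(), which is where the two concurrent allgathers described above are issued.

/* Illustrative reproducer sketch -- not the actual spawn.c attachment. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, intercomm;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* parent side: spawn 2 copies of this binary */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        printf("parent %d: spawn done\n", rank);
        MPI_Comm_disconnect(&intercomm);
    } else {
        /* child side */
        printf("child %d: started\n", rank);
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}

Built as ./spawn, it can be launched with the mpirun command line quoted above; adding --mca grpcomm_rcd_priority 0 is the workaround mentioned for the rcd collision.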