Let me know if Nadia can help here, Ralph.

Josh
On Fri, Sep 12, 2014 at 9:31 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> On Sep 12, 2014, at 5:45 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
> Ralph,
>
> On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> The design is supposed to be that each node knows precisely how many
>> daemons are involved in each collective, and who is going to talk to them.
>
> ok, but the design does not ensure that things will happen in the right order:
> - enter the allgather
> - receive data from the daemon at distance 1
> - receive data from the daemon at distance 2
> - and so on
>
> with the current implementation, when 2 daemons are involved, if a daemon
> enters the allgather after it has received data from its peer, then the MPI
> processes local to this daemon will hang
>
> with 4 nodes, it gets trickier:
> 0 enters the allgather and sends a message to 1
> 1 receives the message and sends to 2, but with data from 0 only
> /* 1 did not enter the allgather, so its data cannot be sent to 2 */
>
> It's just a bug in the rcd logic, Gilles. I'll take a look and get it
> fixed - just don't have time right now
>
> this issue did not occur before the persistent receive:
> no receive was posted if the daemon had not entered the allgather
>
>> The signature contains the info required to ensure the receiver knows
>> which collective this message relates to, and just happens to also allow
>> them to look up the number of daemons involved (the base function takes
>> care of that for them).
>
> ok too, this issue was solved with the persistent receive
>
>> So there is no need for a "pending" list - if you receive a message about
>> a collective you don't yet know about, you just put it on the ongoing
>> collective list. You should only receive it if you are going to be involved
>> - i.e., you have local procs that are going to participate. So you wait
>> until your local procs participate, and then pass your collected bucket
>> along.
>
> ok, i did something similar (i.e. pass all the available data)
> some data might be passed twice, but that might not be an issue
>
>> I suspect the link to the local procs isn't being correctly dealt with,
>> else you couldn't be hanging. Or the rcd isn't correctly passing incoming
>> messages to the base functions to register the collective.
>>
>> I'll look at it over the weekend and can resolve it then.
>
> the attached patch is an illustration of what i was trying to explain.
> coll->nreported is used by rcd as a bitmask of the received messages
> (bit 0 is for the local daemon, bit n for the daemon at distance n)
>
> i am still debugging a race condition:
> if daemons 2 and 3 enter the allgather at the same time, they will send a
> message to each other at the same time and the rml fails to establish the
> connection. i could not figure out whether this is linked to my changes...
>
> Cheers,
>
> Gilles
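A minimal, self-contained C sketch of the recursive-doubling schedule and the
per-distance bitmask bookkeeping described above. All names here are
hypothetical illustrations; this is not the actual grpcomm/rcd code, and the
RML/signature machinery is ignored entirely.

/* Recursive-doubling schedule plus a bitmask that records which
 * contributions have arrived so far (bit 0: the local daemon entered the
 * allgather; bit r+1: the contribution exchanged at round r, i.e. with the
 * partner at distance 2^r). Hypothetical names, for illustration only. */
#include <stdint.h>
#include <stdio.h>

/* partner contacted at a given round: distance doubles each round */
static int rcd_peer(int me, int round)
{
    return me ^ (1 << round);
}

static void mark_reported(uint32_t *reported, int bit)
{
    *reported |= (uint32_t)1 << bit;
}

/* complete once the local daemon and every round's partner have reported */
static int all_reported(uint32_t reported, int nrounds)
{
    uint32_t expected = ((uint32_t)1 << (nrounds + 1)) - 1; /* bits 0..nrounds */
    return reported == expected;
}

int main(void)
{
    int me = 0, nrounds = 2;          /* 4 daemons -> log2(4) = 2 rounds */
    uint32_t reported = 0;

    mark_reported(&reported, 0);      /* bit 0: this daemon entered the allgather */
    for (int r = 0; r < nrounds; r++) {
        int peer = rcd_peer(me, r);   /* partner at distance 2^r */
        printf("round %d: daemon %d exchanges with daemon %d\n", r, me, peer);
        /* the contribution for round r may arrive before we reach this round;
         * recording it in a bitmask lets us accept it whenever it shows up */
        mark_reported(&reported, r + 1);
    }
    printf("allgather complete: %s\n",
           all_reported(reported, nrounds) ? "yes" : "no");
    return 0;
}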
>>
>> On Sep 11, 2014, at 5:23 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>>
>> > Ralph,
>> >
>> > you are right, this was definitely not the right fix (at least with 4
>> > nodes or more)
>> >
>> > i finally understood what is going wrong here:
>> > to make it simple, the allgather recursive doubling algo is not
>> > implemented with MPI_Recv(...,peer,...)-like functions but with
>> > MPI_Recv(...,MPI_ANY_SOURCE,...)-like functions,
>> > and that makes things slightly more complicated.
>> > right now:
>> > - with two nodes: if node 1 is late, it gets stuck in the allgather
>> > - with four nodes: if node 0 is first, then nodes 2 and 3, while node 1
>> > is still late, then node 0 will likely leave the allgather even though
>> > it did not receive anything from node 1
>> > - and so on
>> >
>> > i think i can fix that now
>> >
>> > Cheers,
>> >
>> > Gilles
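Because the exchange is driven by ANY_SOURCE-style receives, a contribution
can reach a daemon before that daemon has entered the collective, which is the
hang described above. The sketch below illustrates the "park early messages on
an ongoing-collective tracker keyed by the signature" idea the thread converges
on. It is a hypothetical, self-contained illustration only; it is not the ORTE
grpcomm code, and every name in it is invented.

/* Park contributions for collectives this daemon has not entered yet,
 * and release the local procs only once it has entered and every expected
 * contribution has arrived. Hypothetical names, illustration only. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_COLL 8

struct coll_tracker {
    int      sig;             /* collective signature (simplified to an int) */
    bool     locally_entered; /* has this daemon entered the allgather?      */
    unsigned reported;        /* bitmask of contributions received so far    */
    unsigned expected;        /* bitmask value that means "complete"         */
    bool     in_use;
};

static struct coll_tracker colls[MAX_COLL];

static struct coll_tracker *lookup_or_create(int sig, unsigned expected)
{
    for (int i = 0; i < MAX_COLL; i++)
        if (colls[i].in_use && colls[i].sig == sig)
            return &colls[i];
    for (int i = 0; i < MAX_COLL; i++)
        if (!colls[i].in_use) {
            colls[i] = (struct coll_tracker){ .sig = sig, .expected = expected,
                                              .in_use = true };
            return &colls[i];
        }
    return NULL;
}

static void maybe_release(struct coll_tracker *c)
{
    /* release the local procs only once we entered AND everyone reported */
    if (c->locally_entered && c->reported == c->expected)
        printf("collective %d complete - releasing local procs\n", c->sig);
}

/* called when a remote contribution arrives, possibly "too early" */
void on_remote_contribution(int sig, unsigned dist_bit, unsigned expected)
{
    struct coll_tracker *c = lookup_or_create(sig, expected);
    if (!c) return;
    c->reported |= dist_bit;  /* park the data on the tracker, do not drop it */
    maybe_release(c);
}

/* called when the local procs finally enter the fence/allgather */
void on_local_entry(int sig, unsigned expected)
{
    struct coll_tracker *c = lookup_or_create(sig, expected);
    if (!c) return;
    c->locally_entered = true;
    c->reported |= 1u;        /* bit 0: local contribution */
    maybe_release(c);
}

int main(void)
{
    /* 2-daemon example: the remote contribution (bit 1) arrives first,
     * then the local procs enter the fence - nothing is lost or dropped. */
    on_remote_contribution(/*sig=*/42, /*dist_bit=*/1u << 1, /*expected=*/0x3);
    on_local_entry(42, 0x3);
    return 0;
}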
>> > On 2014/09/11 23:47, Ralph Castain wrote:
>> >> Yeah, that's not the right fix, I'm afraid. I've made the direct
>> >> component the default again until I have time to dig into this deeper.
>> >>
>> >> On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>> >>
>> >>> Ralph,
>> >>>
>> >>> the root cause is that when the second orted/mpirun runs rcd_finalize_coll,
>> >>> it does not invoke pmix_server_release,
>> >>> because allgather_stub was not previously invoked, since the fence
>> >>> had not yet been entered.
>> >>> /* in rcd_finalize_coll, coll->cbfunc is NULL */
>> >>>
>> >>> the attached patch is likely not the right fix, it was very lightly
>> >>> tested, but so far, it works for me ...
>> >>>
>> >>> Cheers,
>> >>>
>> >>> Gilles
>> >>>
>> >>> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
>> >>>> Ralph,
>> >>>>
>> >>>> things got worse indeed :-(
>> >>>>
>> >>>> now a simple hello world involving two hosts hangs in MPI_Init.
>> >>>> there is still a race condition: if task a calls fence long after task b,
>> >>>> then task b will never leave the fence
>> >>>>
>> >>>> i'll try to debug this ...
>> >>>>
>> >>>> Cheers,
>> >>>>
>> >>>> Gilles
>> >>>>
>> >>>> On 2014/09/11 2:36, Ralph Castain wrote:
>> >>>>> I think I now have this fixed - let me know what you see.
>> >>>>>
>> >>>>> On Sep 9, 2014, at 6:15 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> >>>>>
>> >>>>>> Yeah, that's not the correct fix. The right way to fix it is for
>> >>>>>> all three components to have their own RML tag, and for each of them
>> >>>>>> to establish a persistent receive. They then can use the signature to
>> >>>>>> tell which collective the incoming message belongs to.
>> >>>>>>
>> >>>>>> I'll fix it, but it won't be until tomorrow I'm afraid as today is shot.
>> >>>>>>
>> >>>>>> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>> >>>>>>
>> >>>>>>> Folks,
>> >>>>>>>
>> >>>>>>> since r32672 (trunk), grpcomm/rcd is the default module.
>> >>>>>>> the attached spawn.c test program is a trimmed version of the
>> >>>>>>> spawn_with_env_vars.c test case from the ibm test suite.
>> >>>>>>>
>> >>>>>>> when invoked on two nodes:
>> >>>>>>> - the program hangs with -np 2
>> >>>>>>> - the program can crash with np > 2
>> >>>>>>> the error message is
>> >>>>>>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
>> >>>>>>> AND TAG -33 - ABORTING
>> >>>>>>>
>> >>>>>>> here is my full command line (from node0):
>> >>>>>>>
>> >>>>>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self
>> >>>>>>> --mca coll ^ml ./spawn
>> >>>>>>>
>> >>>>>>> a simple workaround is to add the following extra parameter to the
>> >>>>>>> mpirun command line:
>> >>>>>>> --mca grpcomm_rcd_priority 0
>> >>>>>>>
>> >>>>>>> my understanding is that the race condition occurs when all the
>> >>>>>>> processes call MPI_Finalize():
>> >>>>>>> internally, the pmix module will have mpirun/orted issue two ALLGATHERs
>> >>>>>>> involving mpirun and orted
>> >>>>>>> (one for job 1, aka the parent, and one for job 2, aka the spawned tasks)
>> >>>>>>> the error message is very explicit: this is not (currently) supported
>> >>>>>>>
>> >>>>>>> i wrote the attached rml.patch, which is really a workaround and not a fix:
>> >>>>>>> in this case, each job will invoke an ALLGATHER, but with a different tag
>> >>>>>>> /* that works for a limited number of jobs only */
>> >>>>>>>
>> >>>>>>> i did not commit this patch since this is not a fix; could someone
>> >>>>>>> (Ralph ?) please review the issue and comment ?
>> >>>>>>>
>> >>>>>>> Cheers,
>> >>>>>>>
>> >>>>>>> Gilles
>> >>>>>>>
>> >>>>>>> <spawn.c><rml.patch>
>> >>> <rml2.patch>
> <rml3.patch>
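To make the limitation of the rml.patch workaround concrete, here is a
hypothetical sketch of the "different tag per job" idea and why it "works for
a limited number of jobs only". The names and constants are invented; this is
neither the actual patch nor the ORTE RML API. The fix Ralph describes instead
is a single persistent receive per component plus matching on the collective
signature, not per-job tags.

/* Per-job collective tag workaround: spread jobs over a small pool of tags
 * so the parent job's ALLGATHER and the spawned job's ALLGATHER no longer
 * collide on one tag. Hypothetical names, illustration only. */
#include <stdint.h>
#include <stdio.h>

#define COLL_TAG_BASE  33   /* stand-in for the shared collective tag       */
#define COLL_TAG_SLOTS 8    /* only this many jobs can collect concurrently */

static int coll_tag_for_job(uint32_t jobid)
{
    return COLL_TAG_BASE + (int)(jobid % COLL_TAG_SLOTS);
}

int main(void)
{
    uint32_t parent_job = 1, spawned_job = 2;
    printf("job %u -> tag %d\n", parent_job,  coll_tag_for_job(parent_job));
    printf("job %u -> tag %d\n", spawned_job, coll_tag_for_job(spawned_job));
    /* jobs whose ids differ by a multiple of COLL_TAG_SLOTS would collide
     * again, which is why this only works for a limited number of jobs */
    return 0;
}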