Ralph, you are right, this was definitely not the right fix (at least with
4 nodes or more).

I finally understood what is going wrong here. To keep it simple: the
allgather recursive doubling algo is not implemented with
MPI_Recv(..., peer, ...) like functions, but with
MPI_Recv(..., MPI_ANY_SOURCE, ...) like functions, and that makes things
slightly more complicated. Right now:
- with two nodes: if node 1 is late, it gets stuck in the allgather
- with four nodes: if node 0 enters first, then nodes 2 and 3 enter while
  node 1 is still late, node 0 will likely leave the allgather even though
  it did not receive anything from node 1
- and so on (a rough sketch illustrating this follows below)
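To illustrate, here is a rough sketch of a recursive doubling allgather in
plain MPI (not the actual grpcomm/rcd code, and it assumes a power-of-two
number of ranks). When the receive names the exact partner, round k of a
rank can only be completed by round k of that partner; if the receive were
posted with MPI_ANY_SOURCE instead, an early rank could match a message a
third rank sent for a later round and leave the allgather without the late
rank's contribution, which is exactly the 4-node scenario above.

/* Rough sketch, not the actual grpcomm/rcd code: recursive doubling
 * allgather of one int per rank, assuming nprocs is a power of two. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static void rd_allgather(int mine, int *all, int rank, int nprocs, MPI_Comm comm)
{
    all[rank] = mine;
    for (int dist = 1; dist < nprocs; dist <<= 1) {
        int partner = rank ^ dist;               /* partner for this round        */
        int count   = dist;                      /* block size doubles each round */
        int sbase   = (rank    / dist) * dist;   /* block I have gathered so far  */
        int rbase   = (partner / dist) * dist;   /* block the partner sends me    */
        /* the receive names the exact partner, so each round can only be
         * completed by that partner's matching round; with MPI_ANY_SOURCE
         * here, a message from another rank (sent for a later round) could
         * be consumed instead, and a late rank's contribution missed */
        MPI_Sendrecv(&all[sbase], count, MPI_INT, partner, 0,
                     &all[rbase], count, MPI_INT, partner, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char *argv[])
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int *all = malloc(nprocs * sizeof(int));
    rd_allgather(10 * rank, all, rank, nprocs, MPI_COMM_WORLD);
    printf("rank %d: all[0]=%d all[%d]=%d\n",
           rank, all[0], nprocs - 1, all[nprocs - 1]);
    free(all);
    MPI_Finalize();
    return 0;
}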
I think I can fix that from now.

Cheers,

Gilles

On 2014/09/11 23:47, Ralph Castain wrote:
> Yeah, that's not the right fix, I'm afraid. I've made the direct component
> the default again until I have time to dig into this deeper.
>
> On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
>> Ralph,
>>
>> The root cause is that when the second orted/mpirun runs rcd_finalize_coll,
>> it does not invoke pmix_server_release, because allgather_stub was not
>> previously invoked, since the fence had not yet been entered.
>> /* in rcd_finalize_coll, coll->cbfunc is NULL */
>>
>> The attached patch is likely not the right fix, it was very lightly
>> tested, but so far it works for me ...
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
>>> Ralph,
>>>
>>> Things got worse indeed :-(
>>>
>>> Now a simple hello world involving two hosts hangs in mpi_init.
>>> There is still a race condition: if task a calls fence long after task b,
>>> then task b will never leave the fence.
>>>
>>> I'll try to debug this ...
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 2014/09/11 2:36, Ralph Castain wrote:
>>>> I think I now have this fixed - let me know what you see.
>>>>
>>>>
>>>> On Sep 9, 2014, at 6:15 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>>> Yeah, that's not the correct fix. The right way to fix it is for all
>>>>> three components to have their own RML tag, and for each of them to
>>>>> establish a persistent receive. They can then use the signature to tell
>>>>> which collective the incoming message belongs to.
>>>>>
>>>>> I'll fix it, but it won't be until tomorrow I'm afraid as today is shot.
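For context, a persistent receive along those lines might look roughly like
the sketch below. This is only an illustration: the ORTE_RML_TAG_ALLGATHER_RCD
tag name and the lookup_coll_by_signature() helper are placeholders, not
existing code.

/* Sketch only: one dedicated tag per grpcomm component, a single
 * persistent receive, and dispatch on the packed signature.
 * ORTE_RML_TAG_ALLGATHER_RCD and lookup_coll_by_signature() are
 * placeholders, not existing symbols. */
#include "orte/mca/rml/rml.h"
#include "orte/mca/grpcomm/grpcomm.h"

static void rcd_allgather_recv(int status, orte_process_name_t *sender,
                               opal_buffer_t *buffer, orte_rml_tag_t tag,
                               void *cbdata)
{
    /* the sender packs the collective's signature first; use it to find
     * the tracker of the collective this message belongs to */
    orte_grpcomm_coll_t *coll = lookup_coll_by_signature(buffer); /* placeholder */
    if (NULL == coll) {
        /* message for a collective we have not started locally yet */
        return;
    }
    /* ... merge the contribution and advance to the next round ... */
}

static void rcd_post_recv(void)
{
    /* posted once and kept active, so messages from any peer, any round
     * and any collective are all delivered to the same callback */
    orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
                            ORTE_RML_TAG_ALLGATHER_RCD,  /* placeholder tag */
                            ORTE_RML_PERSISTENT,
                            rcd_allgather_recv, NULL);
}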
>>>>> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet
>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>
>>>>>> Folks,
>>>>>>
>>>>>> Since r32672 (trunk), grpcomm/rcd is the default module.
>>>>>> The attached spawn.c test program is a trimmed version of the
>>>>>> spawn_with_env_vars.c test case from the ibm test suite.
>>>>>>
>>>>>> When invoked on two nodes:
>>>>>> - the program hangs with -np 2
>>>>>> - the program can crash with -np > 2; the error message is:
>>>>>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
>>>>>> AND TAG -33 - ABORTING
>>>>>>
>>>>>> Here is my full command line (from node0):
>>>>>>
>>>>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self --mca
>>>>>> coll ^ml ./spawn
>>>>>>
>>>>>> A simple workaround is to add the following extra parameter to the
>>>>>> mpirun command line:
>>>>>> --mca grpcomm_rcd_priority 0
>>>>>>
>>>>>> My understanding is that the race condition occurs when all the
>>>>>> processes call MPI_Finalize().
>>>>>> Internally, the pmix module will have mpirun/orted issue two ALLGATHERs
>>>>>> involving mpirun and the orted (one for job 1, aka the parent, and one
>>>>>> for job 2, aka the spawned tasks).
>>>>>> The error message is very explicit: this is not (currently) supported.
>>>>>>
>>>>>> I wrote the attached rml.patch, which is really a workaround and not a
>>>>>> fix: in this case, each job will invoke an ALLGATHER but with a
>>>>>> different tag.
>>>>>> /* that works for a limited number of jobs only */
>>>>>>
>>>>>> I did not commit this patch since this is not a fix. Could someone
>>>>>> (Ralph?) please review the issue and comment?
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> <spawn.c><rml.patch>
>> <rml2.patch>
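The spawn.c attachment itself is not reproduced here, but a minimal
self-spawning test in the same spirit looks roughly like the code below
(a sketch, not the actual attachment): the parent job spawns a copy of
itself, and both jobs then call MPI_Finalize, which is where the two
concurrent allgathers described above reach mpirun/orted.

/* Sketch of a minimal spawn test, not the actual spawn.c attachment. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, intercomm;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* parent job: spawn a copy of ourselves; the children will see a
         * non-NULL parent communicator and will not spawn again */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, size, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        MPI_Comm_disconnect(&intercomm);
        printf("parent %d/%d done\n", rank, size);
    } else {
        /* spawned job: just disconnect from the parent */
        MPI_Comm_disconnect(&parent);
        printf("child %d/%d done\n", rank, size);
    }

    /* both jobs finalize at roughly the same time: this is where the two
     * allgathers (one per jobid) reach mpirun/orted concurrently */
    MPI_Finalize();
    return 0;
}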