I think I now have this fixed - let me know what you see.

On Sep 9, 2014, at 6:15 AM, Ralph Castain <r...@open-mpi.org> wrote:

> Yeah, that's not the correct fix. The right way to fix it is for all three 
> components to have their own RML tag, and for each of them to establish a 
> persistent receive. They then can use the signature to tell which collective 
> the incoming message belongs to.
> 
> I'll fix it, but it won't be until tomorrow I'm afraid as today is shot.
> 
> 
> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet 
> <gilles.gouaillar...@iferc.org> wrote:
> 
>> Folks,
>> 
>> Since r32672 (trunk), grpcomm/rcd is the default module.
>> the attached spawn.c test program is a trimmed version of the
>> spawn_with_env_vars.c test case
>> from the ibm test suite.
>> 
>> when invoked on two nodes :
>> - the program hangs with -np 2
>> - the program can crash with np > 2
>> error message is
>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
>> AND TAG -33 - ABORTING
>> 
>> here is my full command line (from node0) :
>> 
>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self --mca
>> coll ^ml ./spawn
>> 
>> a simple workaround is to add the following extra parameter to the
>> mpirun command line :
>> --mca grpcomm_rcd_priority 0
>> 
>> my understanding is that the race condition occurs when all the
>> processes call MPI_Finalize()
>> internally, the pmix module will have mpirun/orted issue two ALLGATHERs
>> involving mpirun and orted
>> (one for job 1 aka the parent, and one for job 2 aka the spawned tasks)
>> the error message is very explicit : this is not (currently) supported
>> 
>> i wrote the attached rml.patch which is really a workaround and not a fix :
>> in this case, each job will invoke an ALLGATHER but with a different tag
>> /* that works for a limited number of jobs only */
>> 
>> i did not commit this patch since this is not a fix, could someone
>> (Ralph ?) please review the issue and comment ?
>> 
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> <spawn.c><rml.patch>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/09/15780.php
> 
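The fix Ralph outlines above (one tag per component, one persistent receive per tag, and the signature used to pick the collective) can be sketched as a toy model in plain C. This is not Open MPI code and does not use the real RML API: the names post_persistent_recv, deliver, component_t and coll_t, the integer "signature", and the table sizes are all hypothetical and exist only for illustration.

```c
#include <stddef.h>

#define MAX_TAGS  8   /* hypothetical number of distinct tags */
#define MAX_COLLS 8   /* hypothetical limit on in-flight collectives per tag */

/* One in-flight collective, identified by its signature
 * (assumed here to be a plain integer such as a job id). */
typedef struct {
    int in_use;
    int signature;
    int received;     /* contributions seen so far */
} coll_t;

/* Per-component state: one persistent receive plus its collectives. */
typedef struct {
    int posted;       /* 1 once the persistent receive is registered */
    coll_t colls[MAX_COLLS];
} component_t;

static component_t components[MAX_TAGS];

/* Register the single persistent receive for a tag.
 * Returns 0 on success, -1 if the tag is invalid or a receive is
 * already posted on it (the conflict the per-job approach ran into). */
int post_persistent_recv(int tag)
{
    if (tag < 0 || tag >= MAX_TAGS || components[tag].posted) {
        return -1;
    }
    components[tag].posted = 1;
    return 0;
}

/* Deliver one incoming message. The tag selects the component, the
 * signature selects (or creates) the collective it belongs to.
 * Returns the number of contributions received so far for that
 * collective, or -1 if no receive is posted or the table is full. */
int deliver(int tag, int signature)
{
    component_t *c;
    coll_t *slot = NULL;
    size_t i;

    if (tag < 0 || tag >= MAX_TAGS || !components[tag].posted) {
        return -1;
    }
    c = &components[tag];

    for (i = 0; i < MAX_COLLS; i++) {
        if (c->colls[i].in_use && c->colls[i].signature == signature) {
            slot = &c->colls[i];      /* existing collective */
            break;
        }
        if (!c->colls[i].in_use && slot == NULL) {
            slot = &c->colls[i];      /* remember first free slot */
        }
    }
    if (slot == NULL) {
        return -1;
    }
    if (!slot->in_use) {
        slot->in_use = 1;
        slot->signature = signature;
        slot->received = 0;
    }
    return ++slot->received;
}
```

In this model the receive is posted once per tag, so two concurrent allgathers (for example one for job 1 and one for job 2) share it and are told apart by signature instead of each posting their own receive on the same tag.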
