We usually run into trouble with coll/ml because process locality isn't being reported in enough detail for its needs. Given the recent change in the data exchange, I suspect that is the root cause here - I have sent a note to Nathan asking for clarification of the coll/ml locality requirement.
Did this patch "fix" the problem by avoiding the segfault due to coll/ml disqualifying itself? Or did it make everything work okay again?

On Sep 1, 2014, at 3:16 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:

> Folks,
>
> MTT recently failed a bunch of times with the trunk.
> A good suspect is the collective/ibarrier test from the IBM test suite.
>
> Most of the time, CHECK_AND_RECYCLE will fail
> /* IS_COLL_SYNCMEM(coll_op) is true */
>
> With this test case, we just get a glorious SIGSEGV, since OBJ_RELEASE is
> called on MPI_COMM_WORLD (which has *not* been allocated with OBJ_NEW).
>
> I committed r32659 in order to:
> - display an error message
> - abort if the communicator is an intrinsic one
>
> With the attached modified version of the ibarrier test, I always get an
> error on task 0 when invoked with
>   mpirun -np 2 -host node0,node1 --mca btl tcp,self ./ibarrier
>
> The modified version adds some sleep(1) calls in order to work around the
> race condition and get a reproducible crash.
>
> I tried to dig and could not find a correct way to fix this.
> That being said, I tried the attached ml.patch and it did fix the
> problem (even with NREQS=1024).
> I did not commit it since this is very likely incorrect.
>
> Could someone have a look?
>
> Cheers,
>
> Gilles
> <ibarrier.c><ml.patch>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/09/15767.php