We usually run into trouble with coll/ml because process locality isn't being reported in enough detail for its needs. Given the recent change in the data exchange, I suspect that is the root cause here - I have sent a note to Nathan asking for clarification of the coll/ml locality requirement.
Did this patch "fix" the problem by avoiding the segfault due to coll/ml disqualifying itself? Or did it make everything work okay again?

On Sep 1, 2014, at 3:16 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:

> Folks,
>
> MTT recently failed a bunch of times with the trunk.
> A good suspect is the collective/ibarrier test from the IBM test suite.
>
> Most of the time, CHECK_AND_RECYCLE will fail
> /* IS_COLL_SYNCMEM(coll_op) is true */
>
> With this test case, we just get a glorious SIGSEGV, since OBJ_RELEASE is
> called on MPI_COMM_WORLD (which has *not* been allocated with OBJ_NEW).
>
> I committed r32659 in order to:
> - display an error message
> - abort if the communicator is an intrinsic one
>
> With the attached modified version of the ibarrier test, I always get an
> error on task 0 when invoked with
>   mpirun -np 2 -host node0,node1 --mca btl tcp,self ./ibarrier
>
> The modified version adds some sleep(1) calls in order to work around the
> race condition and get a reproducible crash.
>
> I tried to dig and could not find a correct way to fix this.
> That being said, I tried the attached ml.patch and it did fix the
> problem (even with NREQS=1024).
> I did not commit it since this is very likely incorrect.
>
> Could someone have a look?
>
> Cheers,
>
> Gilles
> <ibarrier.c><ml.patch>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/09/15767.php