Howard and Rolf,

I initially reported the issue at http://www.open-mpi.org/community/lists/devel/2014/09/15767.php
r32659 is neither a fix nor a regression; it simply aborts instead of calling OBJ_RELEASE(mpi_comm_world).
/* my point here is that we should focus on the root cause and not the consequence */

First, this is a race condition, so one run is not enough to conclude that the problem is fixed.
Second, if you do not configure with --enable-debug, there might be silent data corruption with undefined results if the bug is hit. An undefined result can mean that the test succeeds.

Bottom line, and imho:
- if your test succeeds without r32659, it just means you were lucky ...
- an abort with an understandable error message is better than silent corruption

Last but not least, r32659 was acked for v1.8 (#4888). The coll/ml priority is now zero in v1.8, and this is likely the only reason why you do not see any errors in that branch.

Cheers,

Gilles

On Tue, Sep 16, 2014 at 8:33 AM, Pritchard Jr., Howard <howa...@lanl.gov> wrote:

> Hi Rolf,
>
> Okay. I’ll work with ORNL folks to see how to really fix this.
>
> Howard
>
> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Rolf vandeVaart
> *Sent:* Monday, September 15, 2014 3:10 PM
> *To:* Open MPI Developers
> *Subject:* Re: [OMPI devel] coll ml error with some nonblocking collectives
>
> Confirmed that trunk version r32658 does pass the test.
>
> *From:* devel [mailto:devel-boun...@open-mpi.org <devel-boun...@open-mpi.org>] *On Behalf Of *Pritchard Jr., Howard
> *Sent:* Monday, September 15, 2014 4:16 PM
> *To:* Open MPI Developers
> *Subject:* Re: [OMPI devel] coll ml error with some nonblocking collectives
>
> Hi Rolf,
>
> This may be related to change set 32659.
>
> If you back this change out, do the tests pass?
>
> Howard
>
> *From:* devel [mailto:devel-boun...@open-mpi.org <devel-boun...@open-mpi.org>] *On Behalf Of *Rolf vandeVaart
> *Sent:* Monday, September 15, 2014 8:55 AM
> *To:* de...@open-mpi.org
> *Subject:* [OMPI devel] coll ml error with some nonblocking collectives
>
> I wonder if anyone else is seeing this failure. Not sure when this started
> but it is only on the trunk. Here is a link to my failures as well as an
> example below that. There are a variety of nonblocking collectives failing
> like this.
>
> http://mtt.open-mpi.org/index.php?do_redir=2208
>
> [rvandevaart@drossetti-ivy0 collective]$ mpirun --mca btl self,sm,tcp -host drossetti-ivy0,drossetti-ivy0,drossetti-ivy1,drossetti-ivy1 iallreduce
> --------------------------------------------------------------------------
> ML detected an unrecoverable error on intrinsic communicator MPI_COMM_WORLD
>
> The program will now abort
> --------------------------------------------------------------------------
> [drossetti-ivy0.nvidia.com:04664] 3 more processes have sent help message help-mpi-coll-ml.txt / coll-ml-check-fatal-error
> [rvandevaart@drossetti-ivy0 collective]$
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/09/15834.php