On Mon, May 26, 2014 at 12:09:38PM +0900, Gilles Gouaillardet wrote:
> Rolf,
>
> the assert fails because the endpoint reference count is greater than one.
> the root cause is the endpoint has been added to the list of
> eager_rdma_buffers of the openib btl device (and hence OBJ_RETAIN'ed at
> ompi/mca/btl/openib/btl_openib_endpoint.c:1009)
>
> a simple workaround is not to use eager rdma with the openib btl
> (e.g. export OMPI_MCA_btl_openib_use_eager_rdma=0)
>
> here is attached a patch that solves the issue.
>
> because of my poor understanding of the openib btl, i did not commit it.
> i am wondering whether it is safe to simply OBJ_RELEASE the endpoint
> (e.g. what happens if there are inflight messages ?)
> i also added some comments that indicate the patch might be suboptimal.

It should be safe, as there should be no in-flight messages at del_procs. If
there are, an error would probably be raised on the sending process.

> Nathan, could you please review the attached patch ?

Sure. I will take a look. It doesn't surprise me that there are these sorts of
issues in del_procs. The functionality has been broken for some time.

-Nathan
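
For readers following the thread, the reference-count situation described above can be sketched roughly as follows. This is only an illustrative model, not Open MPI code: endpoint_t, endpoint_retain(), endpoint_release() and refcount_at_del_procs() are hypothetical names standing in for the real endpoint object and the OBJ_RETAIN / OBJ_RELEASE macros, and the "drop the list reference" step is an assumption about what the attached patch does.

```c
#include <assert.h>

/* Hypothetical stand-in for a reference-counted endpoint.
 * This is NOT the Open MPI endpoint type. */
typedef struct {
    int refcount;
} endpoint_t;

/* A freshly constructed object holds one reference (its owner's). */
static void endpoint_init(endpoint_t *ep)    { ep->refcount = 1; }

/* Models OBJ_RETAIN: someone else now also holds the object. */
static void endpoint_retain(endpoint_t *ep)  { ep->refcount++; }

/* Models OBJ_RELEASE: drop one reference. */
static void endpoint_release(endpoint_t *ep)
{
    assert(ep->refcount > 0);
    ep->refcount--;
}

/* Returns the reference count observed at the point where del_procs
 * would assert that exactly one reference remains.
 *   use_eager_rdma: the endpoint was put on the device's eager RDMA
 *                   list, which retains it.
 *   drop_list_ref:  assumed patch behaviour, releasing the reference
 *                   held by that list before the check. */
int refcount_at_del_procs(int use_eager_rdma, int drop_list_ref)
{
    endpoint_t ep;
    endpoint_init(&ep);

    if (use_eager_rdma) {
        endpoint_retain(&ep);          /* list now holds a reference */
        if (drop_list_ref) {
            endpoint_release(&ep);     /* list reference given back */
        }
    }
    return ep.refcount;
}
```

In this model, refcount_at_del_procs(1, 0) yields 2, which corresponds to the failing assert; disabling eager RDMA (0, 0) or releasing the list's reference first (1, 1) both yield 1.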