On Mon, May 26, 2014 at 12:09:38PM +0900, Gilles Gouaillardet wrote:
>    Rolf,
> 
>    the assert fails because the endpoint reference count is greater than one.
>    the root cause is the endpoint has been added to the list of
>    eager_rdma_buffers of the openib btl device (and hence OBJ_RETAIN'ed at
>    ompi/mca/btl/openib/btl_openib_endpoint.c:1009)
> 
>    a simple workaround is not to use eager rdma with the openib btl
>    (e.g. export OMPI_MCA_btl_openib_use_eager_rdma=0)
> 
>    attached is a patch that solves the issue.
> 
>    because of my poor understanding of the openib btl, i did not commit it.
>    i am wondering whether it is safe to simply OBJ_RELEASE the endpoint
>    (e.g. what happens if there are in-flight messages ?)
>    i also added some comments that indicate the patch might be suboptimal.

It should be safe, as there should be no in-flight messages at del_procs. If
there are, an error would probably be raised on the sending process.
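
For anyone following along, here is a toy model in C of the reference-count
problem described above. It is only a sketch: every name in it (toy_endpoint_t,
toy_device_t, toy_device_add_eager, and so on) is hypothetical, and it does not
use the real OPAL object system or the openib BTL code. It only mimics the
pattern of retaining the endpoint when it is put on the device's eager RDMA
list, which is why a "refcount == 1" check fails until that reference is
dropped again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a reference-counted endpoint (not the real OPAL object). */
typedef struct {
    int refcount;
} toy_endpoint_t;

#define TOY_MAX_EAGER 8

/* Toy stand-in for the device and its list of eager RDMA endpoints. */
typedef struct {
    toy_endpoint_t *eager_rdma_buffers[TOY_MAX_EAGER];
    size_t count;
} toy_device_t;

static void toy_retain(toy_endpoint_t *ep)  { ep->refcount++; }
static void toy_release(toy_endpoint_t *ep) { ep->refcount--; }

/* Putting the endpoint on the device list takes an extra reference. */
static bool toy_device_add_eager(toy_device_t *dev, toy_endpoint_t *ep)
{
    if (dev->count >= TOY_MAX_EAGER) {
        return false;
    }
    dev->eager_rdma_buffers[dev->count] = ep;
    dev->count++;
    toy_retain(ep);
    return true;
}

/* Taking the endpoint off the list drops the reference the list held. */
static bool toy_device_remove_eager(toy_device_t *dev, toy_endpoint_t *ep)
{
    for (size_t i = 0; i < dev->count; i++) {
        if (dev->eager_rdma_buffers[i] == ep) {
            /* Fill the hole with the last entry, then shrink the list. */
            dev->count--;
            dev->eager_rdma_buffers[i] = dev->eager_rdma_buffers[dev->count];
            dev->eager_rdma_buffers[dev->count] = NULL;
            toy_release(ep);
            return true;
        }
    }
    return false;
}

/* del_procs-style check: passes only if the caller holds the last reference. */
static bool toy_del_proc_would_pass(const toy_endpoint_t *ep)
{
    return ep->refcount == 1;
}
```

In this model, an endpoint created with refcount 1 passes the check, fails it
after toy_device_add_eager(), and passes again after toy_device_remove_eager().
Whether the real fix should remove the endpoint from the list or simply release
it is exactly the question the patch review needs to settle.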

>    Nathan, could you please review the attached patch ?

Sure. I will take a look. It doesn't surprise me there are these sorts
of issues in del_procs. The functionality has been broken for some time.

-Nathan
