Rolf,

the assert fails because the endpoint reference count is greater than one.
the root cause is that the endpoint has been added to the list of
eager_rdma_buffers of the openib btl device (and hence OBJ_RETAIN'ed at
ompi/mca/btl/openib/btl_openib_endpoint.c:1009)
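
to make the counting explicit, here is a minimal standalone sketch. it does not use the real OPAL object system: toy_object_t, toy_device_t, toy_construct, toy_retain and toy_add_eager are hypothetical stand-ins, and the only assumption taken from the code is that adding an endpoint to eager_rdma_buffers takes one extra reference.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for opal_object_t; NOT the real OPAL definition. */
typedef struct {
    int ref_count;
} toy_object_t;

/* Hypothetical helpers mimicking "new object starts at 1" and OBJ_RETAIN. */
static void toy_construct(toy_object_t *obj) { obj->ref_count = 1; }
static void toy_retain(toy_object_t *obj)    { obj->ref_count++; }

#define TOY_MAX_EAGER 4

/* Toy device that keeps a list of endpoints using eager RDMA. */
typedef struct {
    toy_object_t *eager_rdma_buffers[TOY_MAX_EAGER];
    int eager_rdma_buffers_count;
} toy_device_t;

/* Adding an endpoint to the list takes an extra reference.
 * Returns 0 on success, -1 if the list is full. */
static int toy_add_eager(toy_device_t *dev, toy_object_t *ep)
{
    assert(dev != NULL && ep != NULL);
    if (dev->eager_rdma_buffers_count >= TOY_MAX_EAGER) {
        return -1;
    }
    toy_retain(ep);
    dev->eager_rdma_buffers[dev->eager_rdma_buffers_count++] = ep;
    return 0;
}
```

in this model, an endpoint that was constructed and then passed to toy_add_eager ends up with a count of 2, so a later check that the count equals 1 cannot hold until the list gives its reference back.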

a simple workaround is not to use eager rdma with the openib btl
(e.g. export OMPI_MCA_btl_openib_use_eager_rdma=0)

attached is a patch that solves the issue.

because of my poor understanding of the openib btl, i did not commit it.
i am wondering whether it is safe to simply OBJ_RELEASE the endpoint
(e.g. what happens if there are inflight messages ?)
i also added some comments that indicate the patch might be suboptimal.
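
for reference, the idea of the patch can be sketched outside of Open MPI as follows. this is only a toy model under stated assumptions: toy_endpoint_t, toy_device_t and toy_remove_eager are hypothetical names, the reference count is a plain int instead of an OPAL object, and inflight messages are not modelled at all, so it says nothing about whether the release is safe in the real btl.

```c
#include <assert.h>
#include <stddef.h>

/* Toy endpoint with a plain integer reference count (not opal_object_t). */
typedef struct {
    int ref_count;
} toy_endpoint_t;

#define TOY_MAX_EAGER 4

/* Toy device holding the list of endpoints that use eager RDMA. */
typedef struct {
    toy_endpoint_t *eager_rdma_buffers[TOY_MAX_EAGER];
    int eager_rdma_buffers_count;
} toy_device_t;

/* Clear every list slot that points at ep, dropping one reference per slot.
 * Returns the number of slots that were cleared. */
static int toy_remove_eager(toy_device_t *dev, toy_endpoint_t *ep)
{
    int removed = 0;
    for (int j = 0; j < dev->eager_rdma_buffers_count; j++) {
        if (dev->eager_rdma_buffers[j] == ep) {
            /* The list must not be holding the last reference. */
            assert(ep->ref_count > 1);
            ep->ref_count--;                    /* models OBJ_RELEASE */
            dev->eager_rdma_buffers[j] = NULL;  /* slot no longer owns ep */
            removed++;
        }
    }
    return removed;
}
```

the sketch scans the whole list instead of breaking at the first match, which mirrors the open question in the patch comments. if an endpoint can appear at most once in the list, breaking after the first match would give the same result.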

Nathan, could you please review the attached patch ?

please note that if the faulty assertion is removed, the endpoint will be
OBJ_RELEASE'd, but only in the btl finalize.

Gilles



On Sat, May 24, 2014 at 12:31 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:

> I am still seeing problems with del_procs with openib.  Do we believe
> everything should be working?  This is with the latest trunk (updated 1
> hour ago).
>
> [rvandevaart@drossetti-ivy0 examples]$ mpirun --mca btl_openib_if_include
> mlx5_0:1 -np 2 -host drossetti-ivy0,drossetti-ivy1 connectivity_c
> Connectivity test on 2 processes PASSED.
> connectivity_c: ../../../../../ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> connectivity_c: ../../../../../ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 28443 on node drossetti-ivy1
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> [rvandevaart@drossetti-ivy0 examples]$
>
Index: ompi/mca/btl/openib/btl_openib.c
===================================================================
--- ompi/mca/btl/openib/btl_openib.c    (revision 31888)
+++ ompi/mca/btl/openib/btl_openib.c    (working copy)
@@ -1128,7 +1128,7 @@
         struct ompi_proc_t **procs,
         struct mca_btl_base_endpoint_t ** peers)
 {
-    int i,ep_index;
+    int i, ep_index;
     mca_btl_openib_module_t* openib_btl = (mca_btl_openib_module_t*) btl;
     mca_btl_openib_endpoint_t* endpoint;
 
@@ -1144,8 +1144,19 @@
                 continue;
             }
             if (endpoint == del_endpoint) {
+                int j;
                 BTL_VERBOSE(("in del_procs %d, setting another endpoint to null",
                              ep_index));
+                /* remove the endpoint from eager_rdma_buffers */
+                for (j=0; j<openib_btl->device->eager_rdma_buffers_count; j++) {
+                    if (openib_btl->device->eager_rdma_buffers[j] == endpoint) {
+                        /* should it be obj_reference_count == 2 ? */
+                        assert(((opal_object_t*)endpoint)->obj_reference_count > 1);
+                        OBJ_RELEASE(endpoint);
+                        openib_btl->device->eager_rdma_buffers[j] = NULL;
+                        /* can we simply break and leave the for loop ? */
+                    }
+                }
                 opal_pointer_array_set_item(openib_btl->device->endpoints,
                         ep_index, NULL);
                 assert(((opal_object_t*)endpoint)->obj_reference_count == 1);
