Rolf,

the assert fails because the endpoint reference count is greater than one. The root cause is that the endpoint has been added to the eager_rdma_buffers list of the openib btl device (and hence OBJ_RETAIN'ed at ompi/mca/btl/openib/btl_openib_endpoint.c:1009).
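To make the failure mode concrete, here is a minimal standalone sketch of the reference counting involved. It does not use the real OPAL object system: toy_object_t, toy_init and toy_retain are hypothetical stand-ins for opal_object_t, OBJ_NEW and OBJ_RETAIN, and only the counter is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for opal_object_t: only the reference count is modelled. */
typedef struct {
    int obj_reference_count;
} toy_object_t;

/* Hypothetical analogues of OBJ_NEW / OBJ_RETAIN, not the real OPAL macros. */
static void toy_init(toy_object_t *obj)
{
    assert(obj != NULL);
    obj->obj_reference_count = 1;
}

static void toy_retain(toy_object_t *obj)
{
    assert(obj != NULL);
    obj->obj_reference_count++;
}

/* Returns 1 if the check made in del_procs would pass, 0 if it would trip. */
static int del_procs_check_passes(const toy_object_t *endpoint)
{
    return endpoint->obj_reference_count == 1;
}

/* Scenario from this thread: the endpoints array holds one reference,
 * then the eager RDMA list takes a second one. */
static int check_after_eager_rdma_retain(void)
{
    toy_object_t endpoint;
    toy_init(&endpoint);    /* reference held by the endpoints array */
    toy_retain(&endpoint);  /* reference taken when added to eager_rdma_buffers */
    return del_procs_check_passes(&endpoint);  /* count is 2, so the check fails */
}
```

Calling check_after_eager_rdma_retain() returns 0, which corresponds to the assertion in mca_btl_openib_del_procs firing once eager rdma has been set up for the endpoint.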
A simple workaround is not to use eager rdma with the openib btl (e.g. export OMPI_MCA_btl_openib_use_eager_rdma=0).

Attached is a patch that solves the issue. Because of my poor understanding of the openib btl, I did not commit it. I am wondering whether it is safe to simply OBJ_RELEASE the endpoint (e.g. what happens if there are in-flight messages?). I also added some comments that indicate where the patch might be suboptimal.

Nathan, could you please review the attached patch?

Please note that if the faulty assertion is removed, the endpoint will still be OBJ_RELEASE'd, but only in the btl finalize.

Gilles

On Sat, May 24, 2014 at 12:31 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:

> I am still seeing problems with del_procs with openib. Do we believe
> everything should be working? This is with the latest trunk (updated 1
> hour ago).
>
> [rvandevaart@drossetti-ivy0 examples]$ mpirun --mca btl_openib_if_include
> mlx5_0:1 -np 2 -host drossetti-ivy0,drossetti-ivy1 connectivity_c
> Connectivity test on 2 processes PASSED.
> connectivity_c: ../../../../../ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> connectivity_c: ../../../../../ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 28443 on node drossetti-ivy1
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> [rvandevaart@drossetti-ivy0 examples]$
Index: ompi/mca/btl/openib/btl_openib.c
===================================================================
--- ompi/mca/btl/openib/btl_openib.c	(revision 31888)
+++ ompi/mca/btl/openib/btl_openib.c	(working copy)
@@ -1128,7 +1128,7 @@
         struct ompi_proc_t **procs,
         struct mca_btl_base_endpoint_t ** peers)
 {
-    int i,ep_index;
+    int i, ep_index;
     mca_btl_openib_module_t* openib_btl = (mca_btl_openib_module_t*) btl;
     mca_btl_openib_endpoint_t* endpoint;
 
@@ -1144,8 +1144,19 @@
                 continue;
             }
             if (endpoint == del_endpoint) {
+                int j;
                 BTL_VERBOSE(("in del_procs %d, setting another endpoint to null",
                              ep_index));
+                /* remove the endpoint from eager_rdma_buffers */
+                for (j=0; j<openib_btl->device->eager_rdma_buffers_count; j++) {
+                    if (openib_btl->device->eager_rdma_buffers[j] == endpoint) {
+                        /* should it be obj_reference_count == 2 ? */
+                        assert(((opal_object_t*)endpoint)->obj_reference_count > 1);
+                        OBJ_RELEASE(endpoint);
+                        openib_btl->device->eager_rdma_buffers[j] = NULL;
+                        /* can we simply break and leave the for loop ? */
+                    }
+                }
                 opal_pointer_array_set_item(openib_btl->device->endpoints,
                         ep_index, NULL);
                 assert(((opal_object_t*)endpoint)->obj_reference_count == 1);
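
For readers who want to experiment with the idea outside the tree, the loop added by the patch can be modelled roughly as follows. This is only a sketch under simplifying assumptions: toy_endpoint_t, toy_device_t, toy_release and TOY_MAX_EAGER are hypothetical names, toy_release only decrements the counter (a real OBJ_RELEASE also destructs the object at zero), and nothing here shows whether the code that polls eager_rdma_buffers tolerates NULL slots or in-flight messages.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_EAGER 4  /* arbitrary size for the toy list */

/* Toy endpoint: only the reference count is modelled. */
typedef struct {
    int obj_reference_count;
} toy_endpoint_t;

/* Toy device holding the list of endpoints that use eager RDMA. */
typedef struct {
    toy_endpoint_t *eager_rdma_buffers[TOY_MAX_EAGER];
    int eager_rdma_buffers_count;
} toy_device_t;

/* Hypothetical analogue of OBJ_RELEASE: decrements the counter only. */
static void toy_release(toy_endpoint_t *ep)
{
    assert(ep->obj_reference_count > 0);
    ep->obj_reference_count--;
}

/* Drop every reference the eager list holds on `endpoint`.
 * Scans the whole list instead of breaking at the first match,
 * and returns the number of slots that were cleared. */
static int toy_remove_from_eager_list(toy_device_t *device,
                                      toy_endpoint_t *endpoint)
{
    int j, cleared = 0;
    for (j = 0; j < device->eager_rdma_buffers_count; j++) {
        if (device->eager_rdma_buffers[j] == endpoint) {
            toy_release(endpoint);
            device->eager_rdma_buffers[j] = NULL;
            cleared++;
        }
    }
    return cleared;
}

/* One endpoint referenced by both the endpoints array and the eager list. */
static int toy_demo_count_after_removal(void)
{
    toy_endpoint_t other = { 1 };
    toy_endpoint_t endpoint = { 2 };
    toy_device_t device = { { &other, &endpoint, NULL, NULL }, 2 };

    int cleared = toy_remove_from_eager_list(&device, &endpoint);
    assert(cleared == 1);
    assert(device.eager_rdma_buffers[1] == NULL);
    return endpoint.obj_reference_count;
}
```

In this model the count is back to 1 after the removal, so the existing `== 1` assertion would hold. The sketch keeps scanning after the first match because it makes no assumption about whether an endpoint can appear more than once in the list, which is the question the "can we simply break" comment in the patch leaves open.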