On Wednesday 26 March 2008 05:56, Tang, Changqing wrote:
>
> Hi,
> We are debugging our dynamic process code. When we call
>
>     ret = ibv_poll_cq(cq_hndl, 1, &compl);
>
> the peer process may have destroyed the QP.
>
> However, ibv_poll_cq() returns -2 in 'ret', and 'errno' is still 0.
>
> What could be the reason for this error?
>
> There is a posted send pending for completion, so the error should be
> reported via the completion status, not the polling function
> itself.
>
> Thanks for any help. This is OFED 1.3

Roland,

It looks like we have a race condition in mlx4_destroy_qp. We clean the cq
BEFORE modifying the QP to reset (done in the kernel as part of the
ibv_cmd_destroy_qp() flow). CQ's problem has exposed this bug.

mlx4_cq_clean needs to be invoked **after** the destroy:

Index: libmlx4/src/verbs.c
===================================================================
--- libmlx4.orig/src/verbs.c	2008-03-26 09:00:08.000000000 +0200
+++ libmlx4/src/verbs.c	2008-03-26 09:00:52.449586000 +0200
@@ -558,11 +558,6 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp)
 	struct mlx4_qp *qp = to_mqp(ibqp);
 	int ret;
 
-	mlx4_cq_clean(to_mcq(ibqp->recv_cq), ibqp->qp_num,
-		       ibqp->srq ? to_msrq(ibqp->srq) : NULL);
-	if (ibqp->send_cq != ibqp->recv_cq)
-		mlx4_cq_clean(to_mcq(ibqp->send_cq), ibqp->qp_num, NULL);
-
 	mlx4_lock_cqs(ibqp);
 	mlx4_clear_qp(to_mctx(ibqp->context), ibqp->qp_num);
 	mlx4_unlock_cqs(ibqp);
@@ -576,6 +571,11 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp)
 		return ret;
 	}
 
+	mlx4_cq_clean(to_mcq(ibqp->recv_cq), ibqp->qp_num,
+		       ibqp->srq ? to_msrq(ibqp->srq) : NULL);
+	if (ibqp->send_cq != ibqp->recv_cq)
+		mlx4_cq_clean(to_mcq(ibqp->send_cq), ibqp->qp_num, NULL);
+
 	if (!ibqp->srq && ibqp->qp_type != IBV_QPT_XRC)
 		mlx4_free_db(to_mctx(ibqp->context), MLX4_DB_TYPE_RQ, qp->db);
 	free(qp->sq.wrid);