Pete -

We're still trying to track down this "bug" in how pvfs2 uses our IB
NICs.  This continues the earlier emails about failures inside the
openib_post_wr_rdma() functions in openib.c, where the server NICs run
out of wq entries when multiple client processes hammer the
filesystem.  Only the servers fail here; the clients eventually end up
with a timeout.

Troy and I have traced the errors being reported down to the specific
resources in the driver (which unfortunately share the same -errno
codes).
As far as we can tell, every time we get into this situation
wq_overflow() fires in the driver, the post_send then fails, and we
end up with problems we cannot recover from.  We are pretty sure that
we're just running into the hw constraints of our NICs, and that the
best way to deal with this would be to create some sort of
ib_flush_outgoing_requests() functionality for pvfs2 that would either
implement a backoff mechanism for the send requests, waiting for
things at the NIC level to be processed, or would just 'flush'
everything out.  We're not sure exactly how to go about this, where or
whether it would be appropriate, or if we're missing something
obvious.
Can we recover from this elegantly?
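
To make the backoff idea concrete, here is a rough sketch of what we
have in mind.  It is only a toy model, not pvfs2 or libmthca code:
struct model_sq, model_post(), model_reap() and post_with_backoff()
are hypothetical names, and the "finished" counter stands in for
whatever completions a real send CQ would hand back when polled.  The
point is just the control flow: when a post fails because the queue is
full, reap completions and retry a bounded number of times instead of
treating the first failure as fatal.

```c
#include <assert.h>

/* Hypothetical stand-in for a hardware send queue (not a verbs type). */
struct model_sq {
    unsigned head;      /* work requests posted so far */
    unsigned tail;      /* work requests reaped so far */
    unsigned max;       /* queue depth */
    unsigned finished;  /* completions ready but not yet reaped */
};

/* Post one request; returns -1 when the queue is already full. */
static int model_post(struct model_sq *sq)
{
    if (sq->head - sq->tail >= sq->max)
        return -1;
    sq->head++;
    return 0;
}

/* Reap every completion that is ready, freeing that many slots.
 * This plays the role that polling the send CQ would play. */
static unsigned model_reap(struct model_sq *sq)
{
    unsigned n = sq->finished;

    assert(n <= sq->head - sq->tail);
    sq->tail += n;
    sq->finished = 0;
    return n;
}

/* Try to post; on a full queue, reap and retry up to max_tries times.
 * Real code would presumably also sleep or yield between attempts. */
static int post_with_backoff(struct model_sq *sq, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        if (model_post(sq) == 0)
            return 0;
        model_reap(sq);
    }
    return -1;  /* still full: the caller has to handle this */
}
```

In this model a full queue only becomes fatal once the retry budget is
used up without any completions arriving.  Whether the real BMI code
can safely poll for completions at that point is exactly the part we
are unsure about.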

What does everyone think?

~~Kyle


Included is the path from pvfs2 to what we are seeing in the driver:

pvfs2/src/io/bmi/bmi_ib/openib.c

static void openib_post_sr_rdmaw(struct ib_work *sq, msg_header_cts_t *mh_cts,
                                 void *mh_cts_buf)
{
<snip>

        ret = ibv_post_send(oc->qp, &sr, &bad_wr);

<snip>
}

-------------------------
For memfull Mellanox cards, ibv_post_send() points to this function in
libmthca-*/src/qp.c
-------------------------
int mthca_tavor_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
                          struct ibv_send_wr **bad_wr)

{
        struct mthca_qp *qp = to_mqp(ibqp);
        void *wqe, *prev_wqe;
        int ind;
        int nreq;
        int ret = 0;
        int size;
        int size0 = 0;
        int i;
        /*
         * f0 and op0 cannot be used unless nreq > 0, which means this
         * function makes it through the loop at least once.  So the
         * code inside the if (!size0) will be executed, and f0 and
         * op0 will be initialized.  So any gcc warning about "may be
         * used unitialized" is bogus.
         */
        uint32_t f0;
        uint32_t op0;

        pthread_spin_lock(&qp->sq.lock);

        ind = qp->sq.next_ind;

        for (nreq = 0; wr; ++nreq, wr = wr->next) {
******          if (wq_overflow(&qp->sq, nreq, to_mcq(qp->ibv_qp.send_cq))) {
                        ret = -1;
                        *bad_wr = wr;
                        goto out;
                }

<snip>



-- 
Kyle Schochenmaier
_______________________________________________
Pvfs2-developers mailing list
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
