Pete -
I am trying to hack together a test case to implement what we had
talked about in the previous emails with a wr_credit...
I'm trying to keep track of it in the openib_device (od) structure
inside openib.c, and would like to keep the necessary changes inside
openib.c if at all possible. The problem I'm running into is that
I need to call check_cq() from inside the send-rdma-writes
function, which lives in openib.c, not ib.c. openib.c has a
function for this, but it's really intended to work *with* ib.c's
check_cq() functionality...
To get around this I made ib_check_cq() visible to
openib.c (got rid of the static and added a declaration to ib.h)..
but I'm getting weird things when I'm linking..
Any ideas how to get around this?
lib/libpvfs2-server.a(bmi-server.o):(.rodata+0x780): undefined
reference to `bmi_ib_ops'
collect2: ld returned 1 exit status
make: *** [src/server/pvfs2-server] Error 1
(I've attached a very rudimentary patch that sort of gets at what I'm
trying to do; not sure if it's correct yet, still trying to compile)
On Fri, Mar 7, 2008 at 11:26 AM, Troy Benjegerdes <[EMAIL PROTECTED]> wrote:
> For further information, the diff to libmthca we used to figure this
> out, and some extensive logfiles of the problem occurring on the servers
> with full PVFS2_DEBUGMASK=network are at:
>
> http://scl.ameslab.gov/~troy/pvfs/ibv_post_send/
>
> (This is the error showing ibv_post_send failing with the -1001 error
> code I added)
>
> [D 12:33:00.861561] PVFS2 Server version 2.7.1pre1-2008-03-05-215140
> starting.
> [E 13:51:39.436430] openib_post_sr_rdmaw: ibv_post_send failed ret:
> -1001 errno: 0
> [E 13:51:39.445031] wr_id: 0x0 next: (nil) sg_list 0x65bb30 num_sge 1
> [E 13:51:39.445073] opcode: 0x0 send_flags: 0x0 imm_data: 0x0
> [E 13:51:39.445091] sr.wr.rdma.remote_addr: 0xf509c000 rkey 0x300055
> [E 13:51:39.445195] openib_post_sr_rdmaw: QP_request sge: 1
> [E 13:51:39.445249] Error: openib_post_sr_rdmaw: QP_sge: 28
> : Unknown error 18446744073709550615.
>
> Included in the logfiles is an attempt where I just tried to repost the
> send after 100 us for 10 retries, but that didn't seem to help. I am
> wondering if something like calling the ib_poll_cq function is needed
> to make some progress, or if there's some other way to back
> off out of openib_post_wr_rdma when the queues are full.
>
>
>
> Kyle Schochenmaier wrote:
> > Pete -
> >
> > We're still trying to track down this "bug" with how we use our ib
> > nics with pvfs2. This is a continuation of previous emails regarding
> > failures inside the openib_post_wr_rdma() functions in openib.c, where
> > we run out of wq entries on the server nics when multiple client
> > processes hammer the filesystem. We only see the servers going down
> > here; the client eventually ends up with a timeout.
> >
> > Troy and I have tracked the specific resources being reported down to
> > the driver (which, unfortunately, share the same -errno codes).
> > As far as we can tell, every time we get into this situation we
> > get a wq_overflow() from the driver, and then of course the post_send
> > fails, leading to unrecoverable problems. We are pretty sure
> > that we're just running into the hw constraints of our nics, and that
> > the best way to deal with this type of thing would be to create some
> > sort of ib_flush_outgoing_requests() functionality for pvfs2 that
> > would either implement a backoff mechanism for the send requests to
> > wait for things at the nic level to be processed, or would just
> > 'flush' everything out. We're not sure exactly how to go about this,
> > where (or whether) this would be appropriate, or if we're missing
> > something obvious.
> > Can we recover from this elegantly?
> >
> > What does everyone think?
> >
> > ~~Kyle
> >
> >
> > Included is the path from pvfs2 to what we are seeing in the driver:
> >
> > pvfs2/src/io/bmi/bmi_ib/openib.c
> >
> > static void openib_post_sr_rdmaw(struct ib_work *sq,
> >                                  msg_header_cts_t *mh_cts,
> >                                  void *mh_cts_buf)
> > {
> > <snip>
> >
> > ret = ibv_post_send(oc->qp, &sr, &bad_wr);
> >
> > <snip>
> > }
> >
> > -------------------------
> > ibv_post_send() points to this function for memfull mellanox cards in
> > libmthca-*/src/qp.c
> > -------------------------
> > int mthca_tavor_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
> > struct ibv_send_wr **bad_wr)
> >
> > {
> > struct mthca_qp *qp = to_mqp(ibqp);
> > void *wqe, *prev_wqe;
> > int ind;
> > int nreq;
> > int ret = 0;
> > int size;
> > int size0 = 0;
> > int i;
> > /*
> > * f0 and op0 cannot be used unless nreq > 0, which means this
> > * function makes it through the loop at least once. So the
> > * code inside the if (!size0) will be executed, and f0 and
> > * op0 will be initialized. So any gcc warning about "may be
> > * used unitialized" is bogus.
> > */
> > uint32_t f0;
> > uint32_t op0;
> >
> > pthread_spin_lock(&qp->sq.lock);
> >
> > ind = qp->sq.next_ind;
> >
> > for (nreq = 0; wr; ++nreq, wr = wr->next) {
> > ****** if (wq_overflow(&qp->sq, nreq,
> > to_mcq(qp->ibv_qp.send_cq))) {
> > ret = -1;
> > *bad_wr = wr;
> > goto out;
> > }
> >
> > <snip>
> >
> >
> >
>
>
--
Kyle Schochenmaier
Index: src/io/bmi/bmi_ib/ib.c
===================================================================
RCS file: /anoncvs/pvfs2/src/io/bmi/bmi_ib/ib.c,v
retrieving revision 1.67
diff -r1.67 ib.c
112c112
< static int ib_check_cq(void)
---
> int ib_check_cq(void)
Index: src/io/bmi/bmi_ib/ib.h
===================================================================
RCS file: /anoncvs/pvfs2/src/io/bmi/bmi_ib/ib.h,v
retrieving revision 1.33
diff -r1.33 ib.h
28a29,31
>
> int ib_check_cq(void);
>
57a61
> int wr_credit; /* make sure we don't overflow available wr in hw */
Index: src/io/bmi/bmi_ib/openib.c
===================================================================
RCS file: /anoncvs/pvfs2/src/io/bmi/bmi_ib/openib.c,v
retrieving revision 1.16
diff -r1.16 openib.c
25a26
> int ib_check_cq(void);
40a42
> int wr_credit; /* credit used to prevent wq overflows */
160a163,166
> /* default wr_credit to num_wr to prevent overflow of the
>  * wr's under heavy load on slower hardware */
> od->wr_credit = num_wr;
>
577a584,595
> /* test the wr_credit: if this post would cause an overflow,
>  * we need to implement an extra state to test things, and
>  * allow some progress to be made while we are waiting for
>  * wr completions
>  */
>
> while (od->wr_credit >= od->nic_max_wr)
> {
>     /* run a cq test to hopefully pop something off the queue */
>     od->wr_credit -= ib_check_cq();
>     error("%s: full wr\n", __func__);
> }
580a599
> ++od->wr_credit;
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers