[EMAIL PROTECTED] wrote on Fri, 07 Mar 2008 09:30 -0600:
> We're still trying to track down this "bug" in how we use our IB
> NICs with pvfs2. This is a continuation of previous emails regarding
> failures inside the openib_post_wr_rdma() function in openib.c, where
> we run out of WQ entries on the server NICs when multiple client
> processes hammer the filesystem. We only see the servers failing
> here; the client eventually ends up with a timeout.
>
> Troy and I have tracked the specific resources that were being
> reported down to the driver (and unfortunately they share the same
> -errno codes).
> So as far as we can tell, every time we get into this situation, we
> get a wq_overflow() from the driver, and then of course the post_send
> fails, leading to unrecoverable problems. We are pretty sure that
> we're just running into the hardware constraints of our NICs, and
> that the best way to deal with this would be to create some sort of
> ib_flush_outgoing_requests() functionality for pvfs2 that would
> either implement a backoff mechanism so send requests wait for things
> at the NIC level to be processed, or would just 'flush' everything
> out. We're not sure exactly how to go about this, where or if this
> would be appropriate, or if we're missing something obvious.
> Can we recover from this elegantly?
>
> What does everyone think?
Sounds plausible. Look at c->send_credit, which is used to make
sure we don't overflow the eager buffers. Add something like
c->wr_credit which would make sure you don't overflow the number of
work requests available. In openib_new_connection() we guess at
the number of outstanding work requests like this:
num_wr = ib_device->eager_buf_num + 50; /* plus some rdmaw */
so one for each send to an eager buf and 50 for random RDMA
operations. This number is allocated for both send and receive.
In your new c->wr_credit, you can set it to num_wr, then decrease it
each time a send is posted, and increase it each time the CQE for
the operation is reaped. If it is zero when you go to send, we
could add another waiting state so test_sq() will retry the RDMA
periodically, waiting for enough CQEs to be reaped.
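A minimal sketch of that credit scheme in C. All names here
(ib_connection, wr_credit, wr_min, try_post_send, reap_send_cqe) are
illustrative, not the real pvfs2/openib.c identifiers, and the
constants just mirror the num_wr guess above:

```c
/* Hypothetical per-connection send-WR credit scheme; names and
 * constants are illustrative, not the actual pvfs2 code. */

#define EAGER_BUF_NUM 20   /* assumed ib_device->eager_buf_num */
#define RDMA_SLACK    50   /* "plus some rdmaw" */

struct ib_connection {
    int num_wr;     /* WRs allocated for the send queue at QP creation */
    int wr_credit;  /* how many more sends we may post right now */
    int wr_min;     /* low-water mark, for debugging starvation */
};

void connection_init(struct ib_connection *c)
{
    c->num_wr = EAGER_BUF_NUM + RDMA_SLACK;
    c->wr_credit = c->num_wr;
    c->wr_min = c->num_wr;
}

/* Returns 1 if the send may be posted now, 0 if the caller should
 * park the request in a waiting state and let test_sq() retry later. */
int try_post_send(struct ib_connection *c)
{
    if (c->wr_credit == 0)
        return 0;                  /* would overflow the send queue */
    c->wr_credit--;
    if (c->wr_credit < c->wr_min)
        c->wr_min = c->wr_credit;  /* track the min for debugging */
    return 1;
}

/* Called once per send-side CQE reaped from the completion queue. */
void reap_send_cqe(struct ib_connection *c)
{
    c->wr_credit++;
}
```

The point is that the post path never hands the driver more WRs than
the QP was created with, so wq_overflow() cannot fire; instead the
request waits in pvfs2's own state machine.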
Some debugging to track the min value of this new counter will help
us verify the code is working properly. Like maybe we're just
starving the CQ reaping for some reason. I will be surprised to
find out that you have more than 20 + 50 send operations to a single
client.
Could also be an adapter-wide resource limit that isn't properly
accounted for when creating the QP in mthca. Maybe sum (total -
min) across all the connections to figure out what the peak
"adapter-wide number of WRs in flight" is, and see if that
correlates with the mthca complaint.
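That diagnostic could look something like this (again with made-up
names; only the per-connection WR total and the min-credit counter
come from the discussion above):

```c
/* Hypothetical per-connection counters: num_wr is the WR count
 * allocated at QP creation, wr_min the lowest credit value seen. */
struct conn_stats {
    int num_wr;
    int wr_min;
};

/* (num_wr - wr_min) is the peak number of WRs one connection ever
 * had in flight; summing over all connections estimates the peak
 * adapter-wide WR usage, to compare against the mthca limit. */
int adapter_peak_wrs(const struct conn_stats *s, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += s[i].num_wr - s[i].wr_min;
    return total;
}
```

If that sum tracks the adapter-wide limit at the times the driver
complains, the per-connection credits alone won't be enough and the
cap would need to be global.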
-- Pete
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers