On Fri, Nov 20, 2009 at 9:08 PM, Sean Hefty <[email protected]> wrote:
>>mlx4/qp.c: mlx4_ib_post_send()
>>* when passing a list containing more than one item to
>>mlx4_ib_post_send(), and sending the second or later item fails (e.g.
>>because of QP overflow), the preceding items are sent anyway. This
>>behavior makes it almost impossible to get error recovery right for
>>block device implementations that use ib_post_send() (e.g. the SRPT
>>target implementation).
>
> Yes - this is the correct behavior. The bad_wr pointer should reference the
> WR that failed, with all WRs in the list past that point being returned
> unprocessed. This is the reason for having the bad_wr in the call. Error
> recovery shouldn't be any more difficult than posting one WR at a time.
>
>>If my interpretation of the section about verbs in the InfiniBand
>>Architecture Specification is correct, either all work requests should
>>be processed or none. A quote from section 11.4.1.1, Post Send Request
>>(page 622 in volume 1 of release 1.2.1):
>
> The IB spec does not define an API. For performance reasons, you don't want
> the implementation to walk through the WR list multiple times - once to
> check it, then a second time to actually post the requests to the hardware.

Thanks for the feedback. I have two further questions:

- Where can IB driver developers find a detailed specification of the
  verbs API they should implement? I learned the details of the behavior
  of ib_post_send() by reading the mlx4 source code. Shouldn't this
  behavior be documented in include/rdma/ib_verbs.h instead?
- Does walking over the WR list twice always result in inferior
  performance compared to walking over it once? Both the iSER protocol
  and the SRP protocol allow sending large sg lists (e.g. containing 128
  elements) over the wire at once. When using asynchronous (buffered)
  I/O, this maximum is often reached. One interesting performance
  optimization is to send all 128 sg elements with a single
  ib_post_send() call and to request a completion notification for the
  last WR only. But if the ib_post_send() call returns an immediate error
  after having sent part of the WR list, no completion notification will
  be received. So code that calls ib_post_send() has to request a
  completion notification for each WR, which has a negative performance
  impact.

My opinion is that the current behavior makes ib_post_send() easier to
implement, while the behavior specified in the IBAS is more useful for
applications that use the verbs API.

Bart.
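
To make the scenario above concrete, here is a minimal user-space sketch of the single-pass behavior under discussion. It is not the kernel verbs API: struct mock_wr, struct mock_qp, mock_post_send() and the helper functions are hypothetical stand-ins for struct ib_send_wr, the send queue and ib_post_send(), and the error value is a placeholder for a negative errno. It only illustrates that when the last WR alone is signaled and the post fails partway, the WRs that were already posted include no signaled WR.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ib_send_wr; NOT the kernel type. */
struct mock_wr {
    struct mock_wr *next;
    int wr_id;
    int signaled;            /* 1 = a completion was requested for this WR */
};

/* Hypothetical stand-in for a send queue with limited capacity. */
struct mock_qp {
    int free_slots;          /* remaining send-queue entries */
    int posted;              /* WRs handed to the "hardware" so far */
};

/*
 * Single-pass post, modelled on the behavior described in this thread:
 * WRs preceding the failing one stay posted, and *bad_wr is set to the
 * first WR that was not posted.
 */
static int mock_post_send(struct mock_qp *qp, struct mock_wr *wr,
                          struct mock_wr **bad_wr)
{
    assert(bad_wr != NULL);
    *bad_wr = NULL;
    for (; wr != NULL; wr = wr->next) {
        if (qp->free_slots == 0) {
            *bad_wr = wr;
            return -1;       /* placeholder for a negative errno */
        }
        qp->free_slots--;
        qp->posted++;
    }
    return 0;
}

/* Link n WRs into a chain and request a completion for the last one only. */
static void build_chain(struct mock_wr *wrs, int n)
{
    for (int i = 0; i < n; i++) {
        wrs[i].next = (i + 1 < n) ? &wrs[i + 1] : NULL;
        wrs[i].wr_id = i;
        wrs[i].signaled = (i == n - 1);
    }
}

/* Count the WRs at the head of the list that precede bad_wr. */
static int count_posted(const struct mock_wr *head,
                        const struct mock_wr *bad_wr)
{
    int n = 0;
    for (; head != NULL && head != bad_wr; head = head->next)
        n++;
    return n;
}

/* Return 1 if any WR preceding bad_wr requested a completion, else 0. */
static int prefix_has_signaled(const struct mock_wr *head,
                               const struct mock_wr *bad_wr)
{
    for (; head != NULL && head != bad_wr; head = head->next)
        if (head->signaled)
            return 1;
    return 0;
}
```

With a chain of four WRs and only two free queue slots, mock_post_send() fails with bad_wr pointing at the third WR, count_posted() reports two posted WRs, and prefix_has_signaled() returns 0, which is the case where the caller would wait for a completion that never arrives.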
