Re: ib_post_send in drivers

Bart Van Assche Sat, 21 Nov 2009 03:17:38 -0800

On Fri, Nov 20, 2009 at 9:08 PM, Sean Hefty <[email protected]> wrote:
>>mlx4/qp.c: mlx4_ib_post_send()
>>* when passing a list containing more than one item to
>>mlx4_ib_post_send(), and sending the second or later item fails (e.g.
>>because of QP overflow), the preceding items are sent anyway. This
>>behavior makes it almost impossible to get error recovery right for
>>block device implementations that use ib_post_send() (e.g. the SRPT
>>target implementation).
>
> Yes - this is the correct behavior.  The bad_wr pointer should reference the
> WR that failed, with all WRs in the list past that point being returned
> unprocessed.  This is the reason for having the bad_wr in the call.  Error
> recovery shouldn't be any more difficult than posting one WR at a time.
>
>>If my interpretation of the section about verbs in the InfiniBand
>>Architecture Specification is correct, either all work requests should
>>be processed or none. A quote from section 11.4.1.1, Post Send Request
>>(page 622 in volume 1 of release 1.2.1):
>
> The IB spec does not define an API.  For performance reasons, you don't want
> the implementation to walk through the WR list multiple times - once to check
> it, then a second time to actually post the requests to the hardware.


Thanks for the feedback. I have two further questions:
- Where can IB driver developers find a detailed specification of the
verbs API they are supposed to implement? I learned the details of how
ib_post_send() behaves by reading the mlx4 source code. Shouldn't this
behavior be documented in include/rdma/ib_verbs.h instead?
- Does walking the WR list twice always perform worse than walking it
once? Both the iSER protocol and the SRP protocol allow sending large
sg lists (e.g. containing 128 elements) over the wire at once. With
asynchronous (buffered) I/O, this maximum is often reached. One
interesting performance optimization is to send all 128 sg elements
with a single ib_post_send() call and to request a completion
notification for the last WR only. But if ib_post_send() returns an
immediate error after having posted part of the WR list, no completion
notification will be received for the WRs that were posted, because
the only signaled WR was never posted. So code that calls
ib_post_send() has to request a completion notification for each WR,
which hurts performance. In my opinion the current behavior makes
ib_post_send() easier to implement, while the behavior specified in
the IBAS is more useful for applications that use the verbs API.
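
To make the second point concrete, here is a minimal user-space sketch.
It is not kernel code and not the verbs API: struct mock_send_wr,
struct mock_qp, MOCK_SEND_SIGNALED and all three functions are
hypothetical stand-ins that only mimic the behavior discussed above.
mock_post_send() walks the list once and leaves earlier WRs posted when
a later one fails; mock_post_send_all_or_nothing() walks it twice and
posts either everything or nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a "request a completion" flag. */
#define MOCK_SEND_SIGNALED 0x1

/* Hypothetical stand-in for a send work request in a linked list. */
struct mock_send_wr {
	struct mock_send_wr *next;
	int send_flags;
};

/* Hypothetical send queue: just counters, no hardware involved. */
struct mock_qp {
	int free_slots;      /* remaining send queue entries */
	int posted;          /* WRs handed to the "hardware" */
	int posted_signaled; /* posted WRs that will generate a completion */
};

/*
 * Single-pass post: WRs preceding the failing one stay posted and
 * *bad_wr points at the first WR that was not posted.
 */
static int mock_post_send(struct mock_qp *qp, struct mock_send_wr *wr,
			  struct mock_send_wr **bad_wr)
{
	assert(qp && bad_wr);
	for (; wr; wr = wr->next) {
		if (qp->free_slots == 0) {
			*bad_wr = wr;
			return -1; /* a real driver would return a negative errno */
		}
		qp->free_slots--;
		qp->posted++;
		if (wr->send_flags & MOCK_SEND_SIGNALED)
			qp->posted_signaled++;
	}
	return 0;
}

/*
 * Two-pass post: count the list first, then post only if everything
 * fits. On failure nothing is posted and *bad_wr is the list head.
 */
static int mock_post_send_all_or_nothing(struct mock_qp *qp,
					 struct mock_send_wr *wr,
					 struct mock_send_wr **bad_wr)
{
	int needed = 0;

	assert(qp && bad_wr);
	for (struct mock_send_wr *p = wr; p; p = p->next)
		needed++;
	if (needed > qp->free_slots) {
		*bad_wr = wr;
		return -1;
	}
	return mock_post_send(qp, wr, bad_wr);
}

/* Chain wrs[0..n-1] together and signal only the last WR. */
static void mock_build_chain(struct mock_send_wr *wrs, int n)
{
	assert(wrs && n > 0);
	for (int i = 0; i < n; i++) {
		wrs[i].next = (i + 1 < n) ? &wrs[i + 1] : NULL;
		wrs[i].send_flags = (i + 1 == n) ? MOCK_SEND_SIGNALED : 0;
	}
}
```

With 128 chained WRs and room for only 100, the single-pass variant
leaves 100 WRs posted and none of them signaled, so the caller never
sees a completion for them. The two-pass variant leaves the queue
untouched, at the cost of the extra list walk Sean mentions.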

Bart.
