A ULP can post receives before the connection is established, but it cannot post to the send queue until the connection is established.
Arkady Kanevsky                 email: [EMAIL PROTECTED]
Network Appliance Inc.          phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.    Fax: 781-895-1195
Waltham, MA 02451               central phone: 781-768-5300

> -----Original Message-----
> From: Glenn Grundstrom [mailto:[EMAIL PROTECTED]
> Sent: Sunday, November 25, 2007 9:00 PM
> To: Steve Wise; Kanevsky, Arkady
> Cc: Leonid Grossman; [EMAIL PROTECTED]
> Subject: RE: [ofa-general] Re: iWARP peer-to-peer CM proposal
>
> > Kanevsky, Arkady wrote:
> > > Very good points.
> > > Thanks Steve.
> > >
> > > If we can do unsignalled 0-size RDMA Read with "bogus" S-tag
> > > this may work better.
> > > Yes, it will require that IRD not be set to 0 at Responder.
> > > Ditto ORD of at least 1 on Responder.
> > > There is no need to have extra CQ entry on either side for it.
> > > It is only needed for error path.
> > > So this will only be needed if Sender posted the full queue of
> > > sends. But it can not post anything because CM will not let it
> > > know that connection is established.
> >
> > Well, actually, I think the ULP _can_ post before establishing the
> > connection. But I guess we can define the semantics such that
> > applications using the rdma-cm interface must adhere to whatever we
> > need to make this hack work.
> >
> > Q: are there apps using the rdma-cm out there today that pre-post SQ
> > WRs before getting an ESTABLISHED event?
> >
> > Steve.
>
> ULPs are allowed to post prior to establishing the
> connection, but I can't name any that operate this way.
> Prohibiting applications that use the rdma_cm directly from
> pre-posting is okay, but what about ULPs over other ULPs
> (e.g. MPI over uDAPL)? How can/will this be handled?
>
> Glenn.
>
> > > Happy Thanksgiving,
> > >
> > > Arkady Kanevsky                 email: [EMAIL PROTECTED]
> > > Network Appliance Inc.          phone: 781-768-5395
> > > 1601 Trapelo Rd. - Suite 16.    Fax: 781-895-1195
> > > Waltham, MA 02451               central phone: 781-768-5300
> > >
> > >> -----Original Message-----
> > >> From: Steve Wise [mailto:[EMAIL PROTECTED]
> > >> Sent: Wednesday, November 21, 2007 1:07 PM
> > >> To: Kanevsky, Arkady
> > >> Cc: Glenn Grundstrom; Leonid Grossman; [EMAIL PROTECTED]
> > >> Subject: [ofa-general] Re: iWARP peer-to-peer CM proposal
> > >>
> > >> Comments in-line below...
> > >>
> > >> Kanevsky, Arkady wrote:
> > >>> Group,
> > >>>
> > >>> below is proposal on how to resolve peer-to-peer iWARP CM issue
> > >>> discovered at interop event.
> > >>>
> > >>> The main issue is that MPA spec (relevant portion of IETF RFC 5044
> > >>> is below) requires that connection initiator send first message
> > >>> over the established connection.
> > >>>
> > >>> Multiple MPI implementations and several other apps use
> > >>> peer-to-peer model.
> > >>>
> > >>> So rather than forcing all of them to do it on their own, which
> > >>> will not help with interop between different implementations, the
> > >>> goal is to extend lower layers to provide it.
> > >>>
> > >>> Our first idea was to leave MPA protocol untouched and try to
> > >>> solve this problem in iw_cm. But there are too many complications
> > >>> to it. First, in order to adhere to RFC5044 initiator must send
> > >>> first FPDU and responder process it. But since the connection is
> > >>> already established processing FPDU involves ULP on whose behalf
> > >>> the connection is created.
> > >>>
> > >>> So either initiator sends a message which generates completion on
> > >>> responder CQ, thus visible to ULP, or not.
> > >>> In the latter case, the only op which can do it is RDMA one,
> > >>> which means that responder somehow provided initiator S-tag which
> > >>> it can use. So, this is an extension to MPA, probably using
> > >>> private data. And that responder upon receiving it destroy this
> > >>> S-tag.
> > >>>
> > >>> In any case this is an extension of MPA.
> > >>
> > >> This stag exchange isn't needed if this RDMA op is a 0B READ.
> > >> The responder waits for that 0B read and only indicates the rdma
> > >> connection is established to its ULP when it replies to the 0B
> > >> read. In this scenario, the responder/server side doesn't consume
> > >> any CQ resources.
> > >> But it would require an IRD of at least 1 to be configured on the
> > >> QP. The initiator still requires an SQ entry, and possibly a CQ
> > >> entry, for initiating the 0B read and handling completion.
> > >> But it's perhaps a little less painful than doing a SEND/RECV
> > >> exchange. The read wr could be unsignaled so that it won't
> > >> generate a CQE. But it still consumes an SQ WR slot so the SQ
> > >> would have to be sized to allow this extra WR. And I guess the CQ
> > >> would also need to be sized accordingly in case the read failed.
> > >>
> > >>> In the former, Send is used but this requires a buffer to be
> > >>> posted to CQ. But since the same CQ (or SharedCQ) can be used by
> > >>> other connections at the same time it can cause the responder CM
> > >>> posted buffer to be consumed by other connection. This is not
> > >>> acceptable.
> > >>>
> > >>> So now we consider extension to MPA protocol.
> > >>>
> > >>> The goal is to be completely backwards compatible to existing
> > >>> version 1.
> > >>>
> > >>> In a nutshell, use a "flag" in the MPA request message which
> > >>> indicates that "ready to receive" message will be sent by
> > >>> requestor upon receiving MPA response message with connection
> > >>> acceptance.
> > >>>
> > >>> here are the changes to IETF RFC5044
> > >>>
> > >>> 1.
> > >>>     0                   1                   2                   3
> > >>>     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
> > >>>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > >>> 0  |                                                               |
> > >>>    +         Key (16 bytes containing "MPA ID Req Frame")          +
> > >>> 4  |      (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)        |
> > >>>    +         Or  (16 bytes containing "MPA ID Rep Frame")          +
> > >>> 8  |      (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65)        |
> > >>>    +         Or  (16 bytes containing "MPA ID Rtr Frame")          +
> > >>> 12 |      (4D 50 41 20 49 44 20 52 74 52 20 46 72 61 6D 65)        |
> > >>>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > >>> 16 |M|C|R|S| Res   |     Rev       |          PD_Length            |
> > >>>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > >>>    |                                                               |
> > >>>    ~                                                               ~
> > >>>    ~                         Private Data                          ~
> > >>>    |                                                               |
> > >>>    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > >>>    |                               |
> > >>>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > >>>
> > >>> 2. S: indicator in the Req frame whether or not Requestor will
> > >>>    send Rtr frame.
> > >>>
> > >>>    In Req frame, if set to 1 then Rtr frame will be sent if
> > >>>    responder sends Rep frame with accept bit set. 0 indicates
> > >>>    that Rtr frame will not be sent.
> > >>>
> > >>>    In Rep frame, 0 means that Responder cannot support Rtr frame,
> > >>>    while 1 means that it can and is waiting for it.
> > >>>
> > >>>    (While my preference is to handle this as MPA protocol version
> > >>>    matching rules, proposed method will provide complete
> > >>>    backwards compatibility)
> > >>>
> > >>>    Unused by Rtr frame. That is, set to 0 in Rtr frame and
> > >>>    ignored by responder.
> > >>>
> > >>>    All other bits M,C,R and remainder of Res treated as in MPA
> > >>>    ver 1.
> > >>>
> > >>>    Rtr frame adheres to C bit as specified in Rep frame
> > >>
> > >> First, the RTR frame _must_ be an FPDU for this to work.
> > >> Thus it violates the DDP/RDMAP specs because it is an unknown
> > >> DDP/RDMAP opcode.
> > >>
> > >> Second, assuming the RTR frame is sent as an FPDU, then this won't
> > >> work with existing RNIC HW. The HW will post an async error
> > >> because the incoming DDP/RDMAP opcode is unknown.
> > >>
> > >> The only way I see that we can fix this for the existing rnic HW
> > >> is to come up with some way to send a valid RDMAP message from the
> > >> initiator to the responder under the covers -and- have the
> > >> responder only indicate that the connection is established when
> > >> that FPDU is received.
> > >>
> > >> Chelsio cannot support this hack via a 0B write, but they could
> > >> support a 0B read or send/recv exchange. But as you indicate, this
> > >> is very painful and perhaps impossible to do without impacting the
> > >> ULP and breaking verbs semantics.
> > >>
> > >> (that's why we punted on this a year ago :)
> > >>
> > >> Steve.
> > >>
> > >> _______________________________________________
> > >> general mailing list
> > >> [email protected]
> > >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > >>
> > >> To unsubscribe, please visit
> > >> http://openib.org/mailman/listinfo/openib-general
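
For readers following the frame layout in the quoted proposal, here is a minimal Python sketch of the header as I read it: a 16-byte key, one flags byte (M, C, R, the proposed S, then 4 reserved bits), Rev, and a 16-bit PD_Length in network byte order. The function names and the KEYS table are my own, purely for illustration, and are not part of any MPA implementation. The bit positions assume S takes the first reserved bit after R, as the diagram suggests, and "accept bit set" is read as the RFC 5044 R (rejected) bit being 0 in the Rep frame. One thing the sketch makes visible: the ASCII string "MPA ID Rtr Frame" encodes with bytes 52 74 72, whereas the hex in the proposal reads 52 74 52 (which would spell "RtR"), so one of the two is presumably a typo.

```python
import struct

# Key strings exactly as written in the proposal text (16 ASCII bytes each).
KEYS = {
    "req": b"MPA ID Req Frame",
    "rep": b"MPA ID Rep Frame",
    "rtr": b"MPA ID Rtr Frame",  # proposed new frame type
}

# Flags byte, most significant bit first: M, C, R as in RFC 5044, then the
# proposed S bit (assumed position); the low 4 bits remain reserved.
FLAG_M, FLAG_C, FLAG_R, FLAG_S = 0x80, 0x40, 0x20, 0x10


def pack_frame(kind, flags, private_data=b"", rev=1):
    """Build key(16) | flags(1) | rev(1) | PD_Length(2, big-endian) | data."""
    header = struct.pack("!BBH", flags, rev, len(private_data))
    return KEYS[kind] + header + private_data


def unpack_frame(frame):
    """Return (key, flags, rev, private_data) parsed from a packed frame."""
    flags, rev, pd_len = struct.unpack("!BBH", frame[16:20])
    return frame[:16], flags, rev, frame[20:20 + pd_len]


def rtr_expected(req_flags, rep_flags):
    """Rtr is sent only if Req set S, Rep set S, and Rep did not reject."""
    accepted = not (rep_flags & FLAG_R)
    return accepted and bool(req_flags & FLAG_S) and bool(rep_flags & FLAG_S)


def hex_of(key):
    """Format a key as space-separated upper-case hex bytes."""
    return " ".join(f"{b:02X}" for b in key)


if __name__ == "__main__":
    for name, key in KEYS.items():
        print(name, hex_of(key))
    req = pack_frame("req", FLAG_S, b"hello")
    print(unpack_frame(req))
    print(rtr_expected(FLAG_S, FLAG_S))           # both sides opted in
    print(rtr_expected(FLAG_S, 0))                # responder is MPA v1 only
    print(rtr_expected(FLAG_S, FLAG_S | FLAG_R))  # connection rejected
```

The negotiation helper also shows why the scheme is backwards compatible on paper: a responder that leaves S at 0 in its Rep frame simply never sees an Rtr frame. It does not address Steve's objection that existing RNIC hardware would treat the Rtr frame as an unknown opcode once the connection is in FPDU mode.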
