> Kanevsky, Arkady wrote:
> > Very good points.
> > Thanks Steve.
> >
> > If we can do unsignalled 0-size RDMA Read with "bogus" S-tag this may
> > work better.
> > Yes, it will require IRD not to be 0 set at Responder.
> > Ditto ORD of at least 1 on Responder.
> > There is no need to have extra CQ entry on either side for it.
> > It is only needed for error path.
> > So this will only be needed if Sender posted the full queue of sends.
> > But it can not post anything because CM will not let it know that
> > connection is established.
> >
>
> Well, actually, I think the ULP _can_ post before establishing the
> connection. But I guess we can define the semantics such that
> applications using the rdma-cm interface must adhere to whatever we need
> to make this hack work.
>
> Q: are there apps using the rdma-cm out there today that pre-post SQ WRs
> before getting a ESTABLISHED event?
>
> Steve.

ULPs are allowed to post prior to establishing the connection, but I
can't name any that operate this way. Prohibiting applications that use
the rdma_cm directly from pre-posting is okay, but what about ULPs
layered over other ULPs (e.g. MPI over uDAPL)? How can/will this be
handled?

Glenn.

> > Happy Thanksgiving,
> >
> > Arkady Kanevsky                 email: [EMAIL PROTECTED]
> > Network Appliance Inc.          phone: 781-768-5395
> > 1601 Trapelo Rd. - Suite 16.    Fax: 781-895-1195
> > Waltham, MA 02451               central phone: 781-768-5300
> >
> >> -----Original Message-----
> >> From: Steve Wise [mailto:[EMAIL PROTECTED]
> >> Sent: Wednesday, November 21, 2007 1:07 PM
> >> To: Kanevsky, Arkady
> >> Cc: Glenn Grundstrom; Leonid Grossman; [EMAIL PROTECTED]
> >> Subject: [ofa-general] Re: iWARP peer-to-peer CM proposal
> >>
> >> Comments in-line below...
> >>
> >> Kanevsky, Arkady wrote:
> >>> Group,
> >>>
> >>> below is proposal on how to resolve peer-to-peer iWARP CM issue
> >>> discovered at interop event.
> >>>
> >>> The main issue is that MPA spec (relevant portion of IETF RFC 5044
> >>> is below) requires that connection initiator send first message
> >>> over the established connection.
> >>> Multiple MPI implementations and several other apps use
> >>> peer-to-peer model.
> >>> So rather than forcing all of them to do it on their own, which
> >>> will not help with interop between different implementations, the
> >>> goal is to extend lower layers to provide it.
> >>>
> >>> Our first idea was to leave MPA protocol untouched and try to solve
> >>> this problem in iw_cm. But there are too many complications to it.
> >>> First, in order to adhere to RFC5044 initiator must send first FPDU
> >>> and responder process it.
> >>> But since the connection is already established processing FPDU
> >>> involves ULP on whose behalf the connection is created.
> >>> So either initiator sends a message which generates completion on
> >>> responder CQ, thus visible to ULP, or not.
> >>>
> >>> In the latter case, the only op which can do it is RDMA one, which
> >>> means that responder somehow provided initiator S-tag which it can
> >>> use. So, this is an extension to MPA, probably using private data.
> >>> And that responder upon receiving it destroys this S-tag.
> >>> In any case this is an extension of MPA.
> >>
> >> This stag exchange isn't needed if this RDMA op is a 0B READ.
> >> The responder waits for that 0B read and only indicates the
> >> rdma connection is established to its ULP when it replies to
> >> the 0B read. In this scenario, the responder/server side
> >> doesn't consume any CQ resources.
> >> But it would require an IRD of at least 1 to be configured on the QP.
> >> The initiator still requires an SQ entry, and possibly a CQ
> >> entry, for initiating the 0B read and handling completion.
> >> But it's perhaps a little less painful than doing a SEND/RECV
> >> exchange. The read wr could be unsignaled so that it won't
> >> generate a CQE. But it still consumes an SQ WR slot so the
> >> SQ would have to be sized to allow this extra WR. And I guess
> >> the CQ would also need to be sized accordingly in case the
> >> read failed.
> >>
> >>> In the former, Send is used but this requires a buffer to be posted
> >>> to CQ. But since the same CQ (or SharedCQ) can be used by other
> >>> connections at the same time it can cause the responder CM posted
> >>> buffer to be consumed by other connection.
> >>> This is not acceptable.
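
A minimal sketch of the resource accounting described above for the hidden 0B read, assuming the CM pads the limits the ULP asked for. The names `QpLimits`, `pad_for_hidden_read` and `ResponderCm` are hypothetical; they are not iw_cm, rdma-cm or verbs interfaces, and the code only models the bookkeeping, not any real hardware behaviour.

```python
from dataclasses import dataclass, field, replace
from typing import List


@dataclass(frozen=True)
class QpLimits:
    """Queue limits requested by the ULP (hypothetical structure)."""
    sq_depth: int  # send queue work-request slots
    cq_depth: int  # completion queue entries
    ird: int       # inbound RDMA Read depth
    ord_: int      # outbound RDMA Read depth


def pad_for_hidden_read(ulp: QpLimits, is_initiator: bool) -> QpLimits:
    """Return the limits the CM would need to hide one 0B read from the ULP."""
    if is_initiator:
        # One extra SQ slot for the read WR, one extra CQ entry in case the
        # unsignaled read fails, and ORD of at least 1 to issue the read.
        return replace(
            ulp,
            sq_depth=ulp.sq_depth + 1,
            cq_depth=ulp.cq_depth + 1,
            ord_=max(ulp.ord_, 1),
        )
    # The responder consumes no SQ/CQ resources but needs IRD of at least 1.
    return replace(ulp, ird=max(ulp.ird, 1))


@dataclass
class ResponderCm:
    """Responder side: report ESTABLISHED only once the 0B read is answered."""
    state: str = "MPA_REPLY_SENT"
    ulp_events: List[str] = field(default_factory=list)

    def on_zero_byte_read_replied(self) -> None:
        if self.state == "MPA_REPLY_SENT":
            self.state = "ESTABLISHED"
            self.ulp_events.append("ESTABLISHED")
```

For example, a ULP asking for 16 SQ slots, 32 CQ entries and no read depth would end up with 17/33 and ORD 1 on the initiator, while the responder keeps 16/32 and gets IRD 1.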
> >>>
> >>> So now we consider extension to MPA protocol.
> >>> The goal is to be completely backwards compatible to existing
> >>> version 1.
> >>> In a nutshell, use a "flag" in the MPA request message which
> >>> indicates that "ready to receive" message will be sent by requestor
> >>> upon receiving MPA response message with connection acceptance.
> >>>
> >>> here are the changes to IETF RFC5044
> >>>
> >>> 1.
> >>>     0                   1                   2                   3
> >>>     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
> >>>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> >>>  0 |                                                               |
> >>>    +         Key (16 bytes containing "MPA ID Req Frame")          +
> >>>  4 |      (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)        |
> >>>    +         Or  (16 bytes containing "MPA ID Rep Frame")          +
> >>>  8 |      (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65)        |
> >>>    +         Or  (16 bytes containing "MPA ID Rtr Frame")          +
> >>> 12 |      (4D 50 41 20 49 44 20 52 74 52 20 46 72 61 6D 65)        |
> >>>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> >>> 16 |M|C|R|S| Res   |     Rev       |          PD_Length            |
> >>>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> >>>    |                                                               |
> >>>    ~                                                               ~
> >>>    ~                   Private Data                                ~
> >>>    |                                                               |
> >>>    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> >>>    |                               |
> >>>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> >>>
> >>> 2. S: indicator in the Req frame whether or not Requestor will send
> >>>    Rtr frame.
> >>>    In Req frame, if set to 1 then Rtr frame will be sent if responder
> >>>    sends Rep frame with accept bit set. 0 indicates that Rtr frame
> >>>    will not be sent.
> >>>    In Rep frame, 0 means that Responder cannot support Rtr frame,
> >>>    while 1 means that it can and is waiting for it.
> >>>    (While my preference is to handle this as MPA protocol version
> >>>    matching rules, proposed method will provide complete backwards
> >>>    compatibility)
> >>>    Unused by Rtr frame. That is set to 0 in Rtr frame and ignored
> >>>    by responder.
> >>>
> >>>    All other bits M,C,R and remainder of Res treated as in MPA ver 1.
> >>>
> >>>    Rtr frame adheres to C bit as specified in Rep frame
> >>
> >> First, the RTR frame _must_ be an FPDU for this to work.
> >> Thus it violates the DDP/RDMAP specs because it is an unknown
> >> DDP/RDMAP opcode.
> >>
> >> Second, assuming the RTR frame is sent as an FPDU, then this
> >> won't work with existing RNIC HW. The HW will post an async
> >> error because the incoming DDP/RDMAP opcode is unknown.
> >>
> >> The only way I see that we can fix this for the existing rnic
> >> HW is to come up with some way to send a valid RDMAP message
> >> from the initiator to the responder under the covers -and-
> >> have the responder only indicate that the connection is
> >> established when that FPDU is received.
> >>
> >> Chelsio cannot support this hack via a 0B write, but they
> >> could support a 0B read or send/recv exchange. But as you
> >> indicate, this is very painful and perhaps impossible to do
> >> without impacting the ULP and breaking verbs semantics.
> >>
> >> (that's why we punted on this a year ago :)
> >>
> >> Steve.
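
The frame layout and S-bit rules quoted above can be sketched roughly as follows. This is an illustration only: the S bit exists only in the proposal, and its position (`FLAG_S = 0x10`, the fourth most significant bit of the flags byte) is read off the quoted diagram. `pack_frame`, `parse_frame` and `rtr_expected` are hypothetical helper names. The Rtr key is left out because the quoted hex bytes (`52 74 52`) spell "RtR" while the text says "Rtr". Following RFC 5044, a reply with R = 0 is treated as an accept.

```python
import struct

KEY_REQ = b"MPA ID Req Frame"
KEY_REP = b"MPA ID Rep Frame"

FLAG_M = 0x80  # markers in use
FLAG_C = 0x40  # CRC in use
FLAG_R = 0x20  # rejected connection (meaningful in the Rep frame)
FLAG_S = 0x10  # proposed bit: Rtr frame will be sent / is awaited

# 16-byte key, flags byte (M, C, R, S, Res), Rev, PD_Length, network order.
_HEADER = struct.Struct(">16sBBH")


def pack_frame(key: bytes, flags: int, rev: int, private_data: bytes = b"") -> bytes:
    """Build an MPA Req/Rep frame: fixed 20-byte header plus private data."""
    if len(key) != 16:
        raise ValueError("key must be exactly 16 bytes")
    return _HEADER.pack(key, flags, rev, len(private_data)) + private_data


def parse_frame(buf: bytes):
    """Split a frame into (key, flags, rev, private_data)."""
    key, flags, rev, pd_len = _HEADER.unpack_from(buf)
    private_data = buf[_HEADER.size:_HEADER.size + pd_len]
    if len(private_data) != pd_len:
        raise ValueError("truncated private data")
    return key, flags, rev, private_data


def rtr_expected(req_flags: int, rep_flags: int) -> bool:
    """True only if both sides set S and the reply accepted the connection."""
    accepted = not (rep_flags & FLAG_R)
    return bool(req_flags & FLAG_S) and bool(rep_flags & FLAG_S) and accepted
```

A responder that leaves the proposed bit at 0 in its Rep frame makes `rtr_expected` return `False`, which is the backwards-compatible case the proposal describes.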
> >>
> >> _______________________________________________
> >> general mailing list
> >> [email protected]
> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >>
> >> To unsubscribe, please visit
> >> http://openib.org/mailman/listinfo/openib-general
