RE: [ofa-general] Re: iWARP peer-to-peer CM proposal
I agree with the initiator/client sending a signalled 0B RDMA Read. This will handle the client side. I am still not 100% clear on the passive/server side. Two issues bother me:

1. Is a bogus S-tag allowed for incoming RDMA ops? I do not recall that RDDP requires that the length be checked before the S-tag.
2. How does the verbs layer on the server side know that the RDMA Read op came and was done? Is it some back door to vendor FW? Will this be kicked for all incoming RDMA Read ops?

Arkady Kanevsky email: [EMAIL PROTECTED] Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300

-----Original Message-----
From: Steve Wise [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 27, 2007 7:48 PM
To: Caitlin Bestler
Cc: Kanevsky, Arkady; Glenn Grundstrom; Leonid Grossman; [EMAIL PROTECTED]
Subject: Re: [ofa-general] Re: iWARP peer-to-peer CM proposal

Caitlin Bestler wrote: On Nov 27, 2007 3:58 PM, Steve Wise [EMAIL PROTECTED] wrote:

For the short term, I claim we just implement this as part of linux iwarp connection setup (mandating a 0B read be sent from the active side). Your proposal to add meta-data to the private data requires a standards change anyway and is, IMO, the 2nd phase of this whole enchilada... Steve.

I don't see how you can have any solution here that does not require meta-data. For non-peer-to-peer connections neither a zero-length RDMA Read nor Write should be sent. An extraneous RDMA Read is particularly onerous for a short-lived connection that fits the classic active/passive model. So *something* is telling the CMA layer that this connection may need an MPA unjam action. If that isn't meta-data, what is it?

I assumed the 0B read would _always_ be sent as part of establishing an iWARP connection using linux and the rdma-cm.

Further, the RDMA Read solution is adequate whenever the RDMA Write solution would have been (although at an unnecessary extra cost), but as near as I can determine it is not a complete solution.
If the passive side needs an untagged message completion then *something* needs to send it. How can the CM layer (or, I suppose, the ULP itself) know that this untagged NOP message must be sent without meta-data? I believe at Reno we had the current rnic vendors all saying a SEND or 0B read will work. So: If someone has current iwarp HW that will _not_ handle this problem by doing the 0B read hack, please speak up now. As I see it, if we want to do the minimum that is required, but be certain that it is adequate, we need a per-connection setup meta-data exchange. Are you going to prototype this? Steve. ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [ofa-general] Re: iWARP peer-to-peer CM proposal
Any posting to the SQ prior to connection establishment will complete immediately with the flushed status.

Arkady Kanevsky email: [EMAIL PROTECTED] Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300

-----Original Message-----
From: Glenn Grundstrom [mailto:[EMAIL PROTECTED]
Sent: Sunday, November 25, 2007 9:00 PM
To: Steve Wise; Kanevsky, Arkady
Cc: Leonid Grossman; [EMAIL PROTECTED]
Subject: RE: [ofa-general] Re: iWARP peer-to-peer CM proposal

Kanevsky, Arkady wrote:

Very good points. Thanks Steve. If we can do an unsignalled 0-size RDMA Read with a bogus S-tag this may work better. Yes, it will require IRD not to be set to 0 at the Responder. Ditto ORD of at least 1 on Responder. There is no need to have an extra CQ entry on either side for it. It is only needed for the error path. So this will only be needed if the Sender posted a full queue of sends. But it cannot post anything because the CM will not let it know that the connection is established.

Well, actually, I think the ULP _can_ post before establishing the connection. But I guess we can define the semantics such that applications using the rdma-cm interface must adhere to whatever we need to make this hack work. Q: are there apps using the rdma-cm out there today that pre-post SQ WRs before getting an ESTABLISHED event? Steve.

ULPs are allowed to post prior to establishing the connection, but I can't name any that operate this way. Prohibiting applications that use the rdma_cm directly from pre-posting is okay, but what about ULPs over other ULPs (e.g. MPI over uDAPL)? How can/will this be handled? Glenn.

Happy Thanksgiving,
Arkady Kanevsky email: [EMAIL PROTECTED] Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. 
- Suite 16.Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 -Original Message- From: Steve Wise [mailto:[EMAIL PROTECTED] Sent: Wednesday, November 21, 2007 1:07 PM To: Kanevsky, Arkady Cc: Glenn Grundstrom; Leonid Grossman; [EMAIL PROTECTED] Subject: [ofa-general] Re: iWARP peer-to-peer CM proposal Comments in-line below... Kanevsky, Arkady wrote: Group, below is proposal on how to resolve peer-to-peer iWARP CM issue discovered at interop event. The main issue is that MPA spec (relevant portion of IETF RFC 5044 is below) require that connection initiator send first message over the established connection. Multiple MPI implementations and several other apps use peer-to-peer model. So rather then forcing all of them to do it on their own, which will not help with interop between different implementations, the goal is to extend lower layers to provide it. Our first idea was to leave MPA protocol untouched and try to solve this problem in iw_cm. But there are too many complications to it. First, in order to adhere to RFC5044 initiator must send first FPDU and responder process it. But since the connection is already established processing FPDU involves ULP on whose behalf the connection is created. So either initiator sends a message which generates completion on responder CQ, thus visible to ULP, or not. In the later case, the only op which can do it is RDMA one, which means that responder somehow provided initiator S-tag which it can use. So, this is an extension to MPA, probably using private data. And that responder upon receiving it destroy this S-tag. In any case this is an extension of MPA. This stag exchange isn't needed if this RDMA op is a 0B READ. The responder waits for that 0B read and only indicates the rdma connection is established to its ULP when it replies to the 0B read. In this scenario, the responder/server side doesn't consume any CQ resources. But it would require an IRD of at least 1 to be configured on the QP. 
The initiator still requires an SQ entry, and possibly a CQ entry, for initiating the 0B read and handling completion. But its perhaps a little less painful than doing a SEND/RECV exchange. The read wr could be unsignaled so that it won't generate a CQE. But it still consumes an SQ WR slot so the SQ would have to be sized to allow this extra WR. And I
RE: [ofa-general] Re: iWARP peer-to-peer CM proposal
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:general- [EMAIL PROTECTED] On Behalf Of Kanevsky, Arkady
Sent: Wednesday, November 28, 2007 5:30 AM
To: Steve Wise; Caitlin Bestler
Cc: Leonid Grossman; [EMAIL PROTECTED]
Subject: RE: [ofa-general] Re: iWARP peer-to-peer CM proposal

Agree with initiator/client sending signalled 0B RDMA Read. This will handle client side. Still not 100% clear on passive/server side. Two issues which bother me. 1. Is bogus S-tag allowed for incoming RDMA ops? I do not recall that RDDP requires that length is checked before S-tag. 2. How does the verbs layer on the server side know that the RDMA Read op came and was done? Is it some back door to vendor FW? Will this be kicked for all incoming RDMA Read ops?

As you point out, the server Verbs layer is not aware of an incoming 0B RDMA Read (or Write for that matter). Hence some kind of magic must happen in the adapter, where we vendors will have a choice:

a) just 'unjam' the SQ in the adapter (which means that the CM layer works as today and the server can post SQ ops before the 'unjam' is received, but they won't make it to the wire), or
b) send a back-door command to the CM, which can then move the state machine to established only after the 'unjam' is received.

Whatever is done, it cannot happen for all zero-length RDMA Reads (or Writes for that matter). Hence the adapter must be informed that the next zero-length op is the 'unjam' message (which also means that the server side could, in theory, omit sending the RDMA Read Response, because the RDMA Read Request was really an 'unjam'... not that I would be pushing for such an 'optimization' to avoid an extra wire message).

Arkady Kanevsky email: [EMAIL PROTECTED] Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. 
- Suite 16.Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 -Original Message- From: Steve Wise [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 27, 2007 7:48 PM To: Caitlin Bestler Cc: Kanevsky, Arkady; Glenn Grundstrom; Leonid Grossman; [EMAIL PROTECTED] Subject: Re: [ofa-general] Re: iWARP peer-to-peer CM proposal Caitlin Bestler wrote: On Nov 27, 2007 3:58 PM, Steve Wise [EMAIL PROTECTED] wrote: For the short term, I claim we just implement this as part of linux iwarp connection setup (mandating a 0B read be sent from the active side). Your proposal to add meta-data to the private data requires a standards change anyway and is, IMO, the 2nd phase of this whole enchilada... Steve. I don't see how you can have any solution here that does not require meta-data. For non-peer-to-peer connections neither a zero length RDMA Read or Write should be sent. An extraneous RDMA Read is particularly onerous for a short lived connection that fits the classic active/passive model. So *something* is telling the CMA layer that this connection may need an MPA unjam action. If that isn't meta-data, what is it? I assumed the 0B read would _always_ be sent as part of establishing an iWARP connection using linux and the rdma-cm. Further, the RDMA Read solution is adequate whenever the RDMA Write solution would have been (although at an unnecessary extra cost), but as near as I can determine it is not a complete solution. If the passive side needs an untagged message completion then *something* needs to send it. How can the CM layer (or, I suppose, the ULP itself) know that this untagged NOP message must be sent without meta-data? I believe at Reno we had the current rnic vendors all saying a SEND or 0B read will work. So: If someone has current iwarp HW that will _not_ handle this problem by doing the 0B read hack, please speak up now. 
As I see it, if we want to do the minimum that is required, but be certain that it is adequate, we need a per-connection setup meta-data exchange. Are you going to prototype this? Steve. ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- general ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [ofa-general] Re: iWARP peer-to-peer CM proposal
-Original Message- From: Steve Wise [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 27, 2007 4:48 PM To: Caitlin Bestler Cc: Kanevsky, Arkady; Glenn Grundstrom; Leonid Grossman; openib- [EMAIL PROTECTED] Subject: Re: [ofa-general] Re: iWARP peer-to-peer CM proposal Caitlin Bestler wrote: On Nov 27, 2007 3:58 PM, Steve Wise [EMAIL PROTECTED] wrote: For the short term, I claim we just implement this as part of linux iwarp connection setup (mandating a 0B read be sent from the active side). Your proposal to add meta-data to the private data requires a standards change anyway and is, IMO, the 2nd phase of this whole enchilada... Steve. I don't see how you can have any solution here that does not require meta-data. For non-peer-to-peer connections neither a zero length RDMA Read or Write should be sent. An extraneous RDMA Read is particularly onerous for a short lived connection that fits the classic active/passive model. So *something* is telling the CMA layer that this connection may need an MPA unjam action. If that isn't meta-data, what is it? I assumed the 0B read would _always_ be sent as part of establishing an iWARP connection using linux and the rdma-cm. That is an extra round-trip per connection setup, which is a significant penalty for a short lived connection. It is trivial for HPC/peer-to-peer applications, but would be a killer for something like HTTP over RDMA. Doing something like this for *every* connection makes it effectively a change to the MPA protocol. OFA is not the forum for such discussions, the IETF is. OFA drafting an understanding of how peer-to-peer applications use the existing protocol, on the other hand, is quite reasonable. But it has to be something done by peer-to-peer middleware or by the verbs layer in response to a flag from the peer-to-peer middleware. Otherwise it is not augmenting a protocol, it is changing it. 
Re: [ofa-general] Re: iWARP peer-to-peer CM proposal
Caitlin Bestler wrote: -Original Message- From: Steve Wise [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 27, 2007 4:48 PM To: Caitlin Bestler Cc: Kanevsky, Arkady; Glenn Grundstrom; Leonid Grossman; openib- [EMAIL PROTECTED] Subject: Re: [ofa-general] Re: iWARP peer-to-peer CM proposal Caitlin Bestler wrote: On Nov 27, 2007 3:58 PM, Steve Wise [EMAIL PROTECTED] wrote: For the short term, I claim we just implement this as part of linux iwarp connection setup (mandating a 0B read be sent from the active side). Your proposal to add meta-data to the private data requires a standards change anyway and is, IMO, the 2nd phase of this whole enchilada... Steve. I don't see how you can have any solution here that does not require meta-data. For non-peer-to-peer connections neither a zero length RDMA Read or Write should be sent. An extraneous RDMA Read is particularly onerous for a short lived connection that fits the classic active/passive model. So *something* is telling the CMA layer that this connection may need an MPA unjam action. If that isn't meta-data, what is it? I assumed the 0B read would _always_ be sent as part of establishing an iWARP connection using linux and the rdma-cm. That is an extra round-trip per connection setup, which is a significant penalty for a short lived connection. It is trivial for HPC/peer-to-peer applications, but would be a killer for something like HTTP over RDMA. Doing something like this for *every* connection makes it effectively a change to the MPA protocol. OFA is not the forum for such discussions, the IETF is. OFA drafting an understanding of how peer-to-peer applications use the existing protocol, on the other hand, is quite reasonable. But it has to be something done by peer-to-peer middleware or by the verbs layer in response to a flag from the peer-to-peer middleware. Otherwise it is not augmenting a protocol, it is changing it. posting a 0B read after the mpa setup isn't changing the MPA protocol. 
It's adding a protocol on top of the MPA setup in order to meet the requirements of the MPA protocol. Whether you add a private-data request for this or _assume_ the 0B read will happen doesn't change this.
RE: [ofa-general] Re: iWARP peer-to-peer CM proposal
On Wed, 2007-11-28 at 11:43 -0500, Caitlin Bestler wrote: -Original Message- From: Steve Wise [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 27, 2007 4:48 PM To: Caitlin Bestler Cc: Kanevsky, Arkady; Glenn Grundstrom; Leonid Grossman; openib- [EMAIL PROTECTED] Subject: Re: [ofa-general] Re: iWARP peer-to-peer CM proposal Caitlin Bestler wrote: On Nov 27, 2007 3:58 PM, Steve Wise [EMAIL PROTECTED] wrote: For the short term, I claim we just implement this as part of linux iwarp connection setup (mandating a 0B read be sent from the active side). Your proposal to add meta-data to the private data requires a standards change anyway and is, IMO, the 2nd phase of this whole enchilada... Steve. I don't see how you can have any solution here that does not require meta-data. For non-peer-to-peer connections neither a zero length RDMA Read or Write should be sent. An extraneous RDMA Read is particularly onerous for a short lived connection that fits the classic active/passive model. So *something* is telling the CMA layer that this connection may need an MPA unjam action. If that isn't meta-data, what is it? I assumed the 0B read would _always_ be sent as part of establishing an iWARP connection using linux and the rdma-cm. That is an extra round-trip per connection setup, which is a significant penalty for a short lived connection. It is trivial for HPC/peer-to-peer applications, but would be a killer for something like HTTP over RDMA. I find it hard to get excited about optimizing short lived connections for RDMA. I simply don't think it's an interesting use case. And btw, HTTP long ago got rid of short lived connections because it's painful even on TCP. Doing something like this for *every* connection makes it effectively a change to the MPA protocol. Uh. No, it doesn't. Normalizing the behavior of applications during connection setup doesn't change the underlying protocol. It adds another one on top. OFA is not the forum for such discussions, the IETF is. 
My living room, the dinner table, the local bar and this mailing list are perfectly acceptable forums for discussing a protocol. The IETF is the forum for standardizing one. Right now, I don't think we're ready to standardize, because we're still exploring the options; the first of which is NOT changing MPA. This group has the unique benefit of actually USING and IMPLEMENTING the protocol and therefore has some beneficial insights that may and should be shared. All that said revving the MPA protocol is way down the road. OFA drafting an understanding of how peer-to-peer applications use the existing protocol, on the other hand, is quite reasonable. That's step 1 and the 0B READ is one way to do it. But it has to be something done by peer-to-peer middleware or by the verbs layer in response to a flag from the peer-to-peer middleware. Otherwise it is not augmenting a protocol, it is changing it. The flag may be useful, however, I don't see the connection between the flag and complying with the MPA protocol. ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] Re: iWARP peer-to-peer CM proposal
Kanevsky, Arkady wrote:

ULP can post recvs before connection is established but not to send queue prior to connection establishment.

I hate quoting specs (and the RDMAC verbs spec isn't really any standard), but page 25 of draft-hilland-iwarp-verbs-v1.0 indicates it's OK to post SQ WRs when in Idle:

The QP MUST be in the Idle state following QP creation or when moved to this state with Modify QP. In this state, Send or Receive WRs MAY be posted but they MUST NOT be processed and CQEs MUST NOT be generated.
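The Idle-state rule quoted from the RDMAC verbs draft can be illustrated with a toy model. This is only a sketch of the quoted sentence, not a real verbs API: the class `ToyQP`, its methods, and the use of "RTS" as the state that starts processing are assumptions made for illustration.

```python
from collections import deque

class ToyQP:
    """Toy model of the quoted Idle-state rule (hypothetical, not a verbs API)."""

    def __init__(self):
        self.state = "IDLE"   # state after QP creation
        self.sq = deque()     # posted but not yet processed send WRs
        self.cq = []          # completion queue entries

    def post_send(self, wr_id):
        # Posting is allowed in Idle; the WR is only queued.
        self.sq.append(wr_id)
        self._process()

    def modify_to_rts(self):
        # Assumed transition that lets queued WRs start being processed.
        self.state = "RTS"
        self._process()

    def _process(self):
        # In Idle, WRs must not be processed and no CQEs are generated.
        if self.state != "RTS":
            return
        while self.sq:
            self.cq.append(self.sq.popleft())

qp = ToyQP()
qp.post_send("wr-1")
qp.post_send("wr-2")
idle_cqes = list(qp.cq)   # snapshot taken while the QP is still Idle
qp.modify_to_rts()
print(idle_cqes, qp.cq)
```

In this model the two WRs posted in Idle produce no completions until the QP leaves Idle, which is the behavior the 0B read hack has to coexist with if ULPs pre-post to the SQ.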
Re: [ofa-general] Re: iWARP peer-to-peer CM proposal
Kanevsky, Arkady wrote:

Agree with initiator/client sending signalled 0B RDMA Read. This will handle client side. Still not 100% clear on passive/server side. Two issues which bother me. 1. Is bogus S-tag allowed for incoming RDMA ops?

The STag/TO must not be validated if the incoming read is 0B in length. From http://www.ietf.org/rfc/rfc5040.txt:

* If the Data Source receives an RDMA Read Request Header with the RDMA Read Message Size set to zero, the Data Source RDMAP:
  * MUST NOT validate the Data Source STag and Data Source Tagged Offset contained in the RDMA Read Request Header, and
  * MUST respond with a zero-length RDMA Read Response Message.
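The quoted RFC 5040 rule can be sketched as a small responder function. This is a toy model only: the `ReadRequest` tuple, the `REGISTERED` table, and the use of `PermissionError` for a bad STag are hypothetical stand-ins, not how any RNIC or driver represents these things.

```python
from collections import namedtuple

# Hypothetical shape of an RDMA Read Request, for illustration only.
ReadRequest = namedtuple("ReadRequest", ["stag", "tagged_offset", "size"])

# Hypothetical table of registered memory regions, keyed by STag.
REGISTERED = {0x1234: bytes(64)}

def handle_read_request(req):
    """Return the payload of the RDMA Read Response for req."""
    if req.size == 0:
        # Zero-length read: the STag and Tagged Offset are not validated,
        # and a zero-length response is always returned.
        return b""
    region = REGISTERED.get(req.stag)
    if region is None or req.tagged_offset + req.size > len(region):
        # Stand-in for whatever error handling a real responder performs.
        raise PermissionError("invalid STag or out-of-bounds read")
    return region[req.tagged_offset:req.tagged_offset + req.size]

bogus_zero = handle_read_request(ReadRequest(0xDEAD, 0, 0))
valid_read = handle_read_request(ReadRequest(0x1234, 0, 8))
print(bogus_zero, len(valid_read))
```

Under this reading, a 0B read with a bogus S-tag is answered normally, while the same bogus S-tag with a non-zero length is rejected.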
RE: [ofa-general] Re: iWARP peer-to-peer CM proposal
Another small discrepancy between IB and iWARP. Since RDMA_CM is used by ULPs which are transport independent, they will follow the stricter rule. That is IB. For IB, any posting to the SQ prior to the QP being in the RTS state shall be flushed. This semantic is actually very useful for ULPs which use unsignalled completions, because once you see the completion for a request you posted after a connection failure, you can be sure that all previously posted requests on the same SQ have completed and that you have seen them all. So while you are correct on the spec, since we are working in IW_CM we can assume IB semantics on posting.

Thanks,
Arkady Kanevsky email: [EMAIL PROTECTED] Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300

-----Original Message-----
From: Steve Wise [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 28, 2007 1:52 PM
To: Kanevsky, Arkady
Cc: Glenn Grundstrom; Leonid Grossman; [EMAIL PROTECTED]
Subject: Re: [ofa-general] Re: iWARP peer-to-peer CM proposal

Kanevsky, Arkady wrote:

ULP can post recvs before connection is established but not to send queue prior to connection establishment.

I hate quoting specs (and the RDMAC verbs spec isn't really any standard), but page 25 of draft-hilland-iwarp-verbs-v1.0 indicates it's OK to post SQ WRs when in Idle:

The QP MUST be in the Idle state following QP creation or when moved to this state with Modify QP. In this state, Send or Receive WRs MAY be posted but they MUST NOT be processed and CQEs MUST NOT be generated.
Re: [ofa-general] Re: iWARP peer-to-peer CM proposal
On 11/29/07, Kanevsky, Arkady [EMAIL PROTECTED] wrote:

So while, you are correct on the spec since we are working in IW_CM we can assume IB semantic on posting.

Please spend a minute on http://www.zip.com.au/~akpm/linux/patches/stuff/top-posting.txt

Or.
RE: [ofa-general] Re: iWARP peer-to-peer CM proposal
ULP can post recvs before connection is established but not to send queue prior to connection establishment. Arkady Kanevsky email: [EMAIL PROTECTED] Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 -Original Message- From: Glenn Grundstrom [mailto:[EMAIL PROTECTED] Sent: Sunday, November 25, 2007 9:00 PM To: Steve Wise; Kanevsky, Arkady Cc: Leonid Grossman; [EMAIL PROTECTED] Subject: RE: [ofa-general] Re: iWARP peer-to-peer CM proposal Kanevsky, Arkady wrote: Very good points. Thanks Steve. If we can do unsignalled 0-size RDMA Read with bogus S-tag this may work better. Yes, it will require IRD not to be 0 set at Responder. Ditto ORD of at least 1 on Responder. There is no need to have extra CQ entry on either side for it. It is only needed for error path. So this will only be needed if Sender posted the full queue of sends. But it can not post anything because CM will not let it know that connection is established. Well, actually, I think the ULP _can_ post before establishing the connection. But I guess we can define the semantics such that applications using the rdma-cm interface must adhere to whatever we need to make this hack work. Q: are there apps using the rdma-cm out there today that pre-post SQ WRs before getting a ESTABLISHED event? Steve. ULPs are allowed to post prior to establishing the connection, but I can't name any that operate this way. Prohibiting applications that use the rdma_cm directly from pre-posting is okay, but what about ULP's over other ULP's (i.e. MPI over uDAPL). How can/will this be handled? Glenn. Happy Thanksgiving, Arkady Kanevsky email: [EMAIL PROTECTED] Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. 
- Suite 16.Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 -Original Message- From: Steve Wise [mailto:[EMAIL PROTECTED] Sent: Wednesday, November 21, 2007 1:07 PM To: Kanevsky, Arkady Cc: Glenn Grundstrom; Leonid Grossman; [EMAIL PROTECTED] Subject: [ofa-general] Re: iWARP peer-to-peer CM proposal Comments in-line below... Kanevsky, Arkady wrote: Group, below is proposal on how to resolve peer-to-peer iWARP CM issue discovered at interop event. The main issue is that MPA spec (relevant portion of IETF RFC 5044 is below) require that connection initiator send first message over the established connection. Multiple MPI implementations and several other apps use peer-to-peer model. So rather then forcing all of them to do it on their own, which will not help with interop between different implementations, the goal is to extend lower layers to provide it. Our first idea was to leave MPA protocol untouched and try to solve this problem in iw_cm. But there are too many complications to it. First, in order to adhere to RFC5044 initiator must send first FPDU and responder process it. But since the connection is already established processing FPDU involves ULP on whose behalf the connection is created. So either initiator sends a message which generates completion on responder CQ, thus visible to ULP, or not. In the later case, the only op which can do it is RDMA one, which means that responder somehow provided initiator S-tag which it can use. So, this is an extension to MPA, probably using private data. And that responder upon receiving it destroy this S-tag. In any case this is an extension of MPA. This stag exchange isn't needed if this RDMA op is a 0B READ. The responder waits for that 0B read and only indicates the rdma connection is established to its ULP when it replies to the 0B read. In this scenario, the responder/server side doesn't consume any CQ resources. But it would require an IRD of at least 1 to be configured on the QP. 
The initiator still requires an SQ entry, and possibly a CQ entry, for initiating the 0B read and handling completion. But its perhaps a little less painful than doing a SEND/RECV exchange. The read wr could be unsignaled so that it won't generate a CQE. But it still consumes an SQ WR slot so the SQ would have to be sized to allow this extra WR
Re: [ofa-general] Re: iWARP peer-to-peer CM proposal
On Nov 27, 2007 6:54 AM, Kanevsky, Arkady [EMAIL PROTECTED] wrote:

ULP can post recvs before connection is established but not to send queue prior to connection establishment.

ULP can post sends only after it is notified that the connection is established. The issue is when the iWARP layer can issue this notification. If the MPA layer implements fencing on its own, then the notification can be provided immediately after the MPA Request/Response exchange. If not, it must wait for the first MPA frame. The problem is that implementations that adhere too closely to the RDMAC verbs can obtain no information about the connection unless there is a CQE-producing event.
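One way to read the fencing option is sketched below as a toy model: the passive side reports the connection as established right after sending the MPA Response, but holds anything posted to the SQ until the first frame arrives from the active side. The class and method names are hypothetical, and real hardware would do this in the adapter rather than in Python lists.

```python
class FencedPassiveSide:
    """Toy passive endpoint with an MPA fence; all names are hypothetical."""

    def __init__(self):
        self.fenced = False
        self.events = []    # CM events delivered to the ULP
        self.pending = []   # SQ work held behind the fence
        self.wire = []      # messages actually transmitted

    def send_mpa_response(self):
        # Fence the SQ, then notify the ULP right away.
        self.fenced = True
        self.events.append("ESTABLISHED")

    def post_send(self, msg):
        # While fenced, posted work is accepted but not transmitted.
        (self.pending if self.fenced else self.wire).append(msg)

    def on_frame_from_active_side(self):
        # The first frame from the active side clears the fence
        # and releases whatever was held.
        self.fenced = False
        self.wire.extend(self.pending)
        self.pending.clear()

passive = FencedPassiveSide()
passive.send_mpa_response()
passive.post_send("hello")
held = list(passive.wire)          # nothing has reached the wire yet
passive.on_frame_from_active_side()
passive.post_send("world")
print(passive.events, held, passive.wire)
```

The point of the sketch is the ordering: the ULP sees ESTABLISHED immediately, yet nothing it posts reaches the wire before the active side has sent its first frame.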
Re: [ofa-general] Re: iWARP peer-to-peer CM proposal
Caitlin Bestler wrote: On Nov 27, 2007 6:54 AM, Kanevsky, Arkady [EMAIL PROTECTED] wrote:

ULP can post recvs before connection is established but not to send queue prior to connection establishment.

ULP can post sends only after it is notified that the connection is established. The issue is when the iWARP layer can issue this notification. If the MPA layer implements fencing on its own, then the notification can be provided immediately after the MPA Request/Response exchange. If not, it must wait for the first MPA frame. The problem is that implementations that adhere too closely to the RDMAC verbs can obtain no information about the connection unless there is a CQE-producing event.

The idea for this hack is that the passive side (the side that sends the MPA response) will hold off posting the ESTABLISHED event to the rdma-cm ULP until after it receives this 0B Read Request from the client...
Re: [ofa-general] Re: iWARP peer-to-peer CM proposal
The idea for this hack is that the passive side (the side that sends the MPA response) will hold off posting the ESTABLISHED event to the rdma-cm ULP until after it receives this 0B Read Request from the client...

What is notifying the passive side that the active side has completed a read request, and that it's okay to start sending?

Also, at least with IB, a QP can be configured on creation to always generate a CQ entry for all WRs posted to the send queue. I don't know if iWARP follows this same model.

- Sean
Re: [ofa-general] Re: iWARP peer-to-peer CM proposal
Sean Hefty wrote:

The idea for this hack is that the passive side (the side that sends the MPA response) will hold off posting the ESTABLISHED event to the rdma-cm ULP until after it receives this 0B Read Request from the client...

What is notifying the passive side that the active side has completed a read request, and that it's okay to start sending?

The iwarp provider driver will only post the IW_CM_ESTABLISHED event after receiving the read request. For the Chelsio provider, this will require changes to the rnic firmware and the driver/library to support all this. I haven't thought through exactly how this should be implemented. For instance, the provider library poll function needs to deal with this 0B read completion, note that it is this special connection-setup 0B read, and thus hide the completion from the user calling poll()...

Also, at least with IB, a QP can be configured on creation to always generate a CQ entry for all WRs posted to the send queue. I don't know if iWARP follows this same model.

After thinking about this more, I think we do want to make this 0B read signaled. Then we can post the IW_CM_ESTABLISHED event on the client side when the read request completes. So from the RDMA application's perspective, the connection never gets set up until this 0B read is completed, and that's really what we want...

Steve.
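The ordering described in this message can be sketched as a toy exchange between two provider objects. Everything here is illustrative: the `Provider` class, its method names, and the message strings are hypothetical, and the sketch says nothing about how firmware or a real driver would detect the setup read.

```python
class Provider:
    """Toy iWARP provider for one endpoint; names are illustrative only."""

    def __init__(self, role):
        self.role = role        # "active" or "passive"
        self.cm_events = []     # events the rdma-cm ULP would see
        self.user_cq = []       # completions the ULP would see from poll()

    def on_mpa_response(self):
        # Active side: MPA exchange is done, so issue the signaled 0B read.
        assert self.role == "active"
        return "READ_REQ_0B"

    def on_read_request(self, msg):
        # Passive side: answer the 0B read and only now report ESTABLISHED.
        assert self.role == "passive" and msg == "READ_REQ_0B"
        self.cm_events.append("ESTABLISHED")
        return "READ_RESP_0B"

    def on_read_response(self, msg):
        # Active side: the setup read completed. Its completion is consumed
        # here instead of being added to user_cq, then ESTABLISHED is reported.
        assert self.role == "active" and msg == "READ_RESP_0B"
        self.cm_events.append("ESTABLISHED")

active, passive = Provider("active"), Provider("passive")
req = active.on_mpa_response()
passive_before = list(passive.cm_events)   # passive has not reported yet
resp = passive.on_read_request(req)
active_before = list(active.cm_events)     # active has not reported yet
active.on_read_response(resp)
print(passive.cm_events, active.cm_events, active.user_cq)
```

In this model neither side reports ESTABLISHED before the 0B read has crossed the wire in the relevant direction, and the ULP on the active side never sees the setup read's completion.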
Re: [ofa-general] Re: iWARP peer-to-peer CM proposal
Caitlin Bestler wrote: On Nov 27, 2007 3:13 PM, Steve Wise [EMAIL PROTECTED] wrote: Caitlin Bestler wrote: On Nov 27, 2007 6:54 AM, Kanevsky, Arkady [EMAIL PROTECTED] wrote:

ULP can post recvs before connection is established but not to send queue prior to connection establishment.

ULP can post sends only after it is notified that the connection is established. The issue is when the iWARP layer can issue this notification. If the MPA layer implements fencing on its own, then the notification can be provided immediately after the MPA Request/Response exchange. If not, it must wait for the first MPA frame. The problem is that implementations that adhere too closely to the RDMAC verbs can obtain no information about the connection unless there is a CQE-producing event.

The idea for this hack is that the passive side (the side that sends the MPA response) will hold off posting the ESTABLISHED event to the rdma-cm ULP until after it receives this 0B Read Request from the client...

The problem is that this solution is being applied at the wrong layer. MPA is not the source of the problem, but rather the RDMAC layer verbs. The solution needs to be a verb-layer solution, not an MPA layer solution.

This isn't being solved at the MPA layer. It is being solved as a protocol exchange done after the MPA exchanges (and after the connections are transitioned into FPDU mode). Remember: This is a _hack_ to get our current generation of rnics to support peer-to-peer _without_ impacting the rdma applications (like IMPI and OMPI).

Steve's last comment states the problem well: we are trying to enable the Verbs layer on the Passive side to generate the Established event, and if at all possible to do so in a way that places no requirements on the application layer. I believe it is possible to do so without making any modifications to MPA.

Yes.

The MPA protocol requirement is a safeguard against receiving an MPA Frame before the MPA Response frame. 
MPA does not have or need an RTR message, because the MPA RFC allows *any* MPA frame from the active side to effectively acknowledge receipt of the MPA Response. Yes, but it puts the onus on the ULP to deal with this. In our current implementation model, that ULP is the top end application. That includes a zero-length RDMA Write. An iWARP implementation can (perhaps SHOULD) implement an MPA Fenced state on the passive side that is cleared on receipt of any MPA frame. With such an MPA Fence feature, the CM layer can generate the Connection Established event as soon as it sends the MPA Response and the Passive-side ULP will be able to post to the SQ; the messages just won't go on the wire until something is received. Meanwhile the Active Side must ensure that *some* MPA frame is sent immediately after the MPA Response is received. If it has traffic ready to go it can simply send that. If it does not, it can use a zero-length write. A zero-length write is totally transparent to the ULP at both ends. But that will only work for *some* implementations. On others a zero-length RDMA Read is needed to unjam things. That's almost transparent, but not totally so since it temporarily uses an RDMA Read credit. Right. Chelsio needs a Read vs a Write because the FW and driver don't detect the incoming 0B write so they cannot drive the ESTABLISHED event on that. And while nobody has spoken up to say *they* have that problem, I would not be surprised if there are implementations where nothing less than a full ULP nop message will suffice. So keeping the fix at the verbs layer, and allowing the minimal extra effort to be controlled by the Passive layer itself, suggests that the Passive side simply encode its MPA-unjam-action-required in the OFA standardized portion of the Private Data. Encodings would include:
- Any MPA Frame, including a zero-length RDMA Write will unjam the passive side SendQ.
- An untagged message or a zero-length RDMA Read will work.
- Only an untagged message will work.
So you're advocating adding a standardized header to the private data to indicate what the passive side needs. While we're at it, let's add in ORD/IRD ;-) In the latter cases the middleware will have to play games with stand-in receive WQEs and only posting the actual receive WQEs to the QP after the MPA fence has been unjammed. That isn't pretty, but if your hardware is fixed then it's either that or make the application deal with the problem. I have a hunch that the MPI developers would not like that option at all. How this differs from what Arkady proposed is that it avoids making any changes to MPA, but instead only makes use of the OFA defined portion of the Private Data. Further it allows use of a zero-length RDMA Write when that is sufficient to break the MPA logjam. A zero-length RDMA Write, unlike a zero-length RDMA Read, is *totally* transparent to the ULP. For the short term, I claim we just implement this as part of linux iwarp connection
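Editorial sketch (not part of the original thread): the three encodings and the passive-side MPA fence discussed in the message above can be modelled in a few lines of Python. The names `UnjamAction`, `Frame`, `cheapest_unjam_frame` and `PassiveSendQueue` are hypothetical and exist only for illustration; no such encoding was standardized in this thread, and the "cheapest first" ordering (write, then read, then untagged send) is an assumption drawn from the transparency argument above.

```python
from enum import IntEnum


class UnjamAction(IntEnum):
    """Hypothetical values the passive side could place in the private data."""
    ANY_MPA_FRAME = 0      # any MPA frame, including a zero-length RDMA Write
    UNTAGGED_OR_READ = 1   # an untagged message or a zero-length RDMA Read
    UNTAGGED_ONLY = 2      # only an untagged message


class Frame(IntEnum):
    """First frame the active side might send after the MPA Response."""
    RDMA_WRITE_0B = 0
    RDMA_READ_0B = 1
    UNTAGGED_SEND = 2


# Which frames satisfy each passive-side requirement.
_CLEARS = {
    UnjamAction.ANY_MPA_FRAME: {Frame.RDMA_WRITE_0B, Frame.RDMA_READ_0B,
                                Frame.UNTAGGED_SEND},
    UnjamAction.UNTAGGED_OR_READ: {Frame.RDMA_READ_0B, Frame.UNTAGGED_SEND},
    UnjamAction.UNTAGGED_ONLY: {Frame.UNTAGGED_SEND},
}


def cheapest_unjam_frame(action):
    """Active side: pick the least ULP-visible frame that is still sufficient."""
    for frame in (Frame.RDMA_WRITE_0B, Frame.RDMA_READ_0B, Frame.UNTAGGED_SEND):
        if frame in _CLEARS[action]:
            return frame
    raise ValueError(action)


class PassiveSendQueue:
    """SQ that accepts posts at once but holds them until the fence clears."""

    def __init__(self, action):
        self.action = action
        self.fenced = True
        self.pending = []   # posted by the ULP, not yet on the wire
        self.on_wire = []   # actually transmitted

    def post_send(self, msg):
        (self.pending if self.fenced else self.on_wire).append(msg)

    def on_frame_received(self, frame):
        # Only a frame that matches the advertised requirement clears the fence.
        if self.fenced and frame in _CLEARS[self.action]:
            self.fenced = False
            self.on_wire.extend(self.pending)
            self.pending.clear()
```

In this model a passive side that advertises `UNTAGGED_OR_READ` ignores an incoming zero-length write and releases its queued sends only once the zero-length read (or an untagged message) arrives.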
Re: [ofa-general] Re: iWARP peer-to-peer CM proposal
On Nov 27, 2007 3:58 PM, Steve Wise [EMAIL PROTECTED] wrote: For the short term, I claim we just implement this as part of linux iwarp connection setup (mandating a 0B read be sent from the active side). Your proposal to add meta-data to the private data requires a standards change anyway and is, IMO, the 2nd phase of this whole enchilada... Steve. I don't see how you can have any solution here that does not require meta-data. For non-peer-to-peer connections neither a zero-length RDMA Read nor Write should be sent. An extraneous RDMA Read is particularly onerous for a short-lived connection that fits the classic active/passive model. So *something* is telling the CMA layer that this connection may need an MPA unjam action. If that isn't meta-data, what is it? Further, the RDMA Read solution is adequate whenever the RDMA Write solution would have been (although at an unnecessary extra cost), but as near as I can determine it is not a complete solution. If the passive side needs an untagged message completion then *something* needs to send it. How can the CM layer (or, I suppose, the ULP itself) know that this untagged NOP message must be sent without meta-data? As I see it, if we want to do the minimum that is required, but be certain that it is adequate, we need a per-connection setup meta-data exchange. ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] Re: iWARP peer-to-peer CM proposal
Caitlin Bestler wrote: On Nov 27, 2007 3:58 PM, Steve Wise [EMAIL PROTECTED] wrote: For the short term, I claim we just implement this as part of linux iwarp connection setup (mandating a 0B read be sent from the active side). Your proposal to add meta-data to the private data requires a standards change anyway and is, IMO, the 2nd phase of this whole enchilada... Steve. I don't see how you can have any solution here that does not require meta-data. For non-peer-to-peer connections neither a zero-length RDMA Read nor Write should be sent. An extraneous RDMA Read is particularly onerous for a short-lived connection that fits the classic active/passive model. So *something* is telling the CMA layer that this connection may need an MPA unjam action. If that isn't meta-data, what is it? I assumed the 0B read would _always_ be sent as part of establishing an iWARP connection using linux and the rdma-cm. Further, the RDMA Read solution is adequate whenever the RDMA Write solution would have been (although at an unnecessary extra cost), but as near as I can determine it is not a complete solution. If the passive side needs an untagged message completion then *something* needs to send it. How can the CM layer (or, I suppose, the ULP itself) know that this untagged NOP message must be sent without meta-data? I believe at Reno we had the current rnic vendors all saying a SEND or 0B read will work. So: If someone has current iwarp HW that will _not_ handle this problem by doing the 0B read hack, please speak up now. As I see it, if we want to do the minimum that is required, but be certain that it is adequate, we need a per-connection setup meta-data exchange. Are you going to prototype this? Steve.
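Editorial sketch (not part of the original thread): the short-term hack described above, where the active side always sends a 0B read and the passive side withholds ESTABLISHED until that read request arrives, can be modelled as a small state machine. The class names, state strings and method names below are hypothetical; this is not the rdma-cm or iw_cm code, and it only illustrates the ordering of events.

```python
class PassiveSideCM:
    """Toy model of the passive side (the side that sends the MPA Response)."""

    def __init__(self):
        self.state = "MPA_REQUEST_RECEIVED"
        self.ulp_events = []   # events delivered to the rdma-cm ULP

    def send_mpa_response(self):
        # The connection moves to FPDU mode, but the ULP is not told yet.
        self.state = "AWAIT_0B_READ"

    def on_rdma_read_request(self, length):
        # Only the first zero-length read after the MPA Response is special.
        if self.state == "AWAIT_0B_READ" and length == 0:
            self.state = "ESTABLISHED"
            self.ulp_events.append("ESTABLISHED")


class ActiveSideCM:
    """Toy model of the active side (the side that sent the MPA Request)."""

    def __init__(self):
        self.send_queue = []

    def on_mpa_response(self):
        # Always post the 0B read first so it is the first FPDU on the wire.
        self.send_queue.append(("RDMA_READ", 0))


# Walk through one connection setup.
active, passive = ActiveSideCM(), PassiveSideCM()
passive.send_mpa_response()
events_before_read = list(passive.ulp_events)   # still empty at this point
active.on_mpa_response()
opcode, length = active.send_queue[0]
passive.on_rdma_read_request(length)
```

The point of the model is the ordering: `events_before_read` is empty, and the ESTABLISHED event only appears on the passive side after the 0B read request has been seen, so a passive-side ULP cannot put data on the wire ahead of the initiator's first FPDU.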
RE: [ofa-general] Re: iWARP peer-to-peer CM proposal
Kanevsky, Arkady wrote: Very good points. Thanks Steve. If we can do unsignalled 0-size RDMA Read with bogus S-tag this may work better. Yes, it will require that IRD not be set to 0 at the Responder. Ditto ORD of at least 1 on Responder. There is no need to have extra CQ entry on either side for it. It is only needed for error path. So this will only be needed if Sender posted the full queue of sends. But it can not post anything because CM will not let it know that connection is established. Well, actually, I think the ULP _can_ post before establishing the connection. But I guess we can define the semantics such that applications using the rdma-cm interface must adhere to whatever we need to make this hack work. Q: are there apps using the rdma-cm out there today that pre-post SQ WRs before getting an ESTABLISHED event? Steve. ULPs are allowed to post prior to establishing the connection, but I can't name any that operate this way. Prohibiting applications that use the rdma_cm directly from pre-posting is okay, but what about ULPs layered over other ULPs (e.g. MPI over uDAPL)? How can/will this be handled? Glenn. Happy Thanksgiving, Arkady Kanevsky email: [EMAIL PROTECTED] Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 -Original Message- From: Steve Wise [mailto:[EMAIL PROTECTED] Sent: Wednesday, November 21, 2007 1:07 PM To: Kanevsky, Arkady Cc: Glenn Grundstrom; Leonid Grossman; [EMAIL PROTECTED] Subject: [ofa-general] Re: iWARP peer-to-peer CM proposal Comments in-line below... Kanevsky, Arkady wrote: Group, below is a proposal on how to resolve the peer-to-peer iWARP CM issue discovered at the interop event. The main issue is that the MPA spec (relevant portion of IETF RFC 5044 is below) requires that the connection initiator send the first message over the established connection. Multiple MPI implementations and several other apps use a peer-to-peer model. 
So rather than forcing all of them to do it on their own, which will not help with interop between different implementations, the goal is to extend lower layers to provide it. Our first idea was to leave the MPA protocol untouched and try to solve this problem in iw_cm. But there are too many complications to it. First, in order to adhere to RFC5044 the initiator must send the first FPDU and the responder must process it. But since the connection is already established, processing the FPDU involves the ULP on whose behalf the connection is created. So either the initiator sends a message which generates a completion on the responder CQ, thus visible to the ULP, or not. In the latter case, the only op which can do it is an RDMA one, which means that the responder somehow provided the initiator an S-tag which it can use. So, this is an extension to MPA, probably using private data. And the responder, upon receiving it, destroys this S-tag. In any case this is an extension of MPA. This stag exchange isn't needed if this RDMA op is a 0B READ. The responder waits for that 0B read and only indicates the rdma connection is established to its ULP when it replies to the 0B read. In this scenario, the responder/server side doesn't consume any CQ resources. But it would require an IRD of at least 1 to be configured on the QP. The initiator still requires an SQ entry, and possibly a CQ entry, for initiating the 0B read and handling completion. But it's perhaps a little less painful than doing a SEND/RECV exchange. The read wr could be unsignaled so that it won't generate a CQE. But it still consumes an SQ WR slot so the SQ would have to be sized to allow this extra WR. And I guess the CQ would also need to be sized accordingly in case the read failed. In the former, Send is used but this requires a buffer to be posted to CQ. But since the same CQ (or SharedCQ) can be used by other connections at the same time it can cause the responder CM posted buffer to be consumed by another connection. This is not acceptable. 
So now we consider an extension to the MPA protocol. The goal is to be completely backwards compatible with the existing version 1. In a nutshell, use a flag in the MPA request message which indicates that a ready to receive message will be sent
Re: [ofa-general] Re: iWARP peer-to-peer CM proposal
Kanevsky, Arkady wrote: Very good points. Thanks Steve. If we can do unsignalled 0-size RDMA Read with bogus S-tag this may work better. Yes, it will require that IRD not be set to 0 at the Responder. Ditto ORD of at least 1 on Responder. There is no need to have extra CQ entry on either side for it. It is only needed for error path. So this will only be needed if Sender posted the full queue of sends. But it can not post anything because CM will not let it know that connection is established. Well, actually, I think the ULP _can_ post before establishing the connection. But I guess we can define the semantics such that applications using the rdma-cm interface must adhere to whatever we need to make this hack work. Q: are there apps using the rdma-cm out there today that pre-post SQ WRs before getting an ESTABLISHED event? Steve. Happy Thanksgiving, Arkady Kanevsky email: [EMAIL PROTECTED] Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 -Original Message- From: Steve Wise [mailto:[EMAIL PROTECTED] Sent: Wednesday, November 21, 2007 1:07 PM To: Kanevsky, Arkady Cc: Glenn Grundstrom; Leonid Grossman; [EMAIL PROTECTED] Subject: [ofa-general] Re: iWARP peer-to-peer CM proposal Comments in-line below... Kanevsky, Arkady wrote: Group, below is a proposal on how to resolve the peer-to-peer iWARP CM issue discovered at the interop event. The main issue is that the MPA spec (relevant portion of IETF RFC 5044 is below) requires that the connection initiator send the first message over the established connection. Multiple MPI implementations and several other apps use a peer-to-peer model. So rather than forcing all of them to do it on their own, which will not help with interop between different implementations, the goal is to extend lower layers to provide it. Our first idea was to leave the MPA protocol untouched and try to solve this problem in iw_cm. But there are too many complications to it. 
First, in order to adhere to RFC5044 the initiator must send the first FPDU and the responder must process it. But since the connection is already established, processing the FPDU involves the ULP on whose behalf the connection is created. So either the initiator sends a message which generates a completion on the responder CQ, thus visible to the ULP, or not. In the latter case, the only op which can do it is an RDMA one, which means that the responder somehow provided the initiator an S-tag which it can use. So, this is an extension to MPA, probably using private data. And the responder, upon receiving it, destroys this S-tag. In any case this is an extension of MPA. This stag exchange isn't needed if this RDMA op is a 0B READ. The responder waits for that 0B read and only indicates the rdma connection is established to its ULP when it replies to the 0B read. In this scenario, the responder/server side doesn't consume any CQ resources. But it would require an IRD of at least 1 to be configured on the QP. The initiator still requires an SQ entry, and possibly a CQ entry, for initiating the 0B read and handling completion. But it's perhaps a little less painful than doing a SEND/RECV exchange. The read wr could be unsignaled so that it won't generate a CQE. But it still consumes an SQ WR slot so the SQ would have to be sized to allow this extra WR. And I guess the CQ would also need to be sized accordingly in case the read failed. In the former, Send is used but this requires a buffer to be posted to CQ. But since the same CQ (or SharedCQ) can be used by other connections at the same time it can cause the responder CM posted buffer to be consumed by another connection. This is not acceptable. So now we consider an extension to the MPA protocol. The goal is to be completely backwards compatible with the existing version 1. In a nutshell, use a flag in the MPA request message which indicates that a ready to receive message will be sent by the requestor upon receiving the MPA response message with connection acceptance. 
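Editorial sketch (not part of the original thread): the resource cost of the 0B read that Steve describes above can be written down as a small sizing helper. `QpSizing` and `size_for_0b_read` are hypothetical names. The extra SQ slot, the extra CQ entry for the failure path and the IRD of at least 1 on the responder come from the message; the ORD of at least 1 on the initiator is an added assumption based on how RDMA Reads are accounted on the requesting side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QpSizing:
    sq_depth: int   # send queue work-request slots
    cq_depth: int   # completion queue entries
    ird: int        # inbound RDMA Read depth
    ord_: int       # outbound RDMA Read depth


def size_for_0b_read(ulp, is_initiator):
    """Grow the ULP's requested sizes to leave room for the CM's 0B read."""
    if is_initiator:
        # One SQ slot for the read, one CQ entry in case it completes in
        # error, and at least one outbound read credit (assumption).
        return QpSizing(ulp.sq_depth + 1, ulp.cq_depth + 1,
                        ulp.ird, max(ulp.ord_, 1))
    # Responder: no CQ cost, but it must be able to accept one inbound read.
    return QpSizing(ulp.sq_depth, ulp.cq_depth, max(ulp.ird, 1), ulp.ord_)


requested = QpSizing(sq_depth=16, cq_depth=32, ird=0, ord_=0)
active = size_for_0b_read(requested, is_initiator=True)
passive = size_for_0b_read(requested, is_initiator=False)
```

With the request above, the active side ends up with 17 SQ slots, 33 CQ entries and an ORD of 1, while the passive side only has its IRD raised from 0 to 1.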
here are the changes to IETF RFC5044 1.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
0  |                                                               |
   +          Key (16 bytes containing MPA ID Req Frame)           +
4  |       (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)       |
   +           Or (16 bytes containing MPA ID Rep Frame)           +
8  |       (4D 50 41 20 49 44 20 52 65 70 20 46 72 61
RE: [ofa-general] Re: iWARP peer-to-peer CM proposal
Very good points. Thanks Steve. If we can do unsignalled 0-size RDMA Read with bogus S-tag this may work better. Yes, it will require that IRD not be set to 0 at the Responder. Ditto ORD of at least 1 on Responder. There is no need to have extra CQ entry on either side for it. It is only needed for error path. So this will only be needed if Sender posted the full queue of sends. But it can not post anything because CM will not let it know that connection is established. Happy Thanksgiving, Arkady Kanevsky email: [EMAIL PROTECTED] Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 -Original Message- From: Steve Wise [mailto:[EMAIL PROTECTED] Sent: Wednesday, November 21, 2007 1:07 PM To: Kanevsky, Arkady Cc: Glenn Grundstrom; Leonid Grossman; [EMAIL PROTECTED] Subject: [ofa-general] Re: iWARP peer-to-peer CM proposal Comments in-line below... Kanevsky, Arkady wrote: Group, below is a proposal on how to resolve the peer-to-peer iWARP CM issue discovered at the interop event. The main issue is that the MPA spec (relevant portion of IETF RFC 5044 is below) requires that the connection initiator send the first message over the established connection. Multiple MPI implementations and several other apps use a peer-to-peer model. So rather than forcing all of them to do it on their own, which will not help with interop between different implementations, the goal is to extend lower layers to provide it. Our first idea was to leave the MPA protocol untouched and try to solve this problem in iw_cm. But there are too many complications to it. First, in order to adhere to RFC5044 the initiator must send the first FPDU and the responder must process it. But since the connection is already established, processing the FPDU involves the ULP on whose behalf the connection is created. So either the initiator sends a message which generates a completion on the responder CQ, thus visible to the ULP, or not. 
In the latter case, the only op which can do it is an RDMA one, which means that the responder somehow provided the initiator an S-tag which it can use. So, this is an extension to MPA, probably using private data. And the responder, upon receiving it, destroys this S-tag. In any case this is an extension of MPA. This stag exchange isn't needed if this RDMA op is a 0B READ. The responder waits for that 0B read and only indicates the rdma connection is established to its ULP when it replies to the 0B read. In this scenario, the responder/server side doesn't consume any CQ resources. But it would require an IRD of at least 1 to be configured on the QP. The initiator still requires an SQ entry, and possibly a CQ entry, for initiating the 0B read and handling completion. But it's perhaps a little less painful than doing a SEND/RECV exchange. The read wr could be unsignaled so that it won't generate a CQE. But it still consumes an SQ WR slot so the SQ would have to be sized to allow this extra WR. And I guess the CQ would also need to be sized accordingly in case the read failed. In the former, Send is used but this requires a buffer to be posted to CQ. But since the same CQ (or SharedCQ) can be used by other connections at the same time it can cause the responder CM posted buffer to be consumed by another connection. This is not acceptable. So now we consider an extension to the MPA protocol. The goal is to be completely backwards compatible with the existing version 1. In a nutshell, use a flag in the MPA request message which indicates that a ready to receive message will be sent by the requestor upon receiving the MPA response message with connection acceptance. here are the changes to IETF RFC5044 1. 
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
0  |                                                               |
   +          Key (16 bytes containing MPA ID Req Frame)           +
4  |       (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)       |
   +           Or (16 bytes containing MPA ID Rep Frame)           +
8  |       (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65)       |
   +           Or (16 bytes containing MPA ID Rtr Frame)           +
12 |       (4D 50 41 20 49 44 20 52 74 52 20 46 72 61 6D 65)       |
   +
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
16 |M|C|R|S|  Res  |      Rev      |          PD_Length            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   ~                                                               ~
   ~                   Private Data