RE: [ofa-general] Re: iWARP peer-to-peer CM proposal

2007-11-28 Thread Kanevsky, Arkady
Agree with initiator/client sending signalled 0B RDMA Read.
This will handle client side.

Still not 100% clear on passive/server side.
Two issues bother me.
1. Is a bogus S-tag allowed for incoming RDMA ops?
I do not recall RDDP requiring that the length be checked before
the S-tag.

2. How does the verbs layer on the server side know that an RDMA Read op
arrived and completed? Is it some back door to vendor FW?
Will this be triggered for all incoming RDMA Read ops?

Arkady Kanevsky   email: [EMAIL PROTECTED]
Network Appliance Inc.   phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195
Waltham, MA 02451   central phone: 781-768-5300
 

 -Original Message-
 From: Steve Wise [mailto:[EMAIL PROTECTED] 
 Sent: Tuesday, November 27, 2007 7:48 PM
 To: Caitlin Bestler
 Cc: Kanevsky, Arkady; Glenn Grundstrom; Leonid Grossman; 
 [EMAIL PROTECTED]
 Subject: Re: [ofa-general] Re: iWARP peer-to-peer CM proposal
 
 Caitlin Bestler wrote:
  On Nov 27, 2007 3:58 PM, Steve Wise 
 [EMAIL PROTECTED] wrote:
  
  For the short term, I claim we just implement this as part 
 of linux 
  iwarp connection setup (mandating a 0B read be sent from 
 the active 
  side).  Your proposal to add meta-data to the private data 
 requires a 
  standards change anyway and is, IMO, the 2nd phase of this whole 
  enchilada...
 
  Steve.
 
  
  I don't see how you can have any solution here that does 
 not require meta-data.
  For non-peer-to-peer connections neither a zero length RDMA Read or 
  Write should be sent. An extraneous RDMA Read is 
 particularly onerous 
  for a short lived connection that fits the classic active/passive 
  model. So *something* is telling the CMA layer that this 
 connection may need an MPA unjam action.
  If that isn't meta-data, what is it?
 
 I assumed the 0B read would _always_ be sent as part of 
 establishing an iWARP connection using linux and the rdma-cm.
 
  
  Further, the RDMA Read solution is adequate whenever the RDMA Write 
  solution would have been (although at an unnecessary extra 
 cost), but 
  as near as I can determine it is not a complete solution. If the 
  passive side needs an untagged message completion then *something* 
  needs to send it. How can the CM layer (or, I suppose, the 
 ULP itself) 
  know that this untagged NOP message must be sent without meta-data?
 
 I believe at Reno we had the current rnic vendors all saying 
 a SEND or 0B read will work.  So:  If someone has current 
 iwarp HW that will _not_
   handle this problem by doing the 0B read hack, please speak up now.
 
  
  As I see it, if we want to do the minimum that is required, but be 
  certain that it is adequate, we need a per-connection setup 
 meta-data exchange.
 
 Are you going to prototype this?
 
 
 Steve.
 
 
 
___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


RE: [ofa-general] Re: iWARP peer-to-peer CM proposal

2007-11-28 Thread Kanevsky, Arkady
Any posting to the SQ prior to connection establishment will complete
immediately with the flushed status.

Arkady Kanevsky   email: [EMAIL PROTECTED]
Network Appliance Inc.   phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195
Waltham, MA 02451   central phone: 781-768-5300
 

 -Original Message-
 From: Glenn Grundstrom [mailto:[EMAIL PROTECTED] 
 Sent: Sunday, November 25, 2007 9:00 PM
 To: Steve Wise; Kanevsky, Arkady
 Cc: Leonid Grossman; [EMAIL PROTECTED]
 Subject: RE: [ofa-general] Re: iWARP peer-to-peer CM proposal
 
  
  Kanevsky, Arkady wrote:
   Very good points.
   Thanks Steve.
  
   If we can do unsignalled 0-size RDMA Read with bogus 
  S-tag this may
   work better.
   Yes, it will require IRD not to be 0 set at Responder.
   Ditto ORD of at least 1 on Responder.
   There is no need to have extra CQ entry on either side for it.
   It is only needed for error path.
   So this will only be needed if Sender posted the full queue
  of sends.
   But it can not post anything because CM will not let it know that 
   connection is established.
  
 
  Well, actually, I think the ULP _can_ post before establishing the 
  connection.  But I guess we can define the semantics such that 
  applications using the rdma-cm interface must adhere to whatever we 
  need to make this hack work.
  
  Q: are there apps using the rdma-cm out there today that 
 pre-post SQ 
  WRs before getting a ESTABLISHED event?
  
  Steve.
 
 ULPs are allowed to post prior to establishing the 
 connection, but I can't name any that operate this way.  
 Prohibiting applications that use the rdma_cm directly from 
 pre-posting is okay, but what about ULP's over other ULP's 
 (i.e. MPI over uDAPL).  How can/will this be handled?
 
 Glenn.
 
 
   Happy Thanksgiving,
  
   Arkady Kanevsky   email: [EMAIL PROTECTED]
   Network Appliance Inc.   phone: 781-768-5395
   1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195
   Waltham, MA 02451   central phone: 781-768-5300

  
 
   -Original Message-
   From: Steve Wise [mailto:[EMAIL PROTECTED]
   Sent: Wednesday, November 21, 2007 1:07 PM
   To: Kanevsky, Arkady
   Cc: Glenn Grundstrom; Leonid Grossman; [EMAIL PROTECTED]
   Subject: [ofa-general] Re: iWARP peer-to-peer CM proposal
  
   Comments in-line below...
  
  
   Kanevsky, Arkady wrote:
   
   Group,
  
  
   below is proposal on how to resolve peer-to-peer
  iWARP CM issue
   discovered at interop event.
  
  
   The main issue is that MPA spec (relevant portion of
 
   IETF RFC 5044
   
   is below) require that
  
  
   connection initiator send first message over the
 
   established connection.
   
   Multiple MPI implementations and several other apps use
 
   peer-to-peer
   
   model.
  
  
   So rather then forcing all of them to do it on their
 
   own, which will
   
   not help with
  
  
   interop between different implementations, the goal
  is to extend
   lower layers to provide it.
  
  

  
  
   Our first idea was to leave MPA protocol untouched and
 
   try to solve
   
   this problem
  
  
   in iw_cm. But there are too many complications to it. 
  First, in
   order to adhere to RFC5044
  
  
   initiator must send first FPDU and responder process
 
   it. But since
   
   the connection is already
  
  
   established processing FPDU involves ULP on whose behalf the
   connection is created.
  
  
   So either initiator sends a message which generates
 
   completion on
   
   responder CQ, thus visible
  
  
   to ULP, or not. 
 
  
   
   In the later case, the only op which can do it is
   RDMA one, which means
  
  
   that responder somehow provided initiator S-tag which
 
   it can use.
   
   So, this is an extension
  
  
   to MPA, probably using private data. And that responder upon
   receiving it destroy this S-tag.
  
  
   In any case this is an extension of MPA.
  
 
   This stag exchange isn't needed if this RDMA op is a 0B READ. 
The responder waits for that 0B read and only indicates 
 the rdma 
   connection is established to its ULP when it replies to the 0B 
   read.  In this scenario, the responder/server side 
 doesn't consume 
   any CQ resources.
   But it would require an IRD of at least 1 to be configured
  on the QP. 
   The initiator still requires an SQ entry, and possibly a 
 CQ entry, 
   for initiating the 0B read and handling completion.
   But its perhaps a little less painful than doing a SEND/RECV 
   exchange.  The read wr could be unsignaled so that it won't 
   generate a CQE.  But it still consumes an SQ WR slot so the SQ 
   would have to be sized to allow this extra WR. And I

RE: [ofa-general] Re: iWARP peer-to-peer CM proposal

2007-11-28 Thread Felix Marti


 -Original Message-
 From: [EMAIL PROTECTED] [mailto:general-
 [EMAIL PROTECTED] On Behalf Of Kanevsky, Arkady
 Sent: Wednesday, November 28, 2007 5:30 AM
 To: Steve Wise; Caitlin Bestler
 Cc: Leonid Grossman; [EMAIL PROTECTED]
 Subject: RE: [ofa-general] Re: iWARP peer-to-peer CM proposal
 
 Agree with initiator/client sending signalled 0B RDMA Read.
 This will handle client side.
 
 Still not 100% clear on passive/server side.
 Two issues which bothers me.
 1. Is bogus S-tag allowed for incomming RDMA ops?
 I do not recall that RDDP requires that length is checked before
 S-tag.
 
 2. How is verb layer on server side knows that RDMA Read op
 came and was done? Is it some back door to vendor FW?
 Will this be kicked for all incoming RDMA Read ops?

As you point out, the server Verbs layer is not aware of an incoming 0B
RDMA Read (or Write for that matter). Hence some kind of magic must
happen in the adapter where we vendors will have a choice: a) just
'unjam' the SQ in the adapter (which means that the CM layer works as
today and the server can post SQ ops before the 'unjam' is received but
they won't make it to the wire) or b) send a back-door command to the CM
which can then move the state machine to established only after the
'unjam' is received.

Whatever is done, it cannot happen for all zero-length RDMA Read (or
Write for that matter). Hence the adapter must be informed that the
next zero-length operation is the 'unjam' message (which also means that the
server side could, in theory, omit sending the RDMA Read Response,
because the RDMA Read Request was really a 'unjam'... not that I would
be pushing for such an 'optimization' to avoid an extra wire message).
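
To make option (b) concrete, here is a minimal sketch of a passive-side CM that reports ESTABLISHED only once the adapter has seen the 'unjam'. All names (PassiveSideCm, expect_unjam, on_rdma_read_request) are hypothetical and exist only for illustration; this is not the iw_cm code or any vendor's firmware interface.

```python
from enum import Enum, auto


class CmState(Enum):
    MPA_REPLY_SENT = auto()   # MPA response sent, waiting for the unjam
    ESTABLISHED = auto()      # ULP has been told the connection is up


class PassiveSideCm:
    """Hypothetical model of option (b): the adapter reports the unjam to the CM."""

    def __init__(self):
        self.state = CmState.MPA_REPLY_SENT
        self.expect_unjam = True   # adapter was told the next 0B read is the unjam
        self.events = []           # events delivered to the ULP

    def on_rdma_read_request(self, length):
        # Only the first zero-length read after MPA setup is treated as the
        # unjam; every later read (zero-length or not) is ordinary traffic.
        if self.expect_unjam and length == 0:
            self.expect_unjam = False
            self.state = CmState.ESTABLISHED
            self.events.append("ESTABLISHED")
            return "unjam"
        return "normal"


cm = PassiveSideCm()
first = cm.on_rdma_read_request(0)    # treated as the unjam
second = cm.on_rdma_read_request(0)   # an ordinary zero-length read
```

The one-shot expect_unjam flag models the point that the special handling must not fire for every zero-length read.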

 
 Arkady Kanevsky   email: [EMAIL PROTECTED]
 Network Appliance Inc.   phone: 781-768-5395
 1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195
 Waltham, MA 02451   central phone: 781-768-5300
 
 
  -Original Message-
  From: Steve Wise [mailto:[EMAIL PROTECTED]
  Sent: Tuesday, November 27, 2007 7:48 PM
  To: Caitlin Bestler
  Cc: Kanevsky, Arkady; Glenn Grundstrom; Leonid Grossman;
  [EMAIL PROTECTED]
  Subject: Re: [ofa-general] Re: iWARP peer-to-peer CM proposal
 
  Caitlin Bestler wrote:
   On Nov 27, 2007 3:58 PM, Steve Wise
  [EMAIL PROTECTED] wrote:
  
   For the short term, I claim we just implement this as part
  of linux
   iwarp connection setup (mandating a 0B read be sent from
  the active
   side).  Your proposal to add meta-data to the private data
  requires a
   standards change anyway and is, IMO, the 2nd phase of this whole
   enchilada...
  
   Steve.
  
  
   I don't see how you can have any solution here that does
  not require meta-data.
   For non-peer-to-peer connections neither a zero length RDMA Read
or
   Write should be sent. An extraneous RDMA Read is
  particularly onerous
   for a short lived connection that fits the classic active/passive
   model. So *something* is telling the CMA layer that this
  connection may need an MPA unjam action.
   If that isn't meta-data, what is it?
 
  I assumed the 0B read would _always_ be sent as part of
  establishing an iWARP connection using linux and the rdma-cm.
 
  
   Further, the RDMA Read solution is adequate whenever the RDMA
Write
   solution would have been (although at an unnecessary extra
  cost), but
   as near as I can determine it is not a complete solution. If the
   passive side needs an untagged message completion then *something*
   needs to send it. How can the CM layer (or, I suppose, the
  ULP itself)
   know that this untagged NOP message must be sent without
meta-data?
 
  I believe at Reno we had the current rnic vendors all saying
  a SEND or 0B read will work.  So:  If someone has current
  iwarp HW that will _not_
handle this problem by doing the 0B read hack, please speak up
now.
 
  
   As I see it, if we want to do the minimum that is required, but be
   certain that it is adequate, we need a per-connection setup
  meta-data exchange.
 
  Are you going to prototype this?
 
 
  Steve.
 
 
 


RE: [ofa-general] Re: iWARP peer-to-peer CM proposal

2007-11-28 Thread Caitlin Bestler


 -Original Message-
 From: Steve Wise [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, November 27, 2007 4:48 PM
 To: Caitlin Bestler
 Cc: Kanevsky, Arkady; Glenn Grundstrom; Leonid Grossman; openib-
 [EMAIL PROTECTED]
 Subject: Re: [ofa-general] Re: iWARP peer-to-peer CM proposal
 
 Caitlin Bestler wrote:
  On Nov 27, 2007 3:58 PM, Steve Wise [EMAIL PROTECTED]
 wrote:
 
  For the short term, I claim we just implement this as part of linux
  iwarp connection setup (mandating a 0B read be sent from the active
  side).  Your proposal to add meta-data to the private data requires
 a
  standards change anyway and is, IMO, the 2nd phase of this whole
  enchilada...
 
  Steve.
 
 
  I don't see how you can have any solution here that does not require
 meta-data.
  For non-peer-to-peer connections neither a zero length RDMA Read or
 Write
  should be sent. An extraneous RDMA Read is particularly onerous for a
 short
  lived connection that fits the classic active/passive model. So
 *something*
  is telling the CMA layer that this connection may need an MPA unjam
 action.
  If that isn't meta-data, what is it?
 
 I assumed the 0B read would _always_ be sent as part of establishing an
 iWARP connection using linux and the rdma-cm.
 

That is an extra round-trip per connection setup, which is a significant
penalty for a short lived connection. It is trivial for HPC/peer-to-peer
applications, but would be a killer for something like HTTP over RDMA.

Doing something like this for *every* connection makes it effectively
a change to the MPA protocol. OFA is not the forum for such discussions,
the IETF is.

OFA drafting an understanding of how peer-to-peer applications use the
existing protocol, on the other hand, is quite reasonable. But it has
to be something done by peer-to-peer middleware or by the verbs layer
in response to a flag from the peer-to-peer middleware. Otherwise it
is not augmenting a protocol, it is changing it.


Re: [ofa-general] Re: iWARP peer-to-peer CM proposal

2007-11-28 Thread Steve Wise

Caitlin Bestler wrote:



-Original Message-
From: Steve Wise [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 27, 2007 4:48 PM
To: Caitlin Bestler
Cc: Kanevsky, Arkady; Glenn Grundstrom; Leonid Grossman; openib-
[EMAIL PROTECTED]
Subject: Re: [ofa-general] Re: iWARP peer-to-peer CM proposal

Caitlin Bestler wrote:

On Nov 27, 2007 3:58 PM, Steve Wise [EMAIL PROTECTED]

wrote:

For the short term, I claim we just implement this as part of linux
iwarp connection setup (mandating a 0B read be sent from the active
side).  Your proposal to add meta-data to the private data requires

a

standards change anyway and is, IMO, the 2nd phase of this whole
enchilada...

Steve.


I don't see how you can have any solution here that does not require

meta-data.

For non-peer-to-peer connections neither a zero length RDMA Read or

Write

should be sent. An extraneous RDMA Read is particularly onerous for a

short

lived connection that fits the classic active/passive model. So

*something*

is telling the CMA layer that this connection may need an MPA unjam

action.

If that isn't meta-data, what is it?

I assumed the 0B read would _always_ be sent as part of establishing an
iWARP connection using linux and the rdma-cm.



That is an extra round-trip per connection setup, which is a significant
penalty for a short lived connection. It is trivial for HPC/peer-to-peer
applications, but would be a killer for something like HTTP over RDMA.

Doing something like this for *every* connection makes it effectively
a change to the MPA protocol. OFA is not the forum for such discussions,
the IETF is.

OFA drafting an understanding of how peer-to-peer applications use the
existing protocol, on the other hand, is quite reasonable. But it has
to be something done by peer-to-peer middleware or by the verbs layer
in response to a flag from the peer-to-peer middleware. Otherwise it
is not augmenting a protocol, it is changing it.



Posting a 0B read after the MPA setup isn't changing the MPA protocol.
It's adding a protocol on top of the MPA setup in order to meet the
requirements of the MPA protocol.  Whether you add a private-data
request for this or _assume_ the 0B read will happen doesn't change this.






RE: [ofa-general] Re: iWARP peer-to-peer CM proposal

2007-11-28 Thread Tom Tucker

On Wed, 2007-11-28 at 11:43 -0500, Caitlin Bestler wrote:
 
  -Original Message-
  From: Steve Wise [mailto:[EMAIL PROTECTED]
  Sent: Tuesday, November 27, 2007 4:48 PM
  To: Caitlin Bestler
  Cc: Kanevsky, Arkady; Glenn Grundstrom; Leonid Grossman; openib-
  [EMAIL PROTECTED]
  Subject: Re: [ofa-general] Re: iWARP peer-to-peer CM proposal
  
  Caitlin Bestler wrote:
   On Nov 27, 2007 3:58 PM, Steve Wise [EMAIL PROTECTED]
  wrote:
  
   For the short term, I claim we just implement this as part of linux
   iwarp connection setup (mandating a 0B read be sent from the active
   side).  Your proposal to add meta-data to the private data requires
  a
   standards change anyway and is, IMO, the 2nd phase of this whole
   enchilada...
  
   Steve.
  
  
   I don't see how you can have any solution here that does not require
  meta-data.
   For non-peer-to-peer connections neither a zero length RDMA Read or
  Write
   should be sent. An extraneous RDMA Read is particularly onerous for a
  short
   lived connection that fits the classic active/passive model. So
  *something*
   is telling the CMA layer that this connection may need an MPA unjam
  action.
   If that isn't meta-data, what is it?
  
  I assumed the 0B read would _always_ be sent as part of establishing an
  iWARP connection using linux and the rdma-cm.
  
 
 That is an extra round-trip per connection setup, which is a significant
 penalty for a short lived connection. It is trivial for HPC/peer-to-peer
 applications, but would be a killer for something like HTTP over RDMA.
 

I find it hard to get excited about optimizing short lived connections
for RDMA. I simply don't think it's an interesting use case. And btw,
HTTP long ago got rid of short lived connections because it's painful
even on TCP.

 Doing something like this for *every* connection makes it effectively
 a change to the MPA protocol.

Uh. No, it doesn't. Normalizing the behavior of applications during
connection setup doesn't change the underlying protocol. It adds another
one on top.

  OFA is not the forum for such discussions,
 the IETF is.

My living room, the dinner table, the local bar and this mailing list
are perfectly acceptable forums for discussing a protocol. The IETF is
the forum for standardizing one. Right now, I don't think we're ready to
standardize, because we're still exploring the options; the first of
which is NOT changing MPA.

This group has the unique benefit of actually USING and IMPLEMENTING the
protocol and therefore has some beneficial insights that may and should
be shared. All that said, revving the MPA protocol is way down the road.

 
 OFA drafting an understanding of how peer-to-peer applications use the
 existing protocol, on the other hand, is quite reasonable. 

That's step 1 and the 0B READ is one way to do it.

 But it has
 to be something done by peer-to-peer middleware or by the verbs layer
 in response to a flag from the peer-to-peer middleware. Otherwise it
 is not augmenting a protocol, it is changing it.
 

The flag may be useful, however, I don't see the connection between the
flag and complying with the MPA protocol.



Re: [ofa-general] Re: iWARP peer-to-peer CM proposal

2007-11-28 Thread Steve Wise

Kanevsky, Arkady wrote:

ULP can post recvs before connection is established but not to send
queue
prior to connection establishment.



I hate quoting specs (and the RDMAC verbs spec isn't really any
standard), but page 25 of draft-hilland-iwarp-verbs-v1.0 indicates it's
OK to post SQ WRs when in Idle:



The QP MUST be in the Idle state following QP creation or when moved to 
this state with Modify QP. In this state, Send or Receive WRs MAY be 
posted but they MUST NOT be processed and CQEs MUST NOT be generated.
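
As a rough illustration of the quoted rule, the toy model below accepts send WRs while the QP is Idle but neither processes them nor generates CQEs until the QP leaves Idle. The QueuePair class, its "IDLE"/"RTS" state names, and its methods are assumptions made for this sketch and do not correspond to any real verbs API.

```python
class QueuePair:
    """Toy QP: WRs may be posted in Idle but are not processed until RTS."""

    def __init__(self, sq_depth):
        self.state = "IDLE"
        self.sq_depth = sq_depth
        self.send_queue = []   # posted but not yet processed WRs
        self.cq = []           # completions visible to the consumer

    def post_send(self, wr):
        # Posting is allowed in any state, but each WR consumes an SQ slot.
        if len(self.send_queue) >= self.sq_depth:
            raise RuntimeError("send queue full")
        self.send_queue.append(wr)
        self._process()

    def modify_to_rts(self):
        self.state = "RTS"
        self._process()

    def _process(self):
        # In Idle nothing is processed and no CQE is generated.
        if self.state != "RTS":
            return
        while self.send_queue:
            self.cq.append(("done", self.send_queue.pop(0)))


qp = QueuePair(sq_depth=2)
qp.post_send("send-1")          # accepted while Idle
cq_while_idle = list(qp.cq)     # snapshot: still empty
qp.modify_to_rts()              # now the queued WR is processed
```

The sq_depth check also shows why a connection-setup 0B read would need its own SQ slot: a ULP that pre-posts a full queue leaves no room for it.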




Re: [ofa-general] Re: iWARP peer-to-peer CM proposal

2007-11-28 Thread Steve Wise

Kanevsky, Arkady wrote:

Agree with initiator/client sending signalled 0B RDMA Read.
This will handle client side.

Still not 100% clear on passive/server side.
Two issues which bothers me.
1. Is bogus S-tag allowed for incomming RDMA ops?


The stag/to must not be validated if the incoming read is 0B length.

http://www.ietf.org/rfc/rfc5040.txt:


*  If the Data Source receives an RDMA Read Request Header with the
  RDMA Read Message Size set to zero, the Data Source RDMAP:

  *  MUST NOT validate the Data Source STag and Data Source Tagged
 Offset contained in the RDMA Read Request Header, and

  *  MUST respond with a zero-length RDMA Read Response Message.
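
A small sketch of what that rule means for a responder: the zero-length case is handled before any STag lookup. The function name, the valid_stags dictionary, and the "terminate" outcome for a failed check are simplifying assumptions for illustration only, not text from the RFC.

```python
def handle_read_request(length, stag, offset, valid_stags):
    """Return (action, payload_length) for an incoming RDMA Read Request.

    valid_stags is a hypothetical dict mapping STag -> region length.
    """
    if length == 0:
        # Zero-length read: the STag and tagged offset are not validated,
        # and a zero-length RDMA Read Response is sent.
        return ("read_response", 0)
    region_len = valid_stags.get(stag)
    if region_len is None or offset + length > region_len:
        # Simplified error path for an invalid STag or out-of-bounds access.
        return ("terminate", 0)
    return ("read_response", length)


bogus_zero = handle_read_request(0, 0xDEAD, 0, {})       # bogus STag, 0B
bogus_nonzero = handle_read_request(8, 0xDEAD, 0, {})    # bogus STag, 8B
valid_read = handle_read_request(8, 1, 0, {1: 16})       # valid STag, 8B
```

Under this reading, a 0B read with a bogus S-tag is answered normally, while the same S-tag on a non-zero read fails validation.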





RE: [ofa-general] Re: iWARP peer-to-peer CM proposal

2007-11-28 Thread Kanevsky, Arkady
Another small discrepancy between IB and iWARP.
Since RDMA_CM is used by ULPs which are transport
independent, they will follow the stricter rule.
That is IB. For IB, any posting to the SQ prior to the QP
being in RTS state shall be flushed.

This semantic is actually very useful for ULPs which
use unsignalled completions. Once you see
the completion for a request you posted after connection
failure, you are sure that all previously posted requests on the
same SQ have completed and that you have seen them all.

So while you are correct on the spec, since we are working
in IW_CM we can assume IB semantics on posting.

Thanks,

Arkady Kanevsky   email: [EMAIL PROTECTED]
Network Appliance Inc.   phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195
Waltham, MA 02451   central phone: 781-768-5300
 

 -Original Message-
 From: Steve Wise [mailto:[EMAIL PROTECTED] 
 Sent: Wednesday, November 28, 2007 1:52 PM
 To: Kanevsky, Arkady
 Cc: Glenn Grundstrom; Leonid Grossman; [EMAIL PROTECTED]
 Subject: Re: [ofa-general] Re: iWARP peer-to-peer CM proposal
 
 Kanevsky, Arkady wrote:
  ULP can post recvs before connection is established but not to send 
  queue prior to connection establishment.
  
 
 I hate quoting specs (and the RDMAC verbs spec isn't really any 
 standard), but, page 25 of draft-hilland-iwarp-verbs-v1.0 
 indicates its 
 ok to post SQ WRs when in idle:
 
 
 The QP MUST be in the Idle state following QP creation or 
 when moved to 
 this state with Modify QP. In this state, Send or Receive WRs MAY be 
 posted but they MUST NOT be processed and CQEs MUST NOT be generated.
 
 


Re: [ofa-general] Re: iWARP peer-to-peer CM proposal

2007-11-28 Thread Or Gerlitz
On 11/29/07, Kanevsky, Arkady [EMAIL PROTECTED] wrote:
 So while, you are correct on the spec since we are working
 in IW_CM we can assume IB semantic on posting.

please spend a minute on
http://www.zip.com.au/~akpm/linux/patches/stuff/top-posting.txt

Or.


RE: [ofa-general] Re: iWARP peer-to-peer CM proposal

2007-11-27 Thread Kanevsky, Arkady
A ULP can post recvs before the connection is established, but it cannot
post to the send queue prior to connection establishment.

Arkady Kanevsky   email: [EMAIL PROTECTED]
Network Appliance Inc.   phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195
Waltham, MA 02451   central phone: 781-768-5300
 

 -Original Message-
 From: Glenn Grundstrom [mailto:[EMAIL PROTECTED] 
 Sent: Sunday, November 25, 2007 9:00 PM
 To: Steve Wise; Kanevsky, Arkady
 Cc: Leonid Grossman; [EMAIL PROTECTED]
 Subject: RE: [ofa-general] Re: iWARP peer-to-peer CM proposal
 
  
  Kanevsky, Arkady wrote:
   Very good points.
   Thanks Steve.
  
   If we can do unsignalled 0-size RDMA Read with bogus 
  S-tag this may
   work better.
   Yes, it will require IRD not to be 0 set at Responder.
   Ditto ORD of at least 1 on Responder.
   There is no need to have extra CQ entry on either side for it.
   It is only needed for error path.
   So this will only be needed if Sender posted the full queue
  of sends.
   But it can not post anything because CM will not let it know that 
   connection is established.
  
 
  Well, actually, I think the ULP _can_ post before establishing the 
  connection.  But I guess we can define the semantics such that 
  applications using the rdma-cm interface must adhere to whatever we 
  need to make this hack work.
  
  Q: are there apps using the rdma-cm out there today that 
 pre-post SQ 
  WRs before getting a ESTABLISHED event?
  
  Steve.
 
 ULPs are allowed to post prior to establishing the 
 connection, but I can't name any that operate this way.  
 Prohibiting applications that use the rdma_cm directly from 
 pre-posting is okay, but what about ULP's over other ULP's 
 (i.e. MPI over uDAPL).  How can/will this be handled?
 
 Glenn.
 
 
   Happy Thanksgiving,
  
   Arkady Kanevsky   email: [EMAIL PROTECTED]
   Network Appliance Inc.   phone: 781-768-5395
   1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195
   Waltham, MA 02451   central phone: 781-768-5300

  
 
   -Original Message-
   From: Steve Wise [mailto:[EMAIL PROTECTED]
   Sent: Wednesday, November 21, 2007 1:07 PM
   To: Kanevsky, Arkady
   Cc: Glenn Grundstrom; Leonid Grossman; [EMAIL PROTECTED]
   Subject: [ofa-general] Re: iWARP peer-to-peer CM proposal
  
   Comments in-line below...
  
  
   Kanevsky, Arkady wrote:
   
   Group,
  
  
   below is proposal on how to resolve peer-to-peer
  iWARP CM issue
   discovered at interop event.
  
  
   The main issue is that MPA spec (relevant portion of
 
   IETF RFC 5044
   
   is below) require that
  
  
   connection initiator send first message over the
 
   established connection.
   
   Multiple MPI implementations and several other apps use
 
   peer-to-peer
   
   model.
  
  
   So rather then forcing all of them to do it on their
 
   own, which will
   
   not help with
  
  
   interop between different implementations, the goal
  is to extend
   lower layers to provide it.
  
  

  
  
   Our first idea was to leave MPA protocol untouched and
 
   try to solve
   
   this problem
  
  
   in iw_cm. But there are too many complications to it. 
  First, in
   order to adhere to RFC5044
  
  
   initiator must send first FPDU and responder process
 
   it. But since
   
   the connection is already
  
  
   established processing FPDU involves ULP on whose behalf the
   connection is created.
  
  
   So either initiator sends a message which generates
 
   completion on
   
   responder CQ, thus visible
  
  
   to ULP, or not. 
 
  
   
   In the later case, the only op which can do it is
   RDMA one, which means
  
  
   that responder somehow provided initiator S-tag which
 
   it can use.
   
   So, this is an extension
  
  
   to MPA, probably using private data. And that responder upon
   receiving it destroy this S-tag.
  
  
   In any case this is an extension of MPA.
  
 
   This stag exchange isn't needed if this RDMA op is a 0B READ. 
The responder waits for that 0B read and only indicates 
 the rdma 
   connection is established to its ULP when it replies to the 0B 
   read.  In this scenario, the responder/server side 
 doesn't consume 
   any CQ resources.
   But it would require an IRD of at least 1 to be configured
  on the QP. 
   The initiator still requires an SQ entry, and possibly a 
 CQ entry, 
   for initiating the 0B read and handling completion.
   But its perhaps a little less painful than doing a SEND/RECV 
   exchange.  The read wr could be unsignaled so that it won't 
   generate a CQE.  But it still consumes an SQ WR slot so the SQ 
   would have to be sized to allow this extra WR

Re: [ofa-general] Re: iWARP peer-to-peer CM proposal

2007-11-27 Thread Caitlin Bestler
On Nov 27, 2007 6:54 AM, Kanevsky, Arkady [EMAIL PROTECTED] wrote:
 ULP can post recvs before connection is established but not to send
 queue
 prior to connection establishment.



ULP can post sends only after it is notified that the connection is established.

The issue is when the iWARP layer can issue this notification.

If the MPA layer implements fencing on its own, then the notification can
be provided immediately after the MPA Request/Response exchange.

If not, it must wait for the first MPA frame. The problem is that
implementations that adhere too closely to the RDMAC verbs can obtain
no information about the connection unless there is a CQE-producing event.


Re: [ofa-general] Re: iWARP peer-to-peer CM proposal

2007-11-27 Thread Steve Wise

Caitlin Bestler wrote:

On Nov 27, 2007 6:54 AM, Kanevsky, Arkady [EMAIL PROTECTED] wrote:

ULP can post recvs before connection is established but not to send
queue
prior to connection establishment.




ULP can post sends only after it is notified that the connection is established.

The issue is when the iWARP layer can issue this notification.

If the MPA layer implements fencing on its own, then the notification can
be provided immediately after the MPA Request/Response exchange.

If not, it must wait for the first MPA frame. The problem is that
implementations
that adhere to closely to the RDMAC verbs can obtain no information about
the connection unless there is a CQE producing event.


The idea for this hack is that the passive side (the side that sends 
the MPA response) will hold off posting the ESTABLISHED event to the 
rdma-cm ULP until after it receives this 0B Read Request from the client...






Re: [ofa-general] Re: iWARP peer-to-peer CM proposal

2007-11-27 Thread Sean Hefty
The idea for this hack is that the passive side (the side that sends 
the MPA response) will hold off posting the ESTABLISHED event to the 
rdma-cm ULP until after it receives this 0B Read Request from the client...


What is notifying the passive side that the active side has completed a 
read request, and that it's okay to start sending?


Also, at least with IB, a QP can be configured on creation to always 
generate a CQ entry for all WRs posted to the send queue.  I don't know 
if iWarp follows this same model.


- Sean


Re: [ofa-general] Re: iWARP peer-to-peer CM proposal

2007-11-27 Thread Steve Wise

Sean Hefty wrote:
The idea for this hack is that the passive side (the side that sends 
the MPA response) will hold off posting the ESTABLISHED event to the 
rdma-cm ULP until after it receives this 0B Read Request from the 
client...


What is notifying the passive side that the active side has completed a 
read request, and that it's okay to start sending?




The iwarp provider driver will only post the IW_CM_ESTABLISHED event 
after receiving the read request.  For the Chelsio provider, this will 
require changes to the rnic firmware and the driver/library to support 
all this.


I haven't thought through exactly how this should be implemented.  For 
instance, the provider library poll function needs to deal with this 0B 
read completion, note that it is the special connection-setup 0B 
read, and thus hide the completion from the user calling poll()...



Also, at least with IB, a QP can be configured on creation to always 
generate a CQ entry for all WRs posted to the send queue.  I don't know 
if iWarp follows this same model.


After thinking about this more, I think we do want to make this 0B read 
signaled.  Then we can post the IW_CM_ESTABLISHED event on the client 
side when the read request completes.


So from the RDMA application's perspective, the connection never gets 
set up until this 0B read is completed, and that's really what we want...
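
One way to picture the client-side handling discussed in this message is a poll routine that filters out the setup read. This is a rough sketch under stated assumptions: `SETUP_READ_WR_ID`, `ActiveEndpoint`, and the `CONNECT_ERROR` event name are hypothetical, and the deque stands in for whatever the RNIC actually reports.

```python
from collections import deque

# Hypothetical sentinel identifying the connection-setup 0B read.
SETUP_READ_WR_ID = object()


class ActiveEndpoint:
    """Hypothetical model of the provider library on the client side."""

    def __init__(self):
        self.hw_cq = deque()   # completions as produced by the simulated RNIC
        self.cm_events = []    # events delivered to the rdma-cm ULP

    def hw_complete(self, wr_id, status="success"):
        self.hw_cq.append((wr_id, status))

    def poll(self, max_entries):
        """Return user-visible completions, hiding the setup read."""
        out = []
        while self.hw_cq and len(out) < max_entries:
            wr_id, status = self.hw_cq.popleft()
            if wr_id is SETUP_READ_WR_ID:
                # Signaled setup read finished: raise the CM event instead
                # of handing a completion to the application.
                event = "ESTABLISHED" if status == "success" else "CONNECT_ERROR"
                self.cm_events.append(event)
                continue
            out.append((wr_id, status))
        return out
```

The application never sees a completion it did not post, while the connection manager still learns when the read finished.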


Steve.


Re: [ofa-general] Re: iWARP peer-to-peer CM proposal

2007-11-27 Thread Steve Wise

Caitlin Bestler wrote:

On Nov 27, 2007 3:13 PM, Steve Wise [EMAIL PROTECTED] wrote:

Caitlin Bestler wrote:

On Nov 27, 2007 6:54 AM, Kanevsky, Arkady [EMAIL PROTECTED] wrote:

ULP can post recvs before connection is established but not to send
queue prior to connection establishment.



ULP can post sends only after it is notified that the connection is established.

The issue is when the iWARP layer can issue this notification.

If the MPA layer implements fencing on its own, then the notification can
be provided immediately after the MPA Request/Response exchange.

If not, it must wait for the first MPA frame. The problem is that
implementations that adhere too closely to the RDMAC verbs can obtain
no information about the connection unless there is a CQE-producing event.

The idea for this hack is that the passive side (the side that sends
the MPA response) will hold off posting the ESTABLISHED event to the
rdma-cm ULP until after it receives this 0B Read Request from the client...



The problem is that this solution is being applied at the wrong layer.

MPA is not the source of the problem, but rather the RDMAC layer verbs.
The solution needs to be a verb-layer solution, not an MPA layer solution.



This isn't being solved at the MPA layer.  It is being solved as a protocol 
exchange done after the MPA exchanges (and after the connections are 
transitioned into FPDU mode).  Remember: This is a _hack_ to get our 
current generation of rnics to support peer-to-peer _without_ impacting 
the rdma applications (like IMPI and OMPI).



Steve's last comment states the problem well: we are trying to enable the
Verbs layer on the Passive side to generate the Established event, and
if at all possible to do so in a way that places no requirements on the
application layer.

I believe it is possible to do so without making any modifications to MPA.



Yes.


The MPA protocol requirement is a safeguard against receiving an MPA
Frame before the MPA Response frame. MPA does not have or need an
RTR message, because the MPA RFC allows *any* MPA frame from the
active side to effectively acknowledge receipt of the MPA Response.



Yes, but it puts the onus on the ULP to deal with this.  In our current 
implementation model, that ULP is the top end application.



That includes a zero-length RDMA Write.

An iWARP implementation can (perhaps SHOULD) implement an MPA
Fenced state on the passive side that is cleared on receipt of any MPA
frame. With such a MPA Fence feature, the CM layer can generate the
Connection Established event as soon as it sends the MPA Response
and the Passive-side ULP will be able to post to the SQ; the messages
just won't go on the wire until something is received.

Meanwhile the Active Side must ensure that *some* MPA frame is sent
immediately after the MPA Response is received. If it has traffic ready to
go it can simply send that. If it does not, it can use a zero-length write.
A zero-length write is totally transparent to the ULP at both ends.

But that will only work for *some* implementations. On others a zero
length RDMA Read is needed to unjam things. That's almost transparent,
but not totally so since it temporarily uses an RDMA Read credit.
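
The MPA Fenced state suggested above can be modelled in a few lines. This is only a sketch of the idea, with hypothetical names; a real implementation would live in the RNIC or its driver.

```python
class FencedSendQueue:
    """Hypothetical passive-side send queue with an MPA fence."""

    def __init__(self):
        self.fenced = True   # set when the MPA Response is sent
        self.pending = []    # posted by the ULP, held back by the fence
        self.wire = []       # messages actually transmitted

    def post_send(self, msg):
        # The ULP may post right after the Connection Established event.
        if self.fenced:
            self.pending.append(msg)
        else:
            self.wire.append(msg)

    def on_mpa_frame_received(self):
        # Any MPA frame from the active side clears the fence,
        # including a zero-length RDMA Write.
        if self.fenced:
            self.fenced = False
            self.wire.extend(self.pending)
            self.pending.clear()
```

Posting order is preserved: everything queued behind the fence goes out before anything posted afterwards.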



Right.  Chelsio needs a Read vs a Write because the FW and driver don't 
detect the incoming 0B write so they cannot drive the ESTABLISHED event 
on that.



And while nobody has spoken up to say *they* have that problem, I would
not be surprised if there are implementations where nothing less than a full
ULP nop message will suffice.

So keeping the fix at the verbs layer, and allowing the minimal extra
effort to be controlled by the Passive layer itself, suggests that the
Passive side simply encode its MPA-unjam-action-required in the
OFA standardized portion of the Private Data. Encodings would
include:

- Any MPA Frame, including a zero-length RDMA Write will unjam
  the passive side SendQ.
- An untagged message or a zero-length RDMA Read will work.
- Only an untagged message will work.
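
The three encodings listed above could be represented as follows. The numeric values, the names, and the idea of mapping each level to its cheapest sufficient action are assumptions for illustration; the thread does not define a wire encoding.

```python
from enum import IntEnum


class UnjamRequirement(IntEnum):
    # Hypothetical values; no encoding has been standardized.
    ANY_MPA_FRAME = 0          # a zero-length RDMA Write is enough
    UNTAGGED_OR_ZERO_READ = 1  # untagged message or zero-length RDMA Read
    UNTAGGED_ONLY = 2          # only an untagged message will work


# Cheapest action the active side can take for each requirement.
_CHEAPEST_ACTION = {
    UnjamRequirement.ANY_MPA_FRAME: "zero-length RDMA Write",
    UnjamRequirement.UNTAGGED_OR_ZERO_READ: "zero-length RDMA Read",
    UnjamRequirement.UNTAGGED_ONLY: "untagged message",
}


def choose_unjam_action(requirement):
    """Pick the least intrusive action that satisfies the passive side."""
    return _CHEAPEST_ACTION[UnjamRequirement(requirement)]
```

For example, `choose_unjam_action(1)` selects the zero-length RDMA Read, which is the case Chelsio describes in this thread.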



So you're advocating adding a standardized header to the private data to 
indicate what the passive side needs.  While we're at it, let's add in 
ORD/IRD ;-)



In the latter cases the middleware will have to play games with stand-in
receive WQEs, posting the actual receive WQEs to the QP only
after the MPA fence has been unjammed. That isn't pretty, but if your
hardware is fixed then it's either that or make the application deal with
the problem. I have a hunch that the MPI developers would not like that
option at all.

How this differs from what Arkady proposed is that it avoids making any
changes to MPA, but instead only makes use of the OFA defined portion
of the Private Data. Further it allows use of a zero-length RDMA Write
when that is sufficient to break the MPA logjam. A zero-length RDMA
Write, unlike a zero-length RDMA Read, is *totally* transparent to the ULP.


For the short term, I claim we just implement this as part of linux 
iwarp connection 

Re: [ofa-general] Re: iWARP peer-to-peer CM proposal

2007-11-27 Thread Caitlin Bestler
On Nov 27, 2007 3:58 PM, Steve Wise [EMAIL PROTECTED] wrote:


 For the short term, I claim we just implement this as part of linux
 iwarp connection setup (mandating a 0B read be sent from the active
 side).  Your proposal to add meta-data to the private data requires a
 standards change anyway and is, IMO, the 2nd phase of this whole
 enchilada...

 Steve.


I don't see how you can have any solution here that does not require meta-data.
For non-peer-to-peer connections neither a zero-length RDMA Read nor Write
should be sent. An extraneous RDMA Read is particularly onerous for a short
lived connection that fits the classic active/passive model. So *something*
is telling the CMA layer that this connection may need an MPA unjam action.
If that isn't meta-data, what is it?

Further, the RDMA Read solution is adequate whenever the RDMA Write
solution would have been (although at an unnecessary extra cost), but
as near as I can determine it is not a complete solution. If the passive
side needs an untagged message completion then *something* needs
to send it. How can the CM layer (or, I suppose, the ULP itself) know
that this untagged NOP message must be sent without meta-data?

As I see it, if we want to do the minimum that is required, but be certain
that it is adequate, we need a per-connection setup meta-data exchange.


Re: [ofa-general] Re: iWARP peer-to-peer CM proposal

2007-11-27 Thread Steve Wise

Caitlin Bestler wrote:

On Nov 27, 2007 3:58 PM, Steve Wise [EMAIL PROTECTED] wrote:


For the short term, I claim we just implement this as part of linux
iwarp connection setup (mandating a 0B read be sent from the active
side).  Your proposal to add meta-data to the private data requires a
standards change anyway and is, IMO, the 2nd phase of this whole
enchilada...

Steve.



I don't see how you can have any solution here that does not require meta-data.
For non-peer-to-peer connections neither a zero-length RDMA Read nor Write
should be sent. An extraneous RDMA Read is particularly onerous for a short
lived connection that fits the classic active/passive model. So *something*
is telling the CMA layer that this connection may need an MPA unjam action.
If that isn't meta-data, what is it?


I assumed the 0B read would _always_ be sent as part of establishing an 
iWARP connection using linux and the rdma-cm.




Further, the RDMA Read solution is adequate whenever the RDMA Write
solution would have been (although at an unnecessary extra cost), but
as near as I can determine it is not a complete solution. If the passive
side needs an untagged message completion then *something* needs
to send it. How can the CM layer (or, I suppose, the ULP itself) know
that this untagged NOP message must be sent without meta-data?


I believe at Reno we had the current rnic vendors all saying a SEND or 
0B read will work.  So:  If someone has current iwarp HW that will _not_ 
 handle this problem by doing the 0B read hack, please speak up now.




As I see it, if we want to do the minimum that is required, but be certain
that it is adequate, we need a per-connection setup meta-data exchange.


Are you going to prototype this?


Steve.





RE: [ofa-general] Re: iWARP peer-to-peer CM proposal

2007-11-25 Thread Glenn Grundstrom
 
 Kanevsky, Arkady wrote:
  Very good points.
  Thanks Steve.
 
  If we can do unsignalled 0-size RDMA Read with bogus S-tag this may
  work better.
  Yes, it will require IRD not to be 0 set at Responder.
  Ditto ORD of at least 1 on Responder.
  There is no need to have extra CQ entry on either side for it.
  It is only needed for error path.
  So this will only be needed if Sender posted the full queue of sends.
  But it can not post anything because CM will not let it know that
  connection is established.
 

 Well, actually, I think the ULP _can_ post before establishing the 
 connection.  But I guess we can define the semantics such that 
 applications using the rdma-cm interface must adhere to whatever we
 need to make this hack work.
 
 Q: are there apps using the rdma-cm out there today that pre-post SQ
 WRs before getting an ESTABLISHED event?
 
 Steve.

ULPs are allowed to post prior to establishing the connection, but I
can't name any that operate this way.  Prohibiting applications
that use the rdma_cm directly from pre-posting is okay, but what
about ULPs layered over other ULPs (e.g. MPI over uDAPL)?  How can/will
this be handled?

Glenn.


  Happy Thanksgiving,
 
  Arkady Kanevsky   email: [EMAIL PROTECTED]
  Network Appliance Inc.   phone: 781-768-5395
  1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195
  Waltham, MA 02451   central phone: 781-768-5300
   
 

  -Original Message-
  From: Steve Wise [mailto:[EMAIL PROTECTED] 
  Sent: Wednesday, November 21, 2007 1:07 PM
  To: Kanevsky, Arkady
  Cc: Glenn Grundstrom; Leonid Grossman; [EMAIL PROTECTED]
  Subject: [ofa-general] Re: iWARP peer-to-peer CM proposal
 
  Comments in-line below...
 
 
  Kanevsky, Arkady wrote:
  
  Group,
 
 
  below is a proposal on how to resolve the peer-to-peer iWARP CM issue
  discovered at the interop event.

  The main issue is that the MPA spec (relevant portion of IETF RFC 5044
  is below) requires that the connection initiator send the first message
  over the established connection. Multiple MPI implementations and
  several other apps use a peer-to-peer model. So rather than forcing all
  of them to do it on their own, which will not help with interop between
  different implementations, the goal is to extend lower layers to
  provide it.

  Our first idea was to leave the MPA protocol untouched and try to solve
  this problem in iw_cm. But there are too many complications to it.
  First, in order to adhere to RFC 5044, the initiator must send the first
  FPDU and the responder must process it. But since the connection is
  already established, processing the FPDU involves the ULP on whose
  behalf the connection is created. So either the initiator sends a
  message which generates a completion on the responder CQ, thus visible
  to the ULP, or not.

  In the latter case, the only op which can do it is an RDMA one, which
  means that the responder somehow provided the initiator an S-tag which
  it can use. So, this is an extension to MPA, probably using private
  data. And the responder destroys this S-tag upon receiving it.

  In any case this is an extension of MPA.
 

  This stag exchange isn't needed if this RDMA op is a 0B READ.
  The responder waits for that 0B read and only indicates the
  rdma connection is established to its ULP when it replies to
  the 0B read.  In this scenario, the responder/server side
  doesn't consume any CQ resources.
  But it would require an IRD of at least 1 to be configured on the QP.
  The initiator still requires an SQ entry, and possibly a CQ
  entry, for initiating the 0B read and handling completion.
  But it's perhaps a little less painful than doing a SEND/RECV
  exchange.  The read wr could be unsignaled so that it won't
  generate a CQE.  But it still consumes an SQ WR slot so the
  SQ would have to be sized to allow this extra WR. And I guess
  the CQ would also need to be sized accordingly in case the
  read failed.
 
  
  In the former, Send is used but this requires a buffer to be posted
  to CQ. But since the same CQ (or SharedCQ) can be used by other
  connections at the same time, it can cause the responder CM posted
  buffer to be consumed by another connection. This is not acceptable.

  So now we consider an extension to the MPA protocol.

  The goal is to be completely backwards compatible with the existing
  version 1. In a nutshell, use a flag in the MPA request message which
  indicates that a ready-to-receive message will be sent

Re: [ofa-general] Re: iWARP peer-to-peer CM proposal

2007-11-23 Thread Steve Wise



Kanevsky, Arkady wrote:

Very good points.
Thanks Steve.

If we can do unsignalled 0-size RDMA Read with bogus S-tag this may
work better.
Yes, it will require IRD not to be 0 set at Responder.
Ditto ORD of at least 1 on Responder.
There is no need to have extra CQ entry on either side for it.
It is only needed for error path.
So this will only be needed if Sender posted the full queue of sends.
But it can not post anything because CM will not let it know that
connection is established.

  
Well, actually, I think the ULP _can_ post before establishing the 
connection.  But I guess we can define the semantics such that 
applications using the rdma-cm interface must adhere to whatever we need 
to make this hack work.


Q: are there apps using the rdma-cm out there today that pre-post SQ WRs 
before getting an ESTABLISHED event?


Steve.

Happy Thanksgiving,

Arkady Kanevsky   email: [EMAIL PROTECTED]
Network Appliance Inc.   phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195
Waltham, MA 02451   central phone: 781-768-5300
 

  

-Original Message-
From: Steve Wise [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 21, 2007 1:07 PM

To: Kanevsky, Arkady
Cc: Glenn Grundstrom; Leonid Grossman; [EMAIL PROTECTED]
Subject: [ofa-general] Re: iWARP peer-to-peer CM proposal

Comments in-line below...


Kanevsky, Arkady wrote:


Group,


below is a proposal on how to resolve the peer-to-peer iWARP CM issue
discovered at the interop event.

The main issue is that the MPA spec (relevant portion of IETF RFC 5044
is below) requires that the connection initiator send the first message
over the established connection. Multiple MPI implementations and
several other apps use a peer-to-peer model. So rather than forcing all
of them to do it on their own, which will not help with interop between
different implementations, the goal is to extend lower layers to
provide it.

Our first idea was to leave the MPA protocol untouched and try to solve
this problem in iw_cm. But there are too many complications to it.
First, in order to adhere to RFC 5044, the initiator must send the first
FPDU and the responder must process it. But since the connection is
already established, processing the FPDU involves the ULP on whose
behalf the connection is created. So either the initiator sends a
message which generates a completion on the responder CQ, thus visible
to the ULP, or not.

In the latter case, the only op which can do it is an RDMA one, which
means that the responder somehow provided the initiator an S-tag which
it can use. So, this is an extension to MPA, probably using private
data. And the responder destroys this S-tag upon receiving it.

In any case this is an extension of MPA.

  
This stag exchange isn't needed if this RDMA op is a 0B READ. 
 The responder waits for that 0B read and only indicates the 
rdma connection is established to its ULP when it replies to 
the 0B read.  In this scenario, the responder/server side 
doesn't consume any CQ resources. 
But it would require an IRD of at least 1 to be configured on the QP. 
The initiator still requires an SQ entry, and possibly a CQ 
entry, for initiating the 0B read and handling completion.  
But it's perhaps a little less painful than doing a SEND/RECV 
exchange.  The read wr could be unsignaled so that it won't 
generate a CQE.  But it still consumes an SQ WR slot so the 
SQ would have to be sized to allow this extra WR. And I guess 
the CQ would also need to be sized accordingly in case the 
read failed.



In the former, Send is used but this requires a buffer to be posted
to CQ. But since the same CQ (or SharedCQ) can be used by other
connections at the same time, it can cause the responder CM posted
buffer to be consumed by another connection. This is not acceptable.

So now we consider an extension to the MPA protocol.

The goal is to be completely backwards compatible with the existing
version 1. In a nutshell, use a flag in the MPA request message which
indicates that a ready-to-receive message will be sent by the requestor
upon receiving the MPA response message with connection acceptance.


 



here are the changes to IETF RFC5044


 



1.
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 0 |                                                               |
   +         Key (16 bytes containing "MPA ID Req Frame")          +
 4 |      (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)        |
   +         Or  (16 bytes containing "MPA ID Rep Frame")          +
 8 |      (4D 50 41 20 49 44 20 52 65 70 20 46 72 61

RE: [ofa-general] Re: iWARP peer-to-peer CM proposal

2007-11-21 Thread Kanevsky, Arkady
Very good points.
Thanks Steve.

If we can do unsignalled 0-size RDMA Read with bogus S-tag this may
work better.
Yes, it will require IRD not to be 0 set at Responder.
Ditto ORD of at least 1 on Responder.
There is no need to have extra CQ entry on either side for it.
It is only needed for error path.
So this will only be needed if Sender posted the full queue of sends.
But it can not post anything because CM will not let it know that
connection is established.

Happy Thanksgiving,

Arkady Kanevsky   email: [EMAIL PROTECTED]
Network Appliance Inc.   phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195
Waltham, MA 02451   central phone: 781-768-5300
 

 -Original Message-
 From: Steve Wise [mailto:[EMAIL PROTECTED] 
 Sent: Wednesday, November 21, 2007 1:07 PM
 To: Kanevsky, Arkady
 Cc: Glenn Grundstrom; Leonid Grossman; [EMAIL PROTECTED]
 Subject: [ofa-general] Re: iWARP peer-to-peer CM proposal
 
 Comments in-line below...
 
 
 Kanevsky, Arkady wrote:
  
  Group,
  
  
  below is a proposal on how to resolve the peer-to-peer iWARP CM issue
  discovered at the interop event.

  The main issue is that the MPA spec (relevant portion of IETF RFC 5044
  is below) requires that the connection initiator send the first message
  over the established connection. Multiple MPI implementations and
  several other apps use a peer-to-peer model. So rather than forcing all
  of them to do it on their own, which will not help with interop between
  different implementations, the goal is to extend lower layers to
  provide it.

  Our first idea was to leave the MPA protocol untouched and try to solve
  this problem in iw_cm. But there are too many complications to it.
  First, in order to adhere to RFC 5044, the initiator must send the first
  FPDU and the responder must process it. But since the connection is
  already established, processing the FPDU involves the ULP on whose
  behalf the connection is created. So either the initiator sends a
  message which generates a completion on the responder CQ, thus visible
  to the ULP, or not.

  In the latter case, the only op which can do it is an RDMA one, which
  means that the responder somehow provided the initiator an S-tag which
  it can use. So, this is an extension to MPA, probably using private
  data. And the responder destroys this S-tag upon receiving it.

  In any case this is an extension of MPA.
  
 
 
 This stag exchange isn't needed if this RDMA op is a 0B READ. 
  The responder waits for that 0B read and only indicates the 
 rdma connection is established to its ULP when it replies to 
 the 0B read.  In this scenario, the responder/server side 
 doesn't consume any CQ resources. 
 But it would require an IRD of at least 1 to be configured on the QP. 
 The initiator still requires an SQ entry, and possibly a CQ 
 entry, for initiating the 0B read and handling completion.  
 But it's perhaps a little less painful than doing a SEND/RECV 
 exchange.  The read wr could be unsignaled so that it won't 
 generate a CQE.  But it still consumes an SQ WR slot so the 
 SQ would have to be sized to allow this extra WR. And I guess 
 the CQ would also need to be sized accordingly in case the 
 read failed.
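
The resource implications listed in this comment can be summarized in a small helper. It is a sketch only: the function names and the dictionary layout are hypothetical, and the "+1" sizing simply restates the reasoning above (one extra SQ slot for the 0B read, one extra CQ entry in case it fails, and an IRD of at least 1 on the responder).

```python
def initiator_queue_sizes(ulp_sq_depth, ulp_cq_depth):
    """Sizes the initiator needs so the setup 0B read fits."""
    return {
        # The read consumes an SQ WR slot even when it is unsignaled.
        "sq_depth": ulp_sq_depth + 1,
        # An unsignaled read still produces a CQE if it fails.
        "cq_depth": ulp_cq_depth + 1,
    }


def responder_ird(ulp_ird):
    """The responder must accept at least one inbound RDMA Read."""
    return max(ulp_ird, 1)
```

A ULP that asked for an IRD of 0 would therefore still end up with a responder QP configured for one inbound read.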
 
  
  In the former, Send is used but this requires a buffer to be posted
  to CQ. But since the same CQ (or SharedCQ) can be used by other
  connections at the same time, it can cause the responder CM posted
  buffer to be consumed by another connection. This is not acceptable.

  So now we consider an extension to the MPA protocol.

  The goal is to be completely backwards compatible with the existing
  version 1. In a nutshell, use a flag in the MPA request message which
  indicates that a ready-to-receive message will be sent by the requestor
  upon receiving the MPA response message with connection acceptance.
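
As a rough illustration of how such a flag could be tested, the sketch below treats the flags byte of the MPA request as a bit field. The M, C and R positions follow RFC 5044; placing the new flag (shown as S in the frame layout proposed in this message) in the next bit is an assumption, as are all the constant and function names.

```python
MPA_FLAG_MARKERS = 0x80  # M, per RFC 5044
MPA_FLAG_CRC = 0x40      # C, per RFC 5044
MPA_FLAG_REJECT = 0x20   # R, per RFC 5044
MPA_FLAG_RTR = 0x10      # S: hypothetical position of the proposed flag


def pack_flags(markers, crc, reject, rtr):
    """Build the flags byte from four booleans."""
    flags = 0
    if markers:
        flags |= MPA_FLAG_MARKERS
    if crc:
        flags |= MPA_FLAG_CRC
    if reject:
        flags |= MPA_FLAG_REJECT
    if rtr:
        flags |= MPA_FLAG_RTR
    return flags


def requestor_will_send_rtr(flags_byte):
    """True if the requestor announced a ready-to-receive message."""
    return bool(flags_byte & MPA_FLAG_RTR)
```

A version 1 peer that leaves the reserved bits at zero would read as "no ready-to-receive message", which is the backwards-compatible behaviour the proposal aims for.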
  
  
   
  
  
  here are the changes to IETF RFC5044
  
  
   
  
  
  1.
      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   0 |                                                               |
     +         Key (16 bytes containing "MPA ID Req Frame")          +
   4 |      (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)        |
     +         Or  (16 bytes containing "MPA ID Rep Frame")          +
   8 |      (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65)        |
     +         Or  (16 bytes containing "MPA ID Rtr Frame")          +
  12 |      (4D 50 41 20 49 44 20 52 74 52 20 46 72 61 6D 65)        |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  16 |M|C|R|S| Res   |      Rev      |          PD_Length            |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  | ~ ~ ~ Private Data