[ewg] RE: [ofa-general] Re: [PATCH] Request For Comments:

2008-05-08 Thread Sean Hefty
The requirement is mostly driven from the receiving side.  For cxgb3 it
is anyway...

Maybe you can help me understand the spec here.  If we ignore this feature for a
minute, then the side that calls rdma_connect() must instead issue the first
'send' request to the server.  Can the first 'send' be a 0B rdma write or read?
Why wouldn't the target of that request not have to transition to connected?

Is the issue that there's no way for the receiving FW/driver to know that this
has occurred so that it can signal that the connection has been established?
I.e. a client that does this must signal the server that things are ready
through some out of band means.

server sends MPA Start response with "let's do RTR and send me X", where
X could be 0B write, 0B read request or 0B send.

Are there any restrictions where a client may not be able to issue what the
server requests?  E.g. the hardware doesn't issue 0B writes.

- Sean

___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [ofa-general] Re: [PATCH] Request For Comments:

2008-05-08 Thread Steve Wise



From RFC 5044, section 7.1.2 Connection Startup Rules, Page 29:

4.  MPA Responder mode implementations MUST receive and validate at
  least one FPDU before sending any FPDUs or Markers.

  Note: This requirement is present to allow the Initiator time to
  get its receiver into Full Operation before an FPDU arrives,
  avoiding potential race conditions at the Initiator.  This
  was also subject to some debate in the work group before
  rough consensus was reached.  Eliminating this requirement
  would allow faster startup in some types of applications.
  However, that would also make certain implementations
  (particularly dual stack) much harder.




Steve Wise wrote:

Sean Hefty wrote:

The requirement is mostly driven from the receiving side.  For cxgb3 it
is anyway...



Maybe you can help me understand the spec here.  If we ignore this feature for a
minute, then the side that calls rdma_connect() must instead issue the first
'send' request to the server.  Can the first 'send' be a 0B rdma write or read?
  
According to the MPA IETF RFC, the initiator must send the first 
FPDU.  That could be anything.  The spec leaves it up to the ULP.



Why wouldn't the target of that request not have to transition to connected?

  
I don't understand this question?  What does 'transition to connected' 
mean?


The requirement is that the responder (the side that issues the 
rdma_accept in rdma-cma terms) _cannot_ send an FPDU until it first 
receives one from the initiator.   How that is enforced is an 
implementation detail.  The responder driver could hold off on the 
ESTABLISHED event until it receives the first FPDU.  Or it could stall 
SQ processing until the first FPDU is received yet still indicate that 
the connection is ESTABLISHED.
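
As a minimal sketch of the second option (stall SQ processing but still report ESTABLISHED), the toy Python model below queues outgoing FPDUs until one valid FPDU has arrived from the initiator.  The class and method names are hypothetical and do not correspond to any driver or rdma-cma API; it only illustrates the ordering rule.

```python
from collections import deque


class MpaResponder:
    """Toy model of the responder rule: hold every outgoing FPDU until
    one FPDU from the initiator has been received and validated."""

    def __init__(self):
        self.peer_in_rdma_mode = False  # flips on the first valid FPDU
        self.pending = deque()          # stalled send-queue work
        self.wire = []                  # FPDUs actually sent, in order

    def post_send(self, fpdu):
        # The connection may already be reported ESTABLISHED, but the
        # send queue is stalled until the initiator has proven it is ready.
        if self.peer_in_rdma_mode:
            self.wire.append(fpdu)
        else:
            self.pending.append(fpdu)

    def on_fpdu_received(self, fpdu_is_valid):
        # Only the first valid FPDU matters for releasing the send queue.
        if not fpdu_is_valid or self.peer_in_rdma_mode:
            return
        self.peer_in_rdma_mode = True
        while self.pending:
            self.wire.append(self.pending.popleft())
```

The first option (holding off the ESTABLISHED event) would put the same check in front of event delivery instead of in front of the send queue.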



Is the issue that there's no way for the receiving FW/driver to know that this
has occurred so that it can signal that the connection has been established?
I.e. a client that does this must signal the server that things are ready
through some out of band means.

  
I don't understand what you're getting at exactly. 

The issue is that the server doesn't know when the client receives the 
MPA Start Response and has successfully transitioned the connection 
into RDMA mode.  If the server sends an FPDU immediately following the 
MPA Start Response (which is in streaming mode), then it's possible for 
that first FPDU to get passed up to the driver/ULP as streaming mode 
data.  Which breaks everything.  So, the spec says the server 
cannot send an FPDU until it first receives one and thus _knows_ the 
client is in RDMA mode (by virtue of the fact that the client sent an 
FPDU).




server sends MPA Start response with "let's do RTR and send me X", where
X could be 0B write, 0B read request or 0B send.



Are there any restrictions where a client may not be able to issue what the
server requests?  E.g. the hardware doesn't issue 0B writes.

  


Well I guess there could be.  The consensus among the iWARP vendors 
at Reno was that 0B read would be ok.  During the previous discussion on 
this list shortly after Reno, issues were raised that we should allow 
other types.


We could make the MPA start request have more info than "I can do 
RTR".   It could have "Here are the RTR msgs I can send".  Does that 
help?
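
One way to picture that idea is a capability bitmask: the client advertises which RTR messages it can send and the server picks one it can accept.  The sketch below is only an illustration.  The bit values, the preference order and the function name are assumptions; the thread does not define an encoding.

```python
from enum import IntFlag


class Rtr(IntFlag):
    """Hypothetical bit assignments for the RTR message types."""
    NONE = 0
    WRITE_0B = 1
    READ_0B = 2
    SEND_0B = 4


# Assumed preference order: 0B read first, then 0B write, then 0B send.
PREFERENCE = (Rtr.READ_0B, Rtr.WRITE_0B, Rtr.SEND_0B)


def choose_rtr(client_can_send, server_can_receive):
    """Return the RTR type the server should request, or Rtr.NONE
    when the two sides have no message type in common."""
    common = client_can_send & server_can_receive
    for candidate in PREFERENCE:
        if candidate & common:
            return candidate
    return Rtr.NONE
```

For example, a client offering `Rtr.WRITE_0B | Rtr.READ_0B` to a server that can only act on a 0B read would be asked for `Rtr.READ_0B`, while a client that can only send a 0B write would get `Rtr.NONE` from that server and p2p mode could not be used.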




Steve.






[ewg] Re: [ofa-general] Re: [PATCH] Request For Comments:

2008-05-08 Thread Steve Wise
Here is the thread where we discussed how to implement peer-to-peer for 
iWARP in Nov/2007:


http://lists.openfabrics.org/pipermail/general/2007-November/043252.html








[ewg] Re: [ofa-general] Re: [PATCH] Request For Comments:

2008-05-07 Thread Roland Dreier
   I'm just trying to define the scope of the issue here... so is there any
   conceivable real-life situation where neither a 0B read nor a 0B write
   would work, and the connection setup will have to use a 0B send?

  I'm not sure what you mean by real-life.  For the rnics we have:
  
  nes - requires 0b write
  cxgb3 - requires 0b read
  amso1100 - won't work in p2p mode
  
  So there are none that I know of that require a send for this.

I guess my question was whether we expect to ever need to worry about
the 0B send case, or whether it's just theoretical.  If no current NICs
have a problem with read or write, and future NICs will be built to a
future MPA spec, then it seems we don't have to worry about what happens
if a 0B send is done as part of connection setup.

The spurious CQE on connection failure and the private data breakage are
serious obviously.  The interoperability issues of this stuff seem
pretty painful to me.



[ewg] Re: [ofa-general] Re: [PATCH] Request For Comments:

2008-05-07 Thread Steve Wise



Sean Hefty wrote:

  nes - requires 0b write
  cxgb3 - requires 0b read
  amso1100 - won't work in p2p mode


I'm assuming by "requires" that you, uhm, mean requires, and nes couldn't do 0b
reads, or cxgb3 0b writes.

Well, I'm not sure about nes.  But cxgb3 cannot deal with receiving a 0B 
write for the RTR because the FW doesn't see incoming writes, nor does 
the driver.


nes may be able to request a 0b read, but what I meant was they 
currently use a 0B write and not a read.


So it's possible to reduce the complexity if we just mandate 0B read for 
RTR.  But it makes sense in my mind to allow the other message types...



It is painful.  But without anything, you cannot run OMPI, IMPI or
HPMPI on an iwarp cluster with mixed vendor rnics...


Is there any requirement at the receiving side, versus the initiating side?
That is, just because nes issues a 0b write, does the receiving HW care if a
read or write shows up?  Or is this restriction on both sides?

The requirement is mostly driven from the receiving side.  For cxgb3 it 
is anyway...


The receiving side, i.e. the side that issues the rdma_accept, will tell 
the sending side what RTR message to send, if any.  So the MPA exchange 
will look like this:


client sends MPA Start request with private data saying "I can send an 
RTR if you want it".


server moves connection into RDMA mode

server sends MPA Start response with "let's do RTR and send me X", where 
X could be 0B write, 0B read request or 0B send.


client moves connection into RDMA mode

client sends X and then enables SQ processing (or indicate ESTABLISHED)

Once server gets X it can enable SQ processing (or indicate ESTABLISHED)

If X was a 0B read request, server sends 0B read response.
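
The exchange above can be written out as an ordered trace.  This is a rough sketch only: the function name and the step wording are made up for illustration, and nothing here touches real hardware or the rdma-cma.

```python
def p2p_handshake(rtr):
    """Return the ordered (actor, action) steps for one connection setup.

    rtr is the message type X that the server asks for:
    "0B write", "0B read request" or "0B send".
    """
    allowed = ("0B write", "0B read request", "0B send")
    if rtr not in allowed:
        raise ValueError(f"unknown RTR type: {rtr!r}")
    steps = [
        ("client", "send MPA Start request (private data: RTR offered)"),
        ("server", "move connection into RDMA mode"),
        ("server", f"send MPA Start response (requests {rtr})"),
        ("client", "move connection into RDMA mode"),
        ("client", f"send {rtr}"),
        ("client", "enable SQ processing"),
        ("server", f"receive {rtr}"),
        ("server", "enable SQ processing"),
    ]
    # Only a read request needs an answer from the server.
    if rtr == "0B read request":
        steps.append(("server", "send 0B read response"))
    return steps
```

Printing the result with `for actor, action in p2p_handshake("0B read request"): print(actor, action)` gives the full sequence, including the read response that the write and send variants do not need.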



Steve


[ewg] Re: [ofa-general] Re: [PATCH] Request For Comments:

2008-05-06 Thread Steve Wise

Caitlin Bestler wrote:

On Tue, May 6, 2008 at 11:32 AM, Steve Wise [EMAIL PROTECTED] wrote:
  

Roland Dreier wrote:



  - always do peer2peer and don't let the app choose.  This forces
  the overhead of p2p mode on all apps, but preserves the API.

How bad is the overhead?

 - R.


  

 The client side must send a Ready To Receive message.  This will be
negotiated via the MPA exchange and the resulting RTR message may be a 0B
read + read response, 0B write, or a 0B send.  For chelsio, the 0B write
couldn't be used, and the 0B read had the least impact on the driver code,
so we used that.  For nes, they currently use a 0B write.

 Also, there are some caveats if you turn this on:

 1) private data is used to negotiate the type of RTR message and whether
it's needed.   This is more of a global module option I think, since it will
break interoperability with iwarp.  Probably will bump the MPA version number
if this option is on too.

 2) if the RTR message fails, it can generate a CQE that is unexpected.

 3) if using SEND, then a recv completion is always generated.

 Steve.






Keep in mind that even if it is a zero byte RDMA Write, it is still a distinct
packet that needs TCP handling, will occupy a buffer in various switch
queues, etc.

So while it can be about as innocuous as any TCP segment can be, it
is still an excess packet if it did not need to be sent. The overwhelming
majority of applications use a client/server model rather than peer2peer.
For them this is an excess wire packet, so I think that would make it
excessive overhead.

Secondly, the applications that need this feature will generally know
that they need it. Developers of MPI and other peer-2-peer applications
tend to know advanced networking a bit more than typical app developers.
So keeping the default to match the client/server model makes sense.
  


What are the overwhelming majority of user mode rdma applications that 
don't assume a peer2peer model?


Steve.

