[ewg] RE: [ofa-general] Re: [PATCH] Request For Comments:
The requirement is mostly driven from the receiving side. For cxgb3 it is anyway... Maybe you can help me understand the spec here. If we ignore this feature for a minute, then the side that calls rdma_connect() must instead issue the first 'send' request to the server. Can the first 'send' be a 0B rdma write or read? Why wouldn't the target of that request not have to transition to connected? Is the issue that there's no way for the receiving FW/driver to know that this has occurred so that it can signal that the connection has been established? I.e. a client that does this must signal the server that things are ready through some out of band means. server sends MPA Start response with lets do RTR and send me X where X could be 0B write, 0B read request or 0B send. Are there any restrictions where a client may not be able to issue what the server requests? E.g. the hardware doesn't issue 0B writes. - Sean ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [ofa-general] Re: [PATCH] Request For Comments:
From RFC 5044, section 7.1.2 Connection Startup Rules, Page 29:

4. MPA Responder mode implementations MUST receive and validate at least one FPDU before sending any FPDUs or Markers.

Note: This requirement is present to allow the Initiator time to get its receiver into Full Operation before an FPDU arrives, avoiding potential race conditions at the Initiator. This was also subject to some debate in the work group before rough consensus was reached. Eliminating this requirement would allow faster startup in some types of applications. However, that would also make certain implementations (particularly dual stack) much harder.

Steve Wise wrote: Sean Hefty wrote: The requirement is mostly driven from the receiving side. For cxgb3 it is anyway... Maybe you can help me understand the spec here. If we ignore this feature for a minute, then the side that calls rdma_connect() must instead issue the first 'send' request to the server. Can the first 'send' be a 0B rdma write or read?

According to the MPA IETF RFC, the initiator must send the first FPDU. That could be anything. The spec leaves it up to the ULP.

Why wouldn't the target of that request not have to transition to connected?

I don't understand this question. What does 'transition to connected' mean? The requirement is that the responder (the side that issues the rdma_accept in rdma-cma terms) _cannot_ send an FPDU until it first receives one from the initiator. How that is enforced is an implementation detail. The responder driver could hold off on the ESTABLISHED event until it receives the first FPDU. Or it could stall SQ processing until the first FPDU is received yet still indicate that the connection is ESTABLISHED.

Is the issue that there's no way for the receiving FW/driver to know that this has occurred so that it can signal that the connection has been established? I.e. a client that does this must signal the server that things are ready through some out of band means.
I don't understand what you're getting at exactly. The issue is that the server doesn't know when the client has received the MPA Start Response and has successfully transitioned the connection into RDMA mode. If the server sends an FPDU immediately following the MPA Start Response (which is in streaming mode), then it's possible for that first FPDU to get passed up to the driver/ULP as streaming mode data, which breaks everything. So the spec says the server cannot send an FPDU until it first receives one and thus _knows_ the client is in RDMA mode (by virtue of the fact that the client sent an FPDU).

server sends MPA Start response with "lets do RTR and send me X" where X could be 0B write, 0B read request or 0B send. Are there any restrictions where a client may not be able to issue what the server requests? E.g. the hardware doesn't issue 0B writes.

Well, I guess there could be. The consensus among the iWARP vendors at Reno was that a 0B read would be OK. During the previous discussion on this list shortly after Reno, issues were raised that we should allow other types. We could make the MPA Start request carry more info than "I can do RTR". It could say "Here are the RTR msgs I can send". Does that help?

Steve.
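As a rough illustration of the second enforcement option described in this message (stall SQ processing until the first FPDU is received), here is a minimal Python sketch. It is not driver code: the class `ResponderSendQueue` and its methods are hypothetical names made up for this example, and the "wire" is just a list.

```python
from collections import deque


class ResponderSendQueue:
    """Hypothetical model of a responder SQ gated on the first inbound FPDU."""

    def __init__(self):
        self.first_fpdu_received = False
        self.pending = deque()  # sends posted while the gate is still closed
        self.wire = []          # FPDUs actually transmitted, in order

    def post_send(self, payload):
        # Before the first valid FPDU arrives, queue instead of transmitting,
        # so nothing can reach a client that is still in streaming mode.
        if self.first_fpdu_received:
            self.wire.append(payload)
        else:
            self.pending.append(payload)

    def on_fpdu_received(self, fpdu_is_valid):
        # Only a validated FPDU proves the initiator is in RDMA mode.
        if not fpdu_is_valid or self.first_fpdu_received:
            return
        self.first_fpdu_received = True
        # Gate opens: flush stalled sends in the order they were posted.
        while self.pending:
            self.wire.append(self.pending.popleft())
```

The other option mentioned above, holding back the ESTABLISHED event, would move the same gate up a layer: the ULP simply would not post anything until the event is delivered.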
[ewg] Re: [ofa-general] Re: [PATCH] Request For Comments:
Here is the thread where we discussed how to implement peer-to-peer for iWARP in Nov/2007: http://lists.openfabrics.org/pipermail/general/2007-November/043252.html Steve Wise wrote: From RFC 5044, section 7.1.2 Connection Startup Rules, Page 29: 4. MPA Responder mode implementations MUST receive and validate at least one FPDU before sending any FPDUs or Markers. Note: This requirement is present to allow the Initiator time to get its receiver into Full Operation before an FPDU arrives, avoiding potential race conditions at the Initiator. This was also subject to some debate in the work group before rough consensus was reached. Eliminating this requirement would allow faster startup in some types of applications. However, that would also make certain implementations (particularly dual stack) much harder.
[ewg] Re: [ofa-general] Re: [PATCH] Request For Comments:
I'm just trying to define the scope of the issue here... so is there any conceivable real-life situation where neither a 0B read nor a 0B write would work, and the connection setup will have to use a 0B send?

I'm not sure what you mean by real-life. For the rnics we have:

nes - requires 0B write
cxgb3 - requires 0B read
amso1100 - won't work in p2p mode

So there are none that I know of that require a send for this.

I guess my question was whether we expect to ever need to worry about the 0B send case, or whether it's just theoretical. If no current NICs have a problem with read or write, and future NICs will be built to a future MPA spec, then it seems we don't have to worry about what happens if a 0B send is done as part of connection setup. The spurious CQE on connection failure and the private data breakage are serious obviously. The interoperability issues of this stuff seem pretty painful to me.
[ewg] Re: [ofa-general] Re: [PATCH] Request For Comments:
Sean Hefty wrote:

nes - requires 0B write
cxgb3 - requires 0B read
amso1100 - won't work in p2p mode

I'm assuming by requires that you, uhm, mean requires, and nes couldn't do 0B reads, or cxgb3 0B writes.

Well, I'm not sure about nes. But cxgb3 cannot deal with receiving a 0B write for the RTR because the FW doesn't see incoming writes, nor does the driver. nes may be able to request a 0B read, but what I meant was that they currently use a 0B write and not a read. So it's possible to reduce the complexity if we just mandate 0B read for RTR. But it makes sense in my mind to allow the other message types...

It is painful. But without anything, you cannot run OMPI, IMPI or HPMPI on an iwarp cluster with mixed vendor rnics...

Is there any requirement at the receiving side, versus the initiating side? That is, just because nes issues a 0B write, does the receiving HW care if a read or write shows up? Or is this restriction on both sides?

The requirement is mostly driven from the receiving side. For cxgb3 it is anyway... The receiving side, i.e. the side that issues the rdma_accept, will tell the sending side what RTR message to send, if any. So the MPA exchange will look like this:

client sends MPA Start request with private data saying "I can send an RTR if you want it".
server moves connection into RDMA mode
server sends MPA Start response with "lets do RTR and send me X" where X could be 0B write, 0B read request or 0B send.
client moves connection into RDMA mode
client sends X and then enables SQ processing (or indicates ESTABLISHED)
Once server gets X it can enable SQ processing (or indicate ESTABLISHED)
If X was a 0B read request, server sends 0B read response.

Steve
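The exchange listed in this message can be sketched as a small Python model. This is only an illustration under stated assumptions: the names `Rtr`, `choose_rtr` and `handshake` are hypothetical, the client is assumed to advertise the set of RTR messages it can send (the extension floated earlier in the thread), and raising an error when there is no common type is this sketch's choice, since the thread does not say what should happen then.

```python
from enum import Enum


class Rtr(Enum):
    WRITE_0B = "0B write"
    READ_0B = "0B read request"
    SEND_0B = "0B send"


def choose_rtr(client_offers, server_preference):
    """Return the first RTR type the server wants that the client can send."""
    for rtr in server_preference:
        if rtr in client_offers:
            return rtr
    return None


def handshake(client_offers, server_preference):
    """Return the ordered steps of the MPA/RTR exchange as strings."""
    rtr = choose_rtr(client_offers, server_preference)
    if rtr is None:
        # Assumption of this sketch: treat "no common RTR type" as an error.
        raise ValueError("no RTR message type in common")
    steps = [
        "client: MPA Start request (private data: RTR offered)",
        "server: move connection into RDMA mode",
        "server: MPA Start response (send me " + rtr.value + ")",
        "client: move connection into RDMA mode",
        "client: send " + rtr.value + ", then enable SQ processing",
        "server: got " + rtr.value + ", enable SQ processing",
    ]
    if rtr is Rtr.READ_0B:
        # Only the read variant needs a reply from the server.
        steps.append("server: send 0B read response")
    return steps
```

For example, a server that behaves like cxgb3 as described above would pass `[Rtr.READ_0B]` as its preference, and a client offering both write and read would end up sending the 0B read request.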
[ewg] Re: [ofa-general] Re: [PATCH] Request For Comments:
Caitlin Bestler wrote: On Tue, May 6, 2008 at 11:32 AM, Steve Wise [EMAIL PROTECTED] wrote: Roland Dreier wrote:

- always do peer2peer and don't let the app choose. This forces the overhead of p2p mode on all apps, but preserves the API.

How bad is the overhead? - R.

The client side must send a Ready To Receive message. This will be negotiated via the MPA exchange and the resulting RTR message may be a 0B read + read response, 0B write, or a 0B send. For chelsio, the 0B write couldn't be used, and the 0B read had the least impact on the driver code, so we used that. For nes, they currently use a 0B write. Also, there are some caveats if you turn this on:

1) private data is used to negotiate the type of RTR message and whether it's needed. This is more of a global module option I think, since it will break interoperability with iwarp. Probably will bump the MPA version number if this option is on too.
2) if the RTR message fails, it can generate a CQE that is unexpected.
3) if using SEND, then a recv completion is always generated.

Steve.

Keep in mind that even if it is a zero byte RDMA Write, it is still a distinct packet that needs TCP handling, will occupy a buffer in various switch queues, etc. So while it can be about as innocuous as any TCP segment can be, it is still an excess packet if it did not need to be sent. The overwhelming majority of applications use a client/server model rather than peer2peer. For them this is an excess wire packet, so I think that would make it excessive overhead. Secondly, the applications that need this feature will generally know that they need it. Developers of MPI and other peer-2-peer applications tend to know advanced networking a bit more than typical app developers. So keeping the default to match the client/server model makes sense.

What are the overwhelming majority of user mode rdma applications that don't assume a peer2peer model?

Steve.