-----Original Message-----
From: Steve Wise [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 21, 2007 1:07 PM
To: Kanevsky, Arkady
Cc: Glenn Grundstrom; Leonid Grossman; [EMAIL PROTECTED]
Subject: [ofa-general] Re: iWARP peer-to-peer CM proposal
Comments in-line below...
Kanevsky, Arkady wrote:
Group,
Below is a proposal for resolving the peer-to-peer iWARP CM issue
discovered at the interop event.
The main issue is that the MPA spec (the relevant portion of IETF RFC 5044
is below) requires that the connection initiator send the first message
over the established connection.
Multiple MPI implementations and several other apps use a peer-to-peer
model.
So rather than forcing all of them to do it on their own, which will
not help with interop between different implementations, the goal is to
extend the lower layers to provide it.
Our first idea was to leave the MPA protocol untouched and try to solve
this problem in iw_cm. But there are too many complications to it. First,
in order to adhere to RFC 5044, the initiator must send the first FPDU and
the responder must process it. But since the connection is already
established, processing the FPDU involves the ULP on whose behalf the
connection is created.
So either the initiator sends a message which generates a completion on
the responder CQ, and is thus visible to the ULP, or it does not.
In the latter case, the only op which can do it is an RDMA one, which
means that the responder has somehow provided the initiator with an S-tag
which it can use. So this is an extension to MPA, probably using private
data. And the responder, upon receiving it, destroys this S-tag.
In any case this is an extension of MPA.
This stag exchange isn't needed if this RDMA op is a 0B READ.
The responder waits for that 0B read and only indicates the
rdma connection is established to its ULP when it replies to
the 0B read. In this scenario, the responder/server side
doesn't consume any CQ resources.
But it would require an IRD of at least 1 to be configured on the QP.
The initiator still requires an SQ entry, and possibly a CQ
entry, for initiating the 0B read and handling completion.
But it's perhaps a little less painful than doing a SEND/RECV
exchange. The read WR could be unsignaled so that it won't
generate a CQE. But it still consumes an SQ WR slot so the
SQ would have to be sized to allow this extra WR. And I guess
the CQ would also need to be sized accordingly in case the
read failed.
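The resource accounting described above can be sketched as follows. This is only an illustration of the arithmetic, not iw_cm or verbs code: `QpSizing` and `size_for_zero_byte_read` are hypothetical names, and the sketch assumes the ULP's requested depths are simply padded by the amounts mentioned in the text.

```python
from dataclasses import dataclass


@dataclass
class QpSizing:
    """Hypothetical container for the sizes a ULP asks for."""
    sq_depth: int  # send queue work-request slots
    cq_depth: int  # completion queue entries
    ird: int       # inbound RDMA read depth


def size_for_zero_byte_read(ulp: QpSizing, is_initiator: bool) -> QpSizing:
    """Return the sizing needed once the hidden 0B read is accounted for."""
    if is_initiator:
        # One extra SQ slot for the read WR, plus one CQ entry in case the
        # (otherwise unsignaled) read completes in error.
        return QpSizing(ulp.sq_depth + 1, ulp.cq_depth + 1, ulp.ird)
    # The responder consumes no SQ/CQ resources for the handshake,
    # but it must be able to accept at least one inbound read.
    return QpSizing(ulp.sq_depth, ulp.cq_depth, max(ulp.ird, 1))
```

For example, a responder ULP that asked for an IRD of 0 would end up with an IRD of 1, while its SQ and CQ depths stay unchanged.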
In the former case, a Send is used, but this requires a receive buffer to
be posted to the RQ. But since the same RQ (a shared RQ) can be used by
other connections at the same time, the buffer posted by the responder CM
can be consumed by another connection.
This is not acceptable.
So now we consider an extension to the MPA protocol.
The goal is to be completely backwards compatible with the existing
version 1.
In a nutshell, use a "flag" in the MPA request message which indicates
that a "ready to receive" message will be sent by the requestor upon
receiving an MPA response message with connection acceptance.
Here are the changes to IETF RFC 5044:
1. Frame format:
     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 0  |                                                               |
    +         Key (16 bytes containing "MPA ID Req Frame")          +
 4  |      (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)        |
    +         Or  (16 bytes containing "MPA ID Rep Frame")          +
 8  |      (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65)        |
    +         Or  (16 bytes containing "MPA ID Rtr Frame")          +
12  |      (4D 50 41 20 49 44 20 52 74 52 20 46 72 61 6D 65)        |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
16  |M|C|R|S| Res   |     Rev       |          PD_Length            |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                                                               |
    ~                                                               ~
    ~                          Private Data                         ~
    |                                                               |
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
2. S: indicator in the Req frame of whether or not the Requestor will
send an Rtr frame.
In the Req frame, 1 means that an Rtr frame will be sent if the responder
sends a Rep frame with the accept bit set; 0 indicates that an Rtr frame
will not be sent.
In the Rep frame, 0 means that the Responder cannot support the Rtr frame,
while 1 means that it can and is waiting for it.
(While my preference is to handle this with MPA protocol version matching
rules, the proposed method will provide complete backwards compatibility.)
Unused by the Rtr frame: it is set to 0 in the Rtr frame and ignored by
the responder.
All other bits (M, C, R) and the remainder of Res are treated as in MPA
version 1.
The Rtr frame adheres to the C bit as specified in the Rep frame.
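To make the proposed header layout and S-bit rules concrete, here is a minimal sketch of packing and parsing the 20-byte frame header and deciding whether an Rtr frame is expected. The function names (`pack_frame`, `parse_frame`, `rtr_expected`) are hypothetical. The sketch assumes S takes the first of the formerly reserved bits (mask 0x10), and that R = 0 in the Rep frame means the connection was accepted, as in RFC 5044. The Rtr key uses the quoted string "MPA ID Rtr Frame"; note that the hex listing in the diagram (52 74 52) would spell "RtR", so the exact key bytes are an assumption here.

```python
import struct

# Key strings per frame type; the "rtr" key follows the quoted string
# (assumption: the hex listing in the proposal spells it slightly differently).
KEYS = {
    "req": b"MPA ID Req Frame",
    "rep": b"MPA ID Rep Frame",
    "rtr": b"MPA ID Rtr Frame",
}
FLAG_M, FLAG_C, FLAG_R, FLAG_S = 0x80, 0x40, 0x20, 0x10
HEADER = struct.Struct("!16sBBH")  # Key, flags byte, Rev, PD_Length


def pack_frame(kind, *, m=False, c=False, r=False, s=False, rev=1,
               private_data=b""):
    """Build a Req/Rep/Rtr frame in network byte order."""
    flags = 0
    for is_set, mask in ((m, FLAG_M), (c, FLAG_C), (r, FLAG_R), (s, FLAG_S)):
        if is_set:
            flags |= mask
    if kind == "rtr":
        flags &= ~FLAG_S  # S is unused in the Rtr frame and is sent as 0
    return HEADER.pack(KEYS[kind], flags, rev, len(private_data)) + private_data


def parse_frame(data):
    """Split a frame into its type, flag bits, revision and private data."""
    key, flags, rev, pd_len = HEADER.unpack_from(data)
    for kind, expected in KEYS.items():
        if key == expected:
            break
    else:
        raise ValueError("not an MPA Req/Rep/Rtr frame")
    return {
        "kind": kind,
        "m": bool(flags & FLAG_M),
        "c": bool(flags & FLAG_C),
        "r": bool(flags & FLAG_R),
        # The responder ignores S in an Rtr frame.
        "s": bool(flags & FLAG_S) and kind != "rtr",
        "rev": rev,
        "private_data": data[HEADER.size:HEADER.size + pd_len],
    }


def rtr_expected(req, rep):
    """True when the requestor sends, and the responder waits for, an Rtr."""
    accepted = not rep["r"]  # assumption: R=0 in the Rep frame means accepted
    return req["s"] and rep["s"] and accepted
```

With this reading, a responder that does not know about the extension leaves S at 0 in its Rep frame, `rtr_expected` returns False, and both sides fall back to plain version 1 behaviour.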
First, the RTR frame _must_ be an FPDU for this to work.
Thus it violates the DDP/RDMAP specs because it is an unknown
DDP/RDMAP opcode.
Second, assuming the RTR frame is sent as an FPDU, then this
won't work with existing RNIC HW. The HW will post an async
error because the incoming DDP/RDMAP opcode is unknown.
The only way I see that we can fix this for the existing RNIC
HW is to come up with some way to send a valid RDMAP message
from the initiator to the responder under the covers -and-
have the responder only indicate that the connection is
established when that FPDU is received.
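The responder-side behaviour described here, holding back the "established" indication until the initiator's first valid RDMAP message arrives, can be sketched as a tiny state machine. This is purely illustrative: `ResponderConn`, its states and `on_first_fpdu` are hypothetical names, not part of iw_cm or any RNIC driver.

```python
from enum import Enum, auto


class State(Enum):
    MPA_REPLY_SENT = auto()  # Rep frame sent, initiator has not spoken yet
    ESTABLISHED = auto()     # first FPDU seen, ULP has been told


class ResponderConn:
    """Hypothetical responder-side connection tracked by the CM."""

    def __init__(self):
        self.state = State.MPA_REPLY_SENT
        self.ulp_events = []  # events delivered to the ULP, in order

    def on_first_fpdu(self):
        """Called when the initiator's first valid RDMAP message arrives."""
        if self.state is State.MPA_REPLY_SENT:
            self.state = State.ESTABLISHED
            self.ulp_events.append("ESTABLISHED")

    def can_post_send(self):
        # Until the initiator has sent first, the responder must stay quiet.
        return self.state is State.ESTABLISHED
```

The ULP sees exactly one "ESTABLISHED" event, and only after the initiator has satisfied the send-first requirement under the covers.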
Chelsio cannot support this hack via a 0B write, but they
could support a 0B read or send/recv exchange. But as you
indicate, this is very painful and perhaps impossible to do
without impacting the ULP and breaking verbs semantics.
(that's why we punted on this a year ago :)
Steve.