On May 19, 2008, at 8:25 AM, Gleb Natapov wrote:

Is it possible to have a sane SRQ implementation without HW flow control?

It seems pretty unlikely if the only available HW flow control is to terminate the connection. ;-)

Even if we can get the iWARP semantics to work, this feels kinda
icky. Perhaps I'm overreacting and this isn't a problem that needs to
be fixed -- after all, this situation is no different than what
happens after the initial connection, but it still feels icky.

What is so icky about it? The sender is faster than the receiver, so flow control
kicks in.

My point is that we have no real flow control for SRQ.

2. The CM progress thread posts its own receive buffers when creating
a QP (which is a necessary step in both CMs).  However, this is
problematic in two cases:

[skip]

I don't like 1,2 and 3. :(

4. Have a separate mpool for drawing initial receive buffers for the
CM-posted RQs.  We'd probably want this mpool to be always empty (or
close to empty) -- it's ok to be slow to allocate / register more
memory when a new connection request arrives.  The memory obtained
from this mpool should be able to be returned to the "main" mpool
after it is consumed.

This is slightly better, but still...

Agreed; my reactions were pretty much the same as yours.

5. ...?

What about moving the posting of receive buffers into the main thread? With
SRQ it is easy: don't post anything in the CPC thread. The main thread will
prepost buffers automatically after the first fragment is received on the
endpoint (in btl_openib_handle_incoming()). With PPRQ it's more
complicated. What if we prepost dummy buffers (not from the free list)
during the IBCM connection stage and run another three-way handshake
protocol using those buffers, but from the main thread? We would need to
prepost one buffer on the active side and two buffers on the passive side.


This is probably the most viable alternative -- it would be easiest if we did this for all CPC's, not just for IBCM:

- for PPRQ: CPCs only post a small number of receive buffers, suitable for another handshake that will run in the upper-level openib BTL
- for SRQ: CPCs don't post anything (because the SRQ already "belongs" to the upper level openib BTL)

Do we have a BSRQ restriction that there *must* be at least one PPRQ? If so, we could always run the upper-level openib BTL really-post-the-buffers handshake over the smallest buffer size BSRQ RC PPRQ (i.e., have the CPC post a single receive on this QP -- see below), which would make things much easier. If we don't already have this restriction, would we mind adding it? We have one PPRQ in our default receive_queues value, anyway.
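
To make the "smallest PPRQ" idea concrete, here is a rough sketch in C of how the CPC could pick the QP that carries the handshake and decide how many receives to post on each QP at connect time. The types and function names (qp_info_t, smallest_pp_qp(), cpc_initial_recvs()) are made up for illustration and are not the openib BTL's actual data structures; the sketch also assumes the "at least one PPRQ" restriction is enforced elsewhere.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of one entry in receive_queues. */
typedef enum { QP_PER_PEER, QP_SHARED } qp_type_t;

typedef struct {
    qp_type_t type;
    size_t    buf_size;   /* receive buffer size in bytes */
} qp_info_t;

/* Return the index of the per-peer QP with the smallest buffer size,
 * or -1 if the list contains no per-peer QP at all. */
static int smallest_pp_qp(const qp_info_t *qps, int num_qps)
{
    int best = -1;
    for (int i = 0; i < num_qps; ++i) {
        if (qps[i].type != QP_PER_PEER) {
            continue;               /* SRQs belong to the upper-level BTL */
        }
        if (best < 0 || qps[i].buf_size < qps[best].buf_size) {
            best = i;
        }
    }
    return best;
}

/* Number of receives the CPC posts on QP i at connect time:
 * one on the QP chosen for the handshake, zero everywhere else. */
static int cpc_initial_recvs(const qp_info_t *qps, int num_qps, int i)
{
    assert(i >= 0 && i < num_qps);
    return (i == smallest_pp_qp(qps, num_qps)) ? 1 : 0;
}
```

If smallest_pp_qp() returns -1, the receive_queues value has no PPRQ and the scheme described here would not apply as-is.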

With this rationale, once the CPC says "ok, all BSRQ QP's are connected", then _endpoint.c can run a CTS handshake to post the "real" buffers, where each side does the following:

- CPC calls _endpoint_connected() to tell the upper level BTL that it is fully connected (the function is invoked in the main thread)
- _endpoint_connected() posts all the "real" buffers to all the BSRQ QP's on the endpoint
- _endpoint_connected() then sends a CTS control message to remote peer via smallest RC PPRQ
- upon receipt of CTS:
  - release the buffer (***)
- set endpoint state of CONNECTED and let all pending messages flow... (as it happens today)
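
The steps above can be sketched as a toy state machine in C. This is only a model under stated assumptions, not the real _endpoint.c code: endpoint_t, pair_init(), endpoint_connected() and on_cts() are hypothetical names, posting buffers is reduced to setting a flag, and the CTS "send" is modeled as a direct call into the peer's receive handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { EP_CONNECTING, EP_CONNECTED } ep_state_t;

/* Hypothetical, stripped-down endpoint used only for this sketch. */
typedef struct endpoint {
    struct endpoint *peer;
    ep_state_t state;
    bool recvs_posted;   /* "real" buffers posted on all BSRQ QPs */
    int  pending;        /* sends queued while still connecting */
    int  sent;           /* sends released to the wire */
} endpoint_t;

/* Runs in the main thread when the CTS arrives on the smallest RC PPRQ. */
static void on_cts(endpoint_t *ep)
{
    /* The peer sends CTS only after posting its receives. */
    assert(ep->peer->recvs_posted);
    /* (the CPC-posted receive buffer would be released here) */
    ep->state = EP_CONNECTED;
    ep->sent += ep->pending;     /* let all pending messages flow */
    ep->pending = 0;
}

/* Called by the CPC (in the main thread) once all QPs are connected. */
static void endpoint_connected(endpoint_t *ep)
{
    ep->recvs_posted = true;     /* stand-in for posting to every BSRQ QP */
    on_cts(ep->peer);            /* stand-in for sending the CTS message */
}

/* Wire two endpoints together, each with some queued sends. */
static void pair_init(endpoint_t *a, endpoint_t *b, int pend_a, int pend_b)
{
    a->peer = b; a->state = EP_CONNECTING; a->recvs_posted = false;
    a->pending = pend_a; a->sent = 0;
    b->peer = a; b->state = EP_CONNECTING; b->recvs_posted = false;
    b->pending = pend_b; b->sent = 0;
}
```

In this model, calling endpoint_connected() on one side moves only the *other* side to EP_CONNECTED, which matches the intent: receiving the CTS tells you the remote receives are posted, so it is now safe for you to send.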

So it actually doesn't even have to be a handshake -- it's just an additional CTS sent over the newly-created RC QP. Since it's RC, we don't have to do much -- just wait for the CTS to know that the remote side has actually posted all the receives that we expect it to have. Since the CTS flows over a PPRQ, there's no issue about receiving the CTS on an SRQ (because the SRQ may not have any buffers posted at any given time).

(***) The CTS can even be a zero byte message (maybe with inline data if we need it?); we're just waiting for the *first* message to arrive on the smallest BSRQ PPQP. Here's a dumb question (because I've never tried it and am on a plane where I can't try it) -- can you post a 0 byte buffer (or NULL) for a receive? This would make returning the buffer to the CPC much easier (i.e., you won't have to) because the CPC [thread] will post the receive, but the upper level openib BTL [main thread] will actually receive it.

We still have to solve what happens with iWARP on SRQ's, but that's really a different issue. I don't know if the iWARP vendors have thought about this much yet...?

--
Jeff Squyres
Cisco Systems
