1. When the CM progress thread completes an incoming connection, it sends a command down a pipe to the main thread indicating that a new endpoint is ready to use. The pipe message will be noticed by opal_progress() in the main thread, which will run a function to do all the necessary housekeeping (set the endpoint state to CONNECTED, etc.). But it is possible that the receiver process won't dip into the MPI layer for a long time (and therefore won't call opal_progress() and the housekeeping function). Therefore, it is possible that with an active sender and a slow receiver, the sender can overwhelm an SRQ. On IB, this will just generate RNRs and be ok (we configure SRQs to have infinite RNR retries), but I don't understand the semantics of what will happen on iWARP (it may terminate? I sent an off-list question to Steve Wise to ask for details -- we may have other issues with SRQ on iWARP if this is the case, but let's skip that discussion for now).
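
(For reference, here is a minimal sketch of that pipe handoff -- hypothetical
names, not the actual openib BTL code: the CM thread writes a small command
record into a pipe, and the main thread drains the read end from its progress
path and does the housekeeping there.)

#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>

struct cm_cmd {
    void *endpoint;     /* endpoint that just finished connecting */
    int   cmd;          /* e.g. CMD_ENDPOINT_CONNECTED */
};

static int cmd_pipe[2]; /* [0] = read end (main thread), [1] = write end (CM thread) */

static int cmd_pipe_init(void)
{
    if (pipe(cmd_pipe) != 0)
        return -1;
    /* the main thread must never block inside its progress loop */
    return fcntl(cmd_pipe[0], F_SETFL, O_NONBLOCK);
}

/* CM progress thread: connection is established, tell the main thread. */
static void cm_signal_connected(void *endpoint)
{
    struct cm_cmd c = { .endpoint = endpoint, .cmd = 1 };
    (void) write(cmd_pipe[1], &c, sizeof(c));
}

/* Main thread: called from the opal_progress() path. */
static void pipe_progress(void)
{
    struct cm_cmd c;
    while (read(cmd_pipe[0], &c, sizeof(c)) == (ssize_t) sizeof(c)) {
        /* housekeeping: set endpoint state to CONNECTED, flush queued sends, ... */
        printf("endpoint %p connected\n", c.endpoint);
    }
}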

Is it possible to have a sane SRQ implementation without HW flow control?
Anyway, the described problem exists with SRQ right now too. If the receiver
doesn't enter progress for a long time, the sender can overwhelm an SRQ.
I don't see how this can be fixed without a progress thread (and I am not
even sure that this is a problem that has to be fixed).
It may be partially resolved by srq_limit_event (this event is generated when the number of posted receive buffers drops below a predefined watermark), but I'm not sure that we want to move the RNR problem from the sender side to the receiver.

The full solution would be a progress thread + srq_limit_event.
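
(For completeness, the verbs-level mechanism being referred to: you arm a
watermark with ibv_modify_srq()/IBV_SRQ_LIMIT and get a one-shot
IBV_EVENT_SRQ_LIMIT_REACHED async event when the number of posted receives
drops below it. A rough sketch -- error handling omitted, and the refill
itself is left as a comment:)

#include <infiniband/verbs.h>

/* Arm the SRQ limit: an IBV_EVENT_SRQ_LIMIT_REACHED async event fires once
 * the number of posted receive buffers drops below 'watermark'. */
static int arm_srq_limit(struct ibv_srq *srq, unsigned watermark)
{
    struct ibv_srq_attr attr = { .srq_limit = watermark };
    return ibv_modify_srq(srq, &attr, IBV_SRQ_LIMIT);
}

/* Handle async events.  Note: ibv_get_async_event() blocks; a real progress
 * loop would poll ctx->async_fd (set non-blocking) before calling it. */
static void handle_async_event(struct ibv_context *ctx, unsigned watermark)
{
    struct ibv_async_event ev;
    if (ibv_get_async_event(ctx, &ev) == 0) {
        if (ev.event_type == IBV_EVENT_SRQ_LIMIT_REACHED) {
            /* refill the SRQ from the free list here, then re-arm
             * (the limit is one-shot): */
            arm_srq_limit(ev.element.srq, watermark);
        }
        ibv_ack_async_event(&ev);
    }
}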

Even if we can get the iWARP semantics to work, this feels kinda icky. Perhaps I'm overreacting and this isn't a problem that needs to be fixed -- after all, this situation is no different than what happens after the initial connection, but it still feels icky.
What is so icky about it? The sender is faster than the receiver, so flow
control kicks in.

2. The CM progress thread posts its own receive buffers when creating a QP (which is a necessary step in both CMs). However, this is problematic in two cases:

[skip]
I don't like 1, 2 and 3. :(
If iWARP can handle RNR, #1 sounds ok to me, at least for 1.3.
4. Have a separate mpool for drawing initial receive buffers for the
CM-posted RQs.  We'd probably want this mpool to be always empty (or
close to empty) -- it's ok to be slow to allocate / register more
memory when a new connection request arrives.  The memory obtained
from this mpool should be able to be returned to the "main" mpool
after it is consumed.

This is slightly better, but still...

5. ...?
What about moving the posting of receive buffers into the main thread? With
SRQ it is easy: don't post anything in the CPC thread. The main thread will
prepost buffers automatically after the first fragment is received on the
endpoint (in btl_openib_handle_incoming()).
It still doesn't guarantee that we will not see RNR (as I understand it, we are trying to resolve this problem for iWARP?!)

So this solution will cost 1 buffer on each SRQ ... sounds acceptable to me. But I don't see much difference compared to #1; as I understand it, we will need the pipe for communication with the main thread anyway, so why not use #1?
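
(To make the SRQ variant of #5 concrete, a rough sketch -- ibv_post_srq_recv()
is the real verbs call, everything else here is hypothetical: the CPC thread
posts nothing, and the main thread tops up the SRQ the first time it handles
an incoming fragment, i.e. from something like btl_openib_handle_incoming().)

#include <infiniband/verbs.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

struct frag {
    struct ibv_mr *mr;      /* registered memory for the receive buffer */
    void          *buf;
    size_t         len;
};

static int srq_post_one(struct ibv_srq *srq, struct frag *f)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) f->buf,
        .length = (uint32_t) f->len,
        .lkey   = f->mr->lkey,
    };
    struct ibv_recv_wr wr = {
        .wr_id   = (uintptr_t) f,
        .sg_list = &sge,
        .num_sge = 1,
    };
    struct ibv_recv_wr *bad;
    return ibv_post_srq_recv(srq, &wr, &bad);
}

/* Main thread, on the first incoming fragment for this SRQ: post the real
 * receive buffers (pre-registered frags), exactly once. */
static void maybe_prepost(struct ibv_srq *srq, bool *preposted,
                          struct frag *frags, int nfrags)
{
    if (*preposted)
        return;
    for (int i = 0; i < nfrags; i++)
        (void) srq_post_one(srq, &frags[i]);
    *preposted = true;
}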
With PPRQ it's more complicated. What if we prepost dummy buffers (not from
the free list) during the IBCM connection stage and then run another
three-way handshake protocol over those buffers, but from the main thread?
We would need to prepost one buffer on the active side and two buffers on
the passive side (the active side only receives the reply, while the passive
side receives the initial request and the final ack).

--
                        Gleb.