1. When the CM progress thread completes an incoming connection, it sends
a command down a pipe to the main thread indicating that a new
endpoint is ready to use. The pipe message will be noticed by
opal_progress() in the main thread, which will run a function to do all
the necessary housekeeping (set the endpoint state to CONNECTED, etc.;
see the sketch after this item).
But it is possible that the receiver process won't dip into the MPI
layer for a long time (and therefore won't call opal_progress() and the
housekeeping function). So with an active sender and a slow receiver,
the sender can overwhelm an SRQ. On IB, this will just generate RNRs
and be OK (we configure SRQs to have infinite RNR retries), but I don't
understand the semantics of what will happen on iWARP (it may terminate
the connection? I sent an off-list question to Steve Wise to ask for
details -- we may have other issues with SRQ on iWARP if this is the
case, but let's skip that discussion for now).
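For reference, a minimal sketch of the pipe-command mechanism from #1,
assuming a dedicated command pipe; all names here (cm_command, cmd_pipe,
endpoint_set_connected) are hypothetical illustrations, not the actual
openib BTL code:

/* Sketch: CM progress thread writes a command to a pipe; the main
 * thread drains it from its progress loop and does the housekeeping. */
#include <fcntl.h>
#include <unistd.h>

enum { CMD_ENDPOINT_CONNECTED = 1 };

struct cm_command {
    int   type;             /* CMD_ENDPOINT_CONNECTED, ... */
    void *endpoint;         /* endpoint that finished connecting */
};

static int cmd_pipe[2];     /* [0] read end (main), [1] write end (CM thread) */

int cm_pipe_init(void)
{
    if (pipe(cmd_pipe) != 0) return -1;
    /* Non-blocking read end so the progress loop never stalls on an empty pipe. */
    return fcntl(cmd_pipe[0], F_SETFL, O_NONBLOCK);
}

/* CM progress thread: connection is fully established, tell the main thread. */
void cm_thread_signal_connected(void *endpoint)
{
    struct cm_command cmd = { CMD_ENDPOINT_CONNECTED, endpoint };
    (void) write(cmd_pipe[1], &cmd, sizeof(cmd));
}

/* Main thread: called from the progress loop (e.g. via opal_progress()).
 * Drains the pipe and does the housekeeping for each completed connection. */
void main_thread_poll_commands(void)
{
    struct cm_command cmd;
    while (read(cmd_pipe[0], &cmd, sizeof(cmd)) == (ssize_t) sizeof(cmd)) {
        if (CMD_ENDPOINT_CONNECTED == cmd.type) {
            /* endpoint_set_connected(cmd.endpoint);  -- hypothetical: set state
             * to CONNECTED, prepost receives, flush pending sends, etc. */
        }
    }
}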
Is it possible to have a sane SRQ implementation without HW flow control?
Anyway, the described problem exists with SRQ right now too. If the receiver
doesn't enter progress for a long time, the sender can overwhelm an SRQ.
I don't see how this can be fixed without a progress thread (and I am not
even sure that this is a problem that has to be fixed).
It may be partially resolved by srq_limit_event (this event is
generated when the number of posted receive buffers drops below a
predefined watermark).
But I'm not sure that we want to move the RNR problem from the sender
side to the receiver side.
The full solution would be a progress thread + srq_limit_event.
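For reference, a rough sketch of how the SRQ limit event could be armed and
handled with the verbs API; ibv_modify_srq()/IBV_EVENT_SRQ_LIMIT_REACHED are
the real verbs pieces, while the watermark value and repost_srq_buffers()
are assumptions:

/* Sketch: arm the SRQ limit watermark and refill the SRQ when the
 * IBV_EVENT_SRQ_LIMIT_REACHED async event fires. */
#include <infiniband/verbs.h>

extern void repost_srq_buffers(struct ibv_srq *srq);   /* hypothetical helper */

int arm_srq_limit(struct ibv_srq *srq, uint32_t watermark)
{
    struct ibv_srq_attr attr = { .srq_limit = watermark };
    /* The limit is one-shot: it must be re-armed after every event. */
    return ibv_modify_srq(srq, &attr, IBV_SRQ_LIMIT);
}

void handle_srq_limit_event(struct ibv_context *ctx)
{
    struct ibv_async_event event;

    if (ibv_get_async_event(ctx, &event) != 0) {
        return;
    }
    if (IBV_EVENT_SRQ_LIMIT_REACHED == event.event_type) {
        struct ibv_srq *srq = event.element.srq;
        repost_srq_buffers(srq);   /* refill the SRQ */
        arm_srq_limit(srq, 16);    /* re-arm; 16 is an arbitrary example value */
    }
    ibv_ack_async_event(&event);
}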
Even if we can get the iWARP semantics to work, this feels kinda
icky. Perhaps I'm overreacting and this isn't a problem that needs to
be fixed -- after all, this situation is no different than what
happens after the initial connection, but it still feels icky.
What is so icky about it? The sender is faster than the receiver, so flow
control kicks in.
2. The CM progress thread posts its own receive buffers when creating
a QP (which is a necessary step in both CMs). However, this is
problematic in two cases:
[skip]
I don't like 1, 2, and 3. :(
If iWARP can handle RNR, #1 sounds OK to me, at least for 1.3.
4. Have a separate mpool for drawing initial receive buffers for the
CM-posted RQs. We'd probably want this mpool to be always empty (or
close to empty) -- it's ok to be slow to allocate / register more
memory when a new connection request arrives. The memory obtained
from this mpool should be returnable to the "main" mpool after it is
consumed.
This is slightly better, but still...
5. ...?
What about moving the posting of receive buffers into the main thread?
With SRQ it is easy: don't post anything in the CPC thread. The main
thread will prepost buffers automatically after the first fragment is
received on the endpoint (in btl_openib_handle_incoming(); see the
sketch below).
It still doesn't guarantee that we will not see RNR (as I understand it,
we are trying to resolve this problem for iWARP?!).
So this solution will cost one buffer on each SRQ ... sounds acceptable
to me. But I don't see much difference compared to #1; as I understand
it, we will need the pipe for communication with the main thread anyway,
so why not use #1?
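A minimal sketch of that reposting path, assuming the main thread simply
refills the SRQ when it handles an incoming fragment; ibv_post_srq_recv()
is the real verbs call, while recv_frag and frag_alloc() are hypothetical
stand-ins for the BTL's fragment free list:

/* Sketch: repost one SRQ receive buffer from the main thread's
 * incoming-fragment path. */
#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

struct recv_frag {               /* hypothetical fragment wrapper */
    struct ibv_sge     sge;      /* addr/length/lkey filled from registered memory */
    struct ibv_recv_wr wr;
};

extern struct recv_frag *frag_alloc(void);   /* hypothetical free-list call */

/* Called from the main thread (e.g. btl_openib_handle_incoming()) after a
 * fragment is pulled off the SRQ: post one buffer to replace it. */
int repost_one(struct ibv_srq *srq)
{
    struct recv_frag   *frag = frag_alloc();
    struct ibv_recv_wr *bad_wr;

    if (NULL == frag) return -1;
    frag->wr.wr_id   = (uintptr_t) frag;
    frag->wr.next    = NULL;
    frag->wr.sg_list = &frag->sge;
    frag->wr.num_sge = 1;
    return ibv_post_srq_recv(srq, &frag->wr, &bad_wr);
}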
With PPRQ it's more complicated. What if we prepost dummy buffers (not
from the free list) during the IBCM connection stage and run another
three-way handshake protocol using those buffers, but from the main
thread? We would need to prepost one buffer on the active side and two
buffers on the passive side.
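A rough sketch of what that dummy-buffer preposting could look like,
assuming small malloc'd buffers registered on the fly; ibv_reg_mr() and
ibv_post_recv() are the real verbs calls, while DUMMY_BUF_SIZE and the
function name are assumptions:

/* Sketch: prepost a few small "dummy" receive buffers on a per-peer QP
 * during the IBCM connection stage, outside the normal free list. */
#include <stdint.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

#define DUMMY_BUF_SIZE 256       /* assumed: big enough for a handshake message */

int post_dummy_recvs(struct ibv_pd *pd, struct ibv_qp *qp, int count)
{
    for (int i = 0; i < count; ++i) {
        void *buf = malloc(DUMMY_BUF_SIZE);
        if (NULL == buf) return -1;

        struct ibv_mr *mr = ibv_reg_mr(pd, buf, DUMMY_BUF_SIZE,
                                       IBV_ACCESS_LOCAL_WRITE);
        if (NULL == mr) return -1;

        struct ibv_sge sge = { .addr   = (uintptr_t) buf,
                               .length = DUMMY_BUF_SIZE,
                               .lkey   = mr->lkey };
        struct ibv_recv_wr wr = { .wr_id   = (uintptr_t) buf,
                                  .sg_list = &sge,
                                  .num_sge = 1 };
        struct ibv_recv_wr *bad_wr;

        if (ibv_post_recv(qp, &wr, &bad_wr) != 0) return -1;
    }
    return 0;   /* active side would call this with count=1, passive with count=2 */
}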
--
Gleb.