1. When the CM progress thread completes an incoming connection, it sends
a command down a pipe to the main thread indicating that a new
endpoint is ready to use. The pipe message will be noticed by
opal_progress() in the main thread, which will run a function to do all
the necessary housekeeping (set the endpoint state to CONNECTED, etc.;
see the sketch after this item).
But it is possible that the receiver process won't dip into the MPI
layer for a long time (and therefore won't call opal_progress() and the
housekeeping function). So with an active sender and a slow receiver,
the sender can overwhelm an SRQ. On IB, this will just generate RNRs
and be OK (we configure SRQs to have infinite RNR retries), but I don't
understand the semantics of what will happen on iWARP (it may terminate
the connection? I sent an off-list question to Steve Wise to ask for
details -- we may have other issues with SRQ on iWARP if this is the
case, but let's skip that discussion for now).
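For reference, a minimal sketch of the pipe-command mechanism from #1,
assuming a dedicated command pipe; all names here (cm_command, cmd_pipe,
endpoint_set_connected) are hypothetical illustrations, not the actual
openib BTL code:

/* Sketch: CM progress thread writes a command to a pipe; the main
 * thread drains it from its progress loop and does the housekeeping. */
#include <fcntl.h>
#include <unistd.h>

enum { CMD_ENDPOINT_CONNECTED = 1 };

struct cm_command {
    int   type;             /* CMD_ENDPOINT_CONNECTED, ... */
    void *endpoint;         /* endpoint that finished connecting */
};

static int cmd_pipe[2];     /* [0] read end (main), [1] write end (CM thread) */

int cm_pipe_init(void)
{
    if (pipe(cmd_pipe) != 0) return -1;
    /* Non-blocking read end so the progress loop never stalls on an empty pipe. */
    return fcntl(cmd_pipe[0], F_SETFL, O_NONBLOCK);
}

/* CM progress thread: connection is fully established, tell the main thread. */
void cm_thread_signal_connected(void *endpoint)
{
    struct cm_command cmd = { CMD_ENDPOINT_CONNECTED, endpoint };
    (void) write(cmd_pipe[1], &cmd, sizeof(cmd));
}

/* Main thread: called from the progress loop (e.g. via opal_progress()).
 * Drains the pipe and does the housekeeping for each completed connection. */
void main_thread_poll_commands(void)
{
    struct cm_command cmd;
    while (read(cmd_pipe[0], &cmd, sizeof(cmd)) == (ssize_t) sizeof(cmd)) {
        if (CMD_ENDPOINT_CONNECTED == cmd.type) {
            /* endpoint_set_connected(cmd.endpoint);  -- hypothetical: set state
             * to CONNECTED, prepost receives, flush pending sends, etc. */
        }
    }
}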
Is it possible to have a sane SRQ implementation without HW flow control?
Anyway, the described problem exists with SRQ right now too. If the receiver
doesn't enter progress for a long time, the sender can overwhelm an SRQ.
I don't see how this can be fixed without a progress thread (and I am not
even sure that this is a problem that has to be fixed).
It may be partially resolved by srq_limit_event (this event is
generated when the number of posted receive buffers drops below a
predefined watermark).
But I'm not sure that we want to move the RNR problem from the sender
side to the receiver side.
The full solution would be a progress thread + srq_limit_event.
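For reference, a rough sketch of how the SRQ limit event could be armed and
handled with the verbs API; ibv_modify_srq()/IBV_EVENT_SRQ_LIMIT_REACHED are
the real verbs pieces, while the watermark value and repost_srq_buffers()
are assumptions:

/* Sketch: arm the SRQ limit watermark and refill the SRQ when the
 * IBV_EVENT_SRQ_LIMIT_REACHED async event fires. */
#include <infiniband/verbs.h>

extern void repost_srq_buffers(struct ibv_srq *srq);   /* hypothetical helper */

int arm_srq_limit(struct ibv_srq *srq, uint32_t watermark)
{
    struct ibv_srq_attr attr = { .srq_limit = watermark };
    /* The limit is one-shot: it must be re-armed after every event. */
    return ibv_modify_srq(srq, &attr, IBV_SRQ_LIMIT);
}

void handle_srq_limit_event(struct ibv_context *ctx)
{
    struct ibv_async_event event;

    if (ibv_get_async_event(ctx, &event) != 0) {
        return;
    }
    if (IBV_EVENT_SRQ_LIMIT_REACHED == event.event_type) {
        struct ibv_srq *srq = event.element.srq;
        repost_srq_buffers(srq);   /* refill the SRQ */
        arm_srq_limit(srq, 16);    /* re-arm; 16 is an arbitrary example value */
    }
    ibv_ack_async_event(&event);
}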
Even if we can get the iWARP semantics to work, this feels kinda
icky. Perhaps I'm overreacting and this isn't a problem that needs to
be fixed -- after all, this situation is no different than what
happens after the initial connection, but it still feels icky.
What is so icky about it? The sender is faster than the receiver, so flow
control kicks in.
2. The CM progress thread posts its own receive buffers when creating
a QP (which is a necessary step in both CMs). However, this is
problematic in two cases:
[skip]
I don't like 1, 2, and 3. :(
If iWARP can handle RNR, #1 sounds OK to me, at least for 1.3.
4. Have a separate mpool for drawing initial receive buffers for the
CM-posted RQs. We'd probably want this mpool to be always empty (or
close to empty) -- it's ok to be slow to allocate / register more
memory when a new connection request arrives. The memory obtained
from this mpool should be returnable to the "main" mpool after it is
consumed.
This is slightly better, but still...
5. ...?
What about moving the posting of receive buffers into the main thread?
With SRQ it is easy: don't post anything in the CPC thread. The main
thread will prepost buffers automatically after the first fragment is
received on the endpoint (in btl_openib_handle_incoming(); see the
sketch below).
It still doesn't guarantee that we will not see RNR (as I understand it,
we are trying to resolve this problem for iWARP?!).
So this solution will cost one buffer on each SRQ ... sounds acceptable
to me. But I don't see much difference compared to #1; as I understand
it, we will need the pipe for communication with the main thread anyway,
so why not use #1?
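A minimal sketch of that reposting path, assuming the main thread simply
refills the SRQ when it handles an incoming fragment; ibv_post_srq_recv()
is the real verbs call, while recv_frag and frag_alloc() are hypothetical
stand-ins for the BTL's fragment free list:

/* Sketch: repost one SRQ receive buffer from the main thread's
 * incoming-fragment path. */
#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

struct recv_frag {               /* hypothetical fragment wrapper */
    struct ibv_sge     sge;      /* addr/length/lkey filled from registered memory */
    struct ibv_recv_wr wr;
};

extern struct recv_frag *frag_alloc(void);   /* hypothetical free-list call */

/* Called from the main thread (e.g. btl_openib_handle_incoming()) after a
 * fragment is pulled off the SRQ: post one buffer to replace it. */
int repost_one(struct ibv_srq *srq)
{
    struct recv_frag   *frag = frag_alloc();
    struct ibv_recv_wr *bad_wr;

    if (NULL == frag) return -1;
    frag->wr.wr_id   = (uintptr_t) frag;
    frag->wr.next    = NULL;
    frag->wr.sg_list = &frag->sge;
    frag->wr.num_sge = 1;
    return ibv_post_srq_recv(srq, &frag->wr, &bad_wr);
}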
With PPRQ it's more complicated. What if we prepost dummy buffers (not
from the free list) during the IBCM connection stage and run another
three-way handshake protocol using those buffers, but from the main
thread? We would need to prepost one buffer on the active side and two
buffers on the passive side.
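A rough sketch of what that dummy-buffer preposting could look like,
assuming small malloc'd buffers registered on the fly; ibv_reg_mr() and
ibv_post_recv() are the real verbs calls, while DUMMY_BUF_SIZE and the
function name are assumptions:

/* Sketch: prepost a few small "dummy" receive buffers on a per-peer QP
 * during the IBCM connection stage, outside the normal free list. */
#include <stdint.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

#define DUMMY_BUF_SIZE 256       /* assumed: big enough for a handshake message */

int post_dummy_recvs(struct ibv_pd *pd, struct ibv_qp *qp, int count)
{
    for (int i = 0; i < count; ++i) {
        void *buf = malloc(DUMMY_BUF_SIZE);
        if (NULL == buf) return -1;

        struct ibv_mr *mr = ibv_reg_mr(pd, buf, DUMMY_BUF_SIZE,
                                       IBV_ACCESS_LOCAL_WRITE);
        if (NULL == mr) return -1;

        struct ibv_sge sge = { .addr   = (uintptr_t) buf,
                               .length = DUMMY_BUF_SIZE,
                               .lkey   = mr->lkey };
        struct ibv_recv_wr wr = { .wr_id   = (uintptr_t) buf,
                                  .sg_list = &sge,
                                  .num_sge = 1 };
        struct ibv_recv_wr *bad_wr;

        if (ibv_post_recv(qp, &wr, &bad_wr) != 0) return -1;
    }
    return 0;   /* active side would call this with count=1, passive with count=2 */
}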
--
Gleb.