Sorry for the length of this mail.  It's a complex issue.  :-\

I did everything needed to give the IB CM and the RDMA CM their own progress threads to handle incoming CM traffic (which is important because both CMs have timeouts on all of their communications), and it seems to be working fine for simple examples. I posted an hg repository of this work (regularly kept in sync with the trunk):

http://www.open-mpi.org/hg/hgwebdir.cgi/jsquyres/openib-fd-progress/

But in talking to Pasha today, we realized that there are big problems that will undoubtedly show up when running anything more than trivial examples.

==> Remember that the goal for this work was to have a separate progress thread *without* all the heavyweight OMPI thread locks. Specifically: make it work in a build without --enable-progress-threads or --enable-mpi-threads (we did some preliminary testing with those options enabled, and the performance impact was significant).

1. When the CM progress thread completes an incoming connection, it sends a command down a pipe to the main thread indicating that a new endpoint is ready to use. The pipe message will be noticed by opal_progress() in the main thread, which will run a function to do all the necessary housekeeping (set the endpoint state to CONNECTED, etc.). But it is possible that the receiver process won't dip into the MPI layer for a long time (and therefore won't call opal_progress() or the housekeeping function). So with an active sender and a slow receiver, the sender can overwhelm an SRQ. On IB this will just generate RNRs and be OK (we configure SRQs with infinite RNR retries), but I don't understand the semantics of what will happen on iWARP (it may terminate? I sent an off-list question to Steve Wise asking for details -- we may have other issues with SRQ on iWARP if that is the case, but let's skip that discussion for now).

Even if we can get the iWARP semantics to work, this feels kinda icky. Perhaps I'm overreacting and this isn't a problem that needs to be fixed -- after all, this situation is no different from what happens after the initial connection -- but it still feels icky.
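
For reference, here's a minimal sketch of the kind of pipe-based handoff I'm describing in #1 (this is *not* the actual code; cm_cmd_t, cm_pipe_fds, etc. are made-up names): the CM progress thread writes a tiny command record, and the main thread drains the pipe from its progress loop and runs the housekeeping callback there.

    #include <unistd.h>
    #include <fcntl.h>

    typedef struct {
        void (*callback)(void *context);   /* e.g., mark the endpoint CONNECTED */
        void *context;                     /* e.g., the endpoint */
    } cm_cmd_t;

    static int cm_pipe_fds[2];

    /* Called once at startup; the read end is non-blocking so the main
       thread's progress loop never stalls on an empty pipe. */
    static int cm_pipe_init(void)
    {
        if (0 != pipe(cm_pipe_fds)) return -1;
        return fcntl(cm_pipe_fds[0], F_SETFL, O_NONBLOCK);
    }

    /* CM progress thread: connection is fully established; hand the
       housekeeping off to the main thread. */
    static void cm_notify_main_thread(void (*cb)(void *), void *ctx)
    {
        cm_cmd_t cmd = { cb, ctx };
        (void) write(cm_pipe_fds[1], &cmd, sizeof(cmd));
    }

    /* Main thread: called from the progress loop (i.e., via opal_progress). */
    static void cm_drain_pipe(void)
    {
        cm_cmd_t cmd;
        while (read(cm_pipe_fds[0], &cmd, sizeof(cmd)) == (ssize_t) sizeof(cmd)) {
            cmd.callback(cmd.context);
        }
    }

The key point is that all the housekeeping still happens in the main thread, which is exactly why a receiver that never calls opal_progress() never finishes it.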

2. The CM progress thread posts its own receive buffers when creating a QP (which is a necessary step in both CMs). However, this is problematic in two cases:

- If posting to an SRQ, the main thread may also be [re-]posting to the same SRQ at the same time. Those endpoint data structures therefore need to be protected.

- All receive buffers come from the mpool, and therefore those data structures need to be protected as well.

Specifically: both threads may post to the SRQ simultaneously, but the CM thread will always be the first to post to a PPRQ. So although there's no race on the PPRQ endpoint data structures, there is a potential race on the mpool data structures in both cases.
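
To make the race concrete, here's a rough sketch (hypothetical names -- recv_frag, recv_frag_list -- not the real mpool / free list code) of why the buffer free list needs protection when both the main thread and the CM progress thread post receives:

    #include <pthread.h>
    #include <infiniband/verbs.h>

    /* Hypothetical stand-in for an mpool-backed receive fragment. */
    struct recv_frag {
        struct recv_frag  *next;
        struct ibv_recv_wr wr;      /* assumed already filled in */
        struct ibv_sge     sge;
    };

    static struct recv_frag *recv_frag_list;   /* refilled from the mpool */
    static pthread_mutex_t   recv_frag_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Called by *both* the main thread and the CM progress thread. */
    static int post_one_srq_recv(struct ibv_srq *srq)
    {
        struct ibv_recv_wr *bad_wr;
        struct recv_frag *frag;

        pthread_mutex_lock(&recv_frag_lock);    /* protect the shared list */
        frag = recv_frag_list;
        if (NULL != frag) {
            recv_frag_list = frag->next;
        }
        pthread_mutex_unlock(&recv_frag_lock);

        if (NULL == frag) {
            return -1;                          /* caller must grow the pool */
        }
        return ibv_post_srq_recv(srq, &frag->wr, &bad_wr);
    }

Without the lock, two simultaneous pops can hand out the same fragment -- and the same problem exists inside the mpool's own allocation path.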

This is all a problem because we explicitly do not want to enable *all* the heavyweight threading infrastructure for OMPI. I see a few options, none of which seem attractive:

1. Somehow make it so that only the mpool and select other portions of OMPI have threading/lock support (although this seems like a slippery slope -- I can foresee implications that would make it completely meaningless to have only some thread locks enabled and not others). This is probably the least attractive option.

2. Make the IB CM and RDMA CM requests tolerant of timing out (and just restart them). This is actually a lot of work; for example, the IBCM CPC would then need to be tolerant of timing out anywhere in its 3-way handshake and starting over again. This could have serious implications for when a connection will actually be able to complete if a receiver rarely dips into the MPI layer (much worse than RDMA CM's 2-way handshake).
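
To give a flavor of what option 2 means in practice, here's a very rough sketch with entirely hypothetical helper names (cpc_send_request(), etc.): every stage of the handshake has to be prepared to see a timeout and restart the whole attempt from scratch.

    enum cpc_rc { CPC_OK, CPC_TIMEOUT, CPC_ERROR };

    /* Hypothetical stages of the IBCM CPC 3-way handshake. */
    extern enum cpc_rc cpc_send_request(void *ep);
    extern enum cpc_rc cpc_wait_for_reply(void *ep);
    extern enum cpc_rc cpc_send_rtu(void *ep);

    static enum cpc_rc cpc_connect_with_restart(void *ep, int max_attempts)
    {
        for (int i = 0; i < max_attempts; ++i) {
            enum cpc_rc rc;
            if (CPC_TIMEOUT == (rc = cpc_send_request(ep)))   continue;
            if (CPC_OK      != rc)                            return rc;
            if (CPC_TIMEOUT == (rc = cpc_wait_for_reply(ep))) continue;
            if (CPC_OK      != rc)                            return rc;
            if (CPC_TIMEOUT == (rc = cpc_send_rtu(ep)))       continue;
            return rc;   /* CPC_OK or CPC_ERROR from the final stage */
        }
        return CPC_TIMEOUT;
    }

A slow receiver could force many of these restarts before a connection finally completes.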

3. Have locks around the critical areas described in #2 above that can be enabled without --enable-<foo>-threads support (perhaps disabled at run time if we're not using a CM progress thread?).
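
Option 3 could be as simple as something like this (again, hypothetical names): the locks always exist, but are only taken when a CM progress thread is actually running, so the common single-threaded case pays essentially nothing.

    #include <pthread.h>
    #include <stdbool.h>

    static bool            cm_thread_active;   /* set by the CPC at init time */
    static pthread_mutex_t cm_lock = PTHREAD_MUTEX_INITIALIZER;

    static inline void cm_lock_acquire(void)
    {
        if (cm_thread_active) {
            pthread_mutex_lock(&cm_lock);
        }
    }

    static inline void cm_lock_release(void)
    {
        if (cm_thread_active) {
            pthread_mutex_unlock(&cm_lock);
        }
    }

The mpool allocation path and the SRQ repost path would then be wrapped in cm_lock_acquire()/cm_lock_release() regardless of how OMPI was configured.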

4. Have a separate mpool for drawing the initial receive buffers for the CM-posted RQs. We'd probably want this mpool to be always empty (or close to empty) -- it's OK to be slow to allocate / register more memory when a new connection request arrives. The memory obtained from this mpool should be returnable to the "main" mpool after it is consumed.
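
And a very rough sketch of option 4 (hypothetical names; a tiny side pool used only by the CM progress thread for the initial receives it posts, so it never touches the main mpool's data structures):

    #include <stdlib.h>
    #include <infiniband/verbs.h>

    struct cm_buf {
        void          *base;
        struct ibv_mr *mr;
        struct cm_buf *next;
    };

    /* Allocate and register one buffer on demand; used only by the CM
       progress thread, so no locking against the main mpool is needed.
       Being slow here is fine -- it only happens on a new connection. */
    static struct cm_buf *cm_buf_alloc(struct ibv_pd *pd, size_t len)
    {
        struct cm_buf *buf = calloc(1, sizeof(*buf));
        if (NULL == buf) return NULL;
        buf->base = malloc(len);
        if (NULL == buf->base) { free(buf); return NULL; }
        buf->mr = ibv_reg_mr(pd, buf->base, len, IBV_ACCESS_LOCAL_WRITE);
        if (NULL == buf->mr) { free(buf->base); free(buf); return NULL; }
        return buf;
    }

Once the main thread consumes such a buffer, it would hand the memory over to the "main" mpool rather than returning it to this side pool.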

5. ...?

Thoughts?

--
Jeff Squyres
Cisco Systems
