Sorry for the length of this mail.  It's a complex issue.  :-\

I did everything needed to give the IB CM and the RDMA CM their own progress threads to handle incoming CM traffic (which is important because both CMs have timeouts on all of their communications), and it seems to be working fine for simple examples. I posted an hg repository of this work (regularly kept in sync with the trunk):

http://www.open-mpi.org/hg/hgwebdir.cgi/jsquyres/openib-fd-progress/

But in talking to Pasha today, we realized that there are big problems that will undoubtedly show up when running anything more than trivial examples.

==> Remember that the goal for this work was to have a separate progress thread *without* all the heavyweight OMPI thread locks. Specifically: make it work in a build without --enable-progress-threads or --enable-mpi-threads (we did some preliminary testing with those options enabled, and the performance impact was significant).

1. When the CM progress thread completes an incoming connection, it sends a command down a pipe to the main thread indicating that a new endpoint is ready to use. The pipe message will be noticed by opal_progress() in the main thread, which will run a function to do all the necessary housekeeping (set the endpoint state to CONNECTED, etc.). But it is possible that the receiver process won't dip into the MPI layer for a long time (and therefore won't call opal_progress() or the housekeeping function). So with an active sender and a slow receiver, the sender can overwhelm an SRQ. On IB this will just generate RNRs and be OK (we configure SRQs with infinite RNR retries), but I don't understand the semantics of what will happen on iWARP (it may terminate? I sent an off-list question to Steve Wise asking for details -- we may have other issues with SRQ on iWARP if that is the case, but let's skip that discussion for now).

Even if we can get the iWARP semantics to work, this feels kinda icky. Perhaps I'm overreacting and this isn't a problem that needs to be fixed -- after all, this situation is no different from what happens after the initial connection -- but it still feels icky.
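
For reference, here's a minimal sketch of the kind of pipe-based handoff I'm describing in #1 (this is *not* the actual code; cm_cmd_t, cm_pipe_fds, etc. are made-up names): the CM progress thread writes a tiny command record, and the main thread drains the pipe from its progress loop and runs the housekeeping callback there.

    #include <unistd.h>
    #include <fcntl.h>

    typedef struct {
        void (*callback)(void *context);   /* e.g., mark the endpoint CONNECTED */
        void *context;                     /* e.g., the endpoint */
    } cm_cmd_t;

    static int cm_pipe_fds[2];

    /* Called once at startup; the read end is non-blocking so the main
       thread's progress loop never stalls on an empty pipe. */
    static int cm_pipe_init(void)
    {
        if (0 != pipe(cm_pipe_fds)) return -1;
        return fcntl(cm_pipe_fds[0], F_SETFL, O_NONBLOCK);
    }

    /* CM progress thread: connection is fully established; hand the
       housekeeping off to the main thread. */
    static void cm_notify_main_thread(void (*cb)(void *), void *ctx)
    {
        cm_cmd_t cmd = { cb, ctx };
        (void) write(cm_pipe_fds[1], &cmd, sizeof(cmd));
    }

    /* Main thread: called from the progress loop (i.e., via opal_progress). */
    static void cm_drain_pipe(void)
    {
        cm_cmd_t cmd;
        while (read(cm_pipe_fds[0], &cmd, sizeof(cmd)) == (ssize_t) sizeof(cmd)) {
            cmd.callback(cmd.context);
        }
    }

The key point is that all the housekeeping still happens in the main thread, which is exactly why a receiver that never calls opal_progress() never finishes it.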

2. The CM progress thread posts its own receive buffers when creating a QP (which is a necessary step in both CMs). However, this is problematic in two cases:

- If posting to an SRQ, the main thread may also be [re-]posting to the same SRQ at the same time. Those endpoint data structures therefore need to be protected.

- All receive buffers come from the mpool, and therefore those data structures need to be protected as well.

Specifically: both threads may post to the SRQ simultaneously, but the CM thread will always be the first to post to a PPRQ. So although there's no race on the PPRQ endpoint data structures, there is a potential race on the mpool data structures in both cases.
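
To make the race concrete, here's a rough sketch (hypothetical names -- recv_frag, recv_frag_list -- not the real mpool / free list code) of why the buffer free list needs protection when both the main thread and the CM progress thread post receives:

    #include <pthread.h>
    #include <infiniband/verbs.h>

    /* Hypothetical stand-in for an mpool-backed receive fragment. */
    struct recv_frag {
        struct recv_frag  *next;
        struct ibv_recv_wr wr;      /* assumed already filled in */
        struct ibv_sge     sge;
    };

    static struct recv_frag *recv_frag_list;   /* refilled from the mpool */
    static pthread_mutex_t   recv_frag_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Called by *both* the main thread and the CM progress thread. */
    static int post_one_srq_recv(struct ibv_srq *srq)
    {
        struct ibv_recv_wr *bad_wr;
        struct recv_frag *frag;

        pthread_mutex_lock(&recv_frag_lock);    /* protect the shared list */
        frag = recv_frag_list;
        if (NULL != frag) {
            recv_frag_list = frag->next;
        }
        pthread_mutex_unlock(&recv_frag_lock);

        if (NULL == frag) {
            return -1;                          /* caller must grow the pool */
        }
        return ibv_post_srq_recv(srq, &frag->wr, &bad_wr);
    }

Without the lock, two simultaneous pops can hand out the same fragment -- and the same problem exists inside the mpool's own allocation path.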

This is all a problem because we explicitly do not want to enable *all* the heavyweight threading infrastructure for OMPI. I see a few options, none of which seem attractive:

1. Somehow make it so that only the mpool and select other portions of OMPI have threading/lock support (although this seems like a slippery slope -- I can foresee implications that would make it completely meaningless to have only some thread locks enabled and not others). This is probably the least attractive option.

2. Make the IB CM and RDMA CM requests tolerant of timing out (and just restart them). This is actually a lot of work; for example, the IBCM CPC would then need to be tolerant of timing out anywhere in its 3-way handshake and starting over again. This could have serious implications for when a connection will actually be able to complete if a receiver rarely dips into the MPI layer (much worse than RDMA CM's 2-way handshake).
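
To give a flavor of what option 2 means in practice, here's a very rough sketch with entirely hypothetical helper names (cpc_send_request(), etc.): every stage of the handshake has to be prepared to see a timeout and restart the whole attempt from scratch.

    enum cpc_rc { CPC_OK, CPC_TIMEOUT, CPC_ERROR };

    /* Hypothetical stages of the IBCM CPC 3-way handshake. */
    extern enum cpc_rc cpc_send_request(void *ep);
    extern enum cpc_rc cpc_wait_for_reply(void *ep);
    extern enum cpc_rc cpc_send_rtu(void *ep);

    static enum cpc_rc cpc_connect_with_restart(void *ep, int max_attempts)
    {
        for (int i = 0; i < max_attempts; ++i) {
            enum cpc_rc rc;
            if (CPC_TIMEOUT == (rc = cpc_send_request(ep)))   continue;
            if (CPC_OK      != rc)                            return rc;
            if (CPC_TIMEOUT == (rc = cpc_wait_for_reply(ep))) continue;
            if (CPC_OK      != rc)                            return rc;
            if (CPC_TIMEOUT == (rc = cpc_send_rtu(ep)))       continue;
            return rc;   /* CPC_OK or CPC_ERROR from the final stage */
        }
        return CPC_TIMEOUT;
    }

A slow receiver could force many of these restarts before a connection finally completes.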

3. Have locks around the critical areas described in #2 above that can be enabled without --enable-<foo>-threads support (perhaps disabled at run time if we're not using a CM progress thread?).
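
Option 3 could be as simple as something like this (again, hypothetical names): the locks always exist, but are only taken when a CM progress thread is actually running, so the common single-threaded case pays essentially nothing.

    #include <pthread.h>
    #include <stdbool.h>

    static bool            cm_thread_active;   /* set by the CPC at init time */
    static pthread_mutex_t cm_lock = PTHREAD_MUTEX_INITIALIZER;

    static inline void cm_lock_acquire(void)
    {
        if (cm_thread_active) {
            pthread_mutex_lock(&cm_lock);
        }
    }

    static inline void cm_lock_release(void)
    {
        if (cm_thread_active) {
            pthread_mutex_unlock(&cm_lock);
        }
    }

The mpool allocation path and the SRQ repost path would then be wrapped in cm_lock_acquire()/cm_lock_release() regardless of how OMPI was configured.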

4. Have a separate mpool for drawing the initial receive buffers for the CM-posted RQs. We'd probably want this mpool to be always empty (or close to empty) -- it's OK to be slow to allocate / register more memory when a new connection request arrives. The memory obtained from this mpool should be returnable to the "main" mpool after it is consumed.
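
And a very rough sketch of option 4 (hypothetical names; a tiny side pool used only by the CM progress thread for the initial receives it posts, so it never touches the main mpool's data structures):

    #include <stdlib.h>
    #include <infiniband/verbs.h>

    struct cm_buf {
        void          *base;
        struct ibv_mr *mr;
        struct cm_buf *next;
    };

    /* Allocate and register one buffer on demand; used only by the CM
       progress thread, so no locking against the main mpool is needed.
       Being slow here is fine -- it only happens on a new connection. */
    static struct cm_buf *cm_buf_alloc(struct ibv_pd *pd, size_t len)
    {
        struct cm_buf *buf = calloc(1, sizeof(*buf));
        if (NULL == buf) return NULL;
        buf->base = malloc(len);
        if (NULL == buf->base) { free(buf); return NULL; }
        buf->mr = ibv_reg_mr(pd, buf->base, len, IBV_ACCESS_LOCAL_WRITE);
        if (NULL == buf->mr) { free(buf->base); free(buf); return NULL; }
        return buf;
    }

Once the main thread consumes such a buffer, it would hand the memory over to the "main" mpool rather than returning it to this side pool.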

5. ...?

Thoughts?

--
Jeff Squyres
Cisco Systems
