Patrick Geoffray wrote:
Eugene Loh wrote:
  
To recap:
1) The work is already done.
    
How do you do "directed polling" with ANY_TAG ?
Not sure I understand the question.  So, maybe we start by being explicit about what we mean by "directed polling".

Currently, the sm BTL has connection-based FIFOs.  That is, for each on-node sender/receiver (directed) pair, there is a FIFO.  For a receiver to receive messages, it needs to check its in-bound FIFOs.  It can check all in-bound FIFOs all the time to discover messages.  By "directed polling", I mean that if the user posts a receive from a specified source, we poll only the FIFO on which that message is expected.

With that in mind, let's go back to your question.  If a user posts a receive with a specified source but a wildcard tag, we go to the specified FIFO.  We check the item on the FIFO's tail.  We check if this item is the one we're looking for.  The "ANY_TAG" comes into play only here, on the matching.  It's unrelated to "directed polling", which has to do only with the source process.

Possibly you meant to ask how one does directed polling with a wildcard source (MPI_ANY_SOURCE).  If that was your question, the answer is that we punt.  We report failure to the ULP, which reverts to the standard code path.
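
To make the idea concrete, here is a minimal sketch in C of what I mean.  None of this is the actual sm BTL code; the types and names (frag_t, fifo_t, in_fifo_from, directed_poll) are made up for illustration, and the real FIFOs of course live in shared memory with the appropriate synchronization.

#include <stdbool.h>
#include <stddef.h>

#define ANY_SOURCE (-1)
#define ANY_TAG    (-1)

typedef struct {
    int tag;                /* MPI tag carried by this fragment */
    /* ... source, communicator, payload ... */
} frag_t;

typedef struct {
    frag_t **slots;         /* circular buffer of fragment pointers */
    size_t   rd;            /* next slot the receiver reads */
    size_t   mask;          /* slots is a power-of-two ring */
} fifo_t;

/* One in-bound FIFO per on-node sender ("connection-based" FIFOs). */
extern fifo_t *in_fifo_from(int src);

/* Try to satisfy a posted receive "immediately".  Returns true on success;
 * false means we punt and the caller reverts to the standard code path. */
bool directed_poll(int src, int tag, frag_t **out)
{
    if (ANY_SOURCE == src) {
        return false;                        /* don't know which FIFO to check */
    }

    fifo_t *fifo = in_fifo_from(src);        /* poll only this sender's FIFO */
    frag_t *frag = fifo->slots[fifo->rd & fifo->mask];
    if (NULL == frag) {
        return false;                        /* nothing pending from this sender */
    }

    /* ANY_TAG comes into play only here, at matching time. */
    if (ANY_TAG != tag && frag->tag != tag) {
        return false;                        /* next item is not "the one" */
    }

    fifo->slots[fifo->rd & fifo->mask] = NULL;   /* consume the slot */
    fifo->rd++;
    *out = frag;
    return true;
}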

One alternative is, of course, the single receiver queue.  I agree that that alternative has many merits.  To recap, however, the proposed optimizations are already "in the bag" (implemented in a workspace) and include some optimizations that are orthogonal to the "directed polling" (and single receiver queue) approach.  I think there are also some uncertainties about the single recv queue approach, but I guess I'll just have to prototype that alternative to explore those uncertainties.
How do you ensure you check all incoming queues from time to time to avoid flow-control problems (especially if the queues are small for scaling)?
There are a variety of choices here.  Further, I'm afraid we ultimately have to expose some of those choices to the user (MCA parameters or something).

Let's say some congestion is starting to build on some internal OMPI resource.  Arguably, we should do something to start relieving that congestion.  What if the user code then posts a rather specific request (receive a message with a particular tag on a particular communicator from a particular source) and with high urgency (blocking request... "I ain't going anywhere until you give me what I'm asking for")?  A good servant would drop whatever else s/he is doing to oblige the boss.

So, let's say there's a standard MPI_Recv.  Let's say there's also some congestion starting to build.  What should the MPI implementation do?  Alternatives include:
A) If the receive can be completed "immediately", then do so and return control to the user as soon as possible.
B) If the receive cannot be completed "immediately", fill your wait time with general housekeeping like relieving congested resources.
C) Figure out what's on the critical path and do it.

At least A should be available for the user.  Probably also B, and the RFC proposal allows for that by rolling over to the traditional code path when the request cannot be satisfied "immediately".  (That said, there are different definitions of "immediately" and different ways of implementing all this.)

The definitions I've used for "immediately" include:
*) We know which FIFO to check.
*) The message is the next item on that FIFO.
*) The message is being delivered entirely in one chunk.

I am also going to add a time-out.

One could also mix a little bit of general polling in.  Unfortunately, there is no end to all the artful tuning one could do.
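
Putting those pieces together, here is a rough sketch of the policy I have in mind for a blocking receive.  Again, the names (fast_recv, poll_all_fifos, traditional_recv, deliver) and the specific time-out and polling-mix choices are purely illustrative, not the actual ob1/sm code.

#include <stdbool.h>
#include <time.h>

typedef struct frag frag_t;                        /* opaque here */

extern bool directed_poll(int src, int tag, frag_t **out);   /* see earlier sketch */
extern void poll_all_fifos(void);                  /* hypothetical: service the other FIFOs */
extern int  traditional_recv(int src, int tag, void *buf);   /* existing code path */
extern int  deliver(frag_t *frag, void *buf);      /* hypothetical: copy out, free the slot */

int fast_recv(int src, int tag, void *buf, double timeout_sec)
{
    struct timespec t0, now;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    unsigned spins = 0;

    for (;;) {
        frag_t *frag;
        if (directed_poll(src, tag, &frag)) {
            /* A real check would also insist the message is delivered
             * entirely in one chunk before taking this path. */
            return deliver(frag, buf);             /* alternative A: done immediately */
        }
        if (0 == (++spins & 0xff)) {
            poll_all_fifos();                      /* mix in a little general polling */
        }
        clock_gettime(CLOCK_MONOTONIC, &now);
        double elapsed = (now.tv_sec - t0.tv_sec)
                       + 1.0e-9 * (now.tv_nsec - t0.tv_nsec);
        if (elapsed > timeout_sec) {
            break;                                  /* time-out: give up on the fast path */
        }
    }

    /* Alternative B: roll over to the traditional path, which does its own
     * housekeeping (including servicing the other in-bound FIFOs). */
    return traditional_recv(src, tag, buf);
}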
What about the one-sided case that Brian mentioned, where there is no corresponding receive to tell you which queue to poll?
  
I appreciate Jeff's explanation, but I still don't understand this 100%.  The receive side looks to see if it can handle the request "immediately".  It checks to see if the next item on the specified FIFO is "the one".  If it is, it completes the request.  If not, it returns control to the ULP, which rolls over to the traditional code path.

I don't 100% know how to handle the concern you and Brian raise, but I have the PML passing the flag MCA_PML_OB1_HDR_TYPE_MATCH into the BTL, saying "this is the kind of message to look for".  Does this address the concern?  The intent is that if the fast path encounters something it doesn't know how to handle, it reverts to the traditional receive code path.
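
To illustrate what I mean (and this is a cartoon, not the actual BTL interface), the fast path could simply refuse anything whose header type isn't the one the PML said to look for.  MCA_PML_OB1_HDR_TYPE_MATCH is the real flag mentioned above; the value and everything else here (hdr_t, try_fast_fragment) are illustrative.

#include <stdbool.h>
#include <stdint.h>

#define MCA_PML_OB1_HDR_TYPE_MATCH 1    /* value here is illustrative only */

typedef struct {
    uint8_t hdr_type;       /* what kind of fragment this is */
    /* ... tag, source, communicator, ... */
} hdr_t;

/* The PML tells the BTL which header type the fast path is prepared to
 * handle.  Anything else -- e.g. a one-sided fragment with no posted
 * receive behind it -- makes the fast path bail out, leaving the fragment
 * for the traditional receive/progress code to consume. */
bool try_fast_fragment(const hdr_t *hdr, uint8_t expected_type)
{
    if (hdr->hdr_type != expected_type) {
        return false;                    /* not what we were told to look for */
    }
    /* ... match tag/communicator, copy the payload, complete the request ... */
    return true;
}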
If you want to handle all the constraints, a single-queue model is much less work in the end, IMHO.
  
Again, important speedups appear to be achievable if one bypasses the PML receive-request data structure.  So, we're talking about optimizations that are orthogonal to the single-queue issue.
2) The single-queue model addresses only one of the RFC's issues.
    
The single-queue model addresses not only the latency overhead when
scaling, but also the exploding memory footprint.
Right.  Very attractive.  I'm not ruling out the single-queue model.
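
Just to put rough, purely illustrative numbers on the footprint point: with per-pair FIFOs, each of the P on-node processes carries P-1 in-bound FIFOs, so a node holds about P*(P-1) of them.  At, say, P=128 and a few KB per FIFO, that is already on the order of 100 MB of shared memory, and it grows quadratically with P.  A single in-bound queue per receiver keeps the count at P and the footprint linear.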
In my experience, the linear overhead of polling N queues very quickly
becomes greater than all the optimizations you can do on the send side.
  
Yes, and you could toss the receive-side optimizations as well.  So, one could say, "Our np=2 latency remains 2x slower than Scali's, but at least we no longer have that hideous scaling with large np."  Maybe that's where we want to end up.

