Eugene Loh wrote:

If you look in mca_btl_sm_component_progress, when a process receives a message fragment and returns it to the sender, it executes code like this:

     goto recheck_peer;
     break;

Okay, the reason I show you that code is that a static code checker should easily identify the break statement as dead code: it will never be reached. Anyhow, in English, what's happening is that if you receive a message fragment, you keep polling your FIFO. So, consider the case of half-duplex point-to-point traffic: one process only sends and the other only receives. Previously, this pattern would eventually hang. Now, it won't. But (I haven't confirmed this 100% yet) I don't think it executes very gracefully. E.g., if you have

     for ( i = 0; i < N; i++ ) {
          if ( me == 0 ) MPI_Send(...);
          if ( me == 1 ) MPI_Recv(...);
     }

At some point, the receiver falls hopelessly behind. The sender keeps pumping messages, and the receiver keeps polling its FIFO, pulling in messages and returning fragments to the sender so that the sender can keep going. The problem is that all of this happens within a single MPI_Recv call... which in a test code might pull in hundreds of thousands of messages. That MPI_Recv call won't return until the sender lets up. Then the rest of the MPI_Recv calls execute, all of them pulling messages out of the local unexpected-message queue.
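
For concreteness, here is a fleshed-out, self-contained version of that pattern (the payload size, iteration count, and tag are arbitrary choices of mine, not anything taken from the snippet above):

     /* flood.c: half-duplex point-to-point traffic with no
      * application-level flow control.  Rank 0 only sends; rank 1
      * only receives.
      * Build: mpicc flood.c -o flood    Run: mpirun -np 2 ./flood */
     #include <mpi.h>
     #include <stdio.h>

     int main(int argc, char **argv)
     {
         const int N = 100000;      /* arbitrary message count */
         char buf[64] = {0};        /* arbitrary small payload */
         int me, i;

         MPI_Init(&argc, &argv);
         MPI_Comm_rank(MPI_COMM_WORLD, &me);

         for (i = 0; i < N; i++) {
             if (me == 0)
                 MPI_Send(buf, (int)sizeof(buf), MPI_CHAR, 1, 0,
                          MPI_COMM_WORLD);
             if (me == 1)
                 MPI_Recv(buf, (int)sizeof(buf), MPI_CHAR, 0, 0,
                          MPI_COMM_WORLD, MPI_STATUS_IGNORE);
         }

         if (me == 1)
             printf("rank 1 received all %d messages\n", N);

         MPI_Finalize();
         return 0;
     }

Run over the sm BTL, rank 1's early MPI_Recv calls are where the FIFO draining described above would happen.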

I'm not sure yet how I want to manage this. The bottom line may be that if the MPI application has no flow control, the underlying MPI implementation is going to have to do something that won't make everyone happy. Oh well. At least the program makes progress and completes in reasonable time.

I spoke with Brian and Jeff about this earlier today. Presumably, up through 1.2, mca_btl_sm_component_progress would poll and, if it received a message fragment, would return. Then, presumably in 1.3.0, the behavior was changed to keep polling until the FIFO was empty. Brian said this was based on Terry's desire to keep latency as low as possible in benchmarks: reaching down into a progress call is a long code path, so it would be better to pick up multiple messages, if available on the FIFO, and queue the extras on the unexpected-message queue. A subsequent call could then find the anticipated message fragment more efficiently.

I don't see how the behavior would impact short-message pingpongs (the typical way to measure latency) one way or the other.

I asked Terry, who struggled to remember the issue and pointed me at this thread: http://www.open-mpi.org/community/lists/devel/2008/06/4158.php . But that thread concerns an issue that is solved if one keeps polling as long as one gets ACKs (while still returning as soon as a real message fragment is found).

Can anyone shed some light on the history here? Why keep polling even when a message fragment has been found? The downside of polling too aggressively is that the unexpected-message queue can grow without bound.

Brian's proposal is to set some variable that determines how many message fragments a single mca_btl_sm_component_progress call can drain from the FIFO before returning.
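
To make that concrete, here is a small self-contained mock-up of the idea (this is my sketch, not actual Open MPI code: the fragment type, the FIFO, and the drain_limit parameter are all hypothetical stand-ins). It also folds in the ACK observation from the thread above: ACKs merely return resources to the sender, so they are processed without counting against the limit. A drain_limit of 1 recovers the old return-after-first-fragment behavior, and a very large value recovers the current drain-until-empty behavior:

     #include <stdio.h>

     typedef struct { int is_ack; int seq; } frag_t;

     /* Mock FIFO: a fixed array standing in for the shared-memory
      * FIFO between a sender and this receiver. */
     static frag_t queue[8] = {
         {1, 0}, {1, 1}, {0, 2}, {0, 3}, {1, 4}, {0, 5}, {0, 6}, {0, 7}
     };
     static int head = 0, tail = 8;

     static frag_t *fifo_pop(void)
     {
         return (head < tail) ? &queue[head++] : NULL;
     }

     /* Deliver at most drain_limit real message fragments per call.
      * ACKs are processed "for free" (they only replenish the
      * sender's resources), so they do not count against the limit.
      * Returns the number of real fragments delivered. */
     static int sm_progress(int drain_limit)
     {
         int delivered = 0;
         frag_t *frag;

         while (delivered < drain_limit && (frag = fifo_pop()) != NULL) {
             if (frag->is_ack) {
                 printf("ack %d: fragment returned to sender\n", frag->seq);
                 continue;
             }
             printf("frag %d: matched or queued as unexpected\n", frag->seq);
             delivered++;
         }
         return delivered;
     }

     int main(void)
     {
         int n;
         /* drain at most two real fragments per progress call */
         while ((n = sm_progress(2)) > 0)
             printf("progress call delivered %d fragment(s)\n\n", n);
         return 0;
     }

Presumably such a limit would be exposed as an MCA parameter, so applications that hit the unbounded unexpected-queue growth could tune it down without penalizing the latency benchmarks.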

Thanks for any discussion, insight, or historical recollections.
