George Bosilca wrote:

On Jun 23, 2009, at 11:04, Eugene Loh wrote:

The sm BTL used to have two mechanisms for dealing with congested FIFOs. One was to grow the FIFOs. Another was to queue pending sends locally (on the sender's side). I think the grow-FIFO mechanism was typically invoked and the pending-send mechanism used only under extreme circumstances (no more memory).

With the sm makeover of 1.3.2, we dropped the ability to grow FIFOs. The code added complexity and there seemed to be no need to have two mechanisms to deal with congested FIFOs. In ticket 1944, however, we see that repeated collectives can produce hangs, and this seems to be due to the pending-send code not adequately dealing with congested FIFOs.

Today, when a process tries to write to a remote FIFO and fails, it queues the write as a pending send. The only condition under which it retries pending sends is when it gets a fragment back from a remote process.

I think the logic must have been that the FIFO got congested because we issued too many sends. Getting a fragment back indicates that the remote process has made progress digesting those sends. In ticket 1944, we see that a FIFO can also get congested from too many returning fragments. Further, with shared FIFOs, a FIFO could become congested due to the activity of a third-party process.

In sum, getting a fragment back from a remote process is a poor indicator that it's time to retry pending sends.

Maybe the real way to know when to retry pending sends is just to check if there's room on the FIFO.

Why is this different from "getting a fragment back"?

I'm not sure I understand your question.

Say we have two processes, A and B. Each one has a receive queue/FIFO that can be written by its peer. Let's say A sends lots of messages to B. B keeps on returning fragments to A. So, although we're saying that A sends lots of messages to B, it is A's in-bound queue that fills up. Kind of counterintuitive. Anyhow, B keeps getting more fragments to return to A. Since A's queue is full, what this means is that B adds these fragments to its (B's) own pending-send list.

So, now the question is when B should retry items on its pending-send list. Presumably, it should retry when there is room on A's queue/FIFO. But OMPI (to date) has B retry *only* when B itself gets a fragment back. What's the logic? I assume the logic was that A's queue was filled with fragments that B had sent, so getting a fragment back would be an indication of A's queue opening up.

Why is this a poor indication? (I'm assuming this is what your question was.) Two possible reasons:

1) A's queue might have been filled with fragments that B was returning to A. So, B would get no acknowledgements back from A that progress was being made depleting the queue.

2) (New with OMPI 1.3.2, now that we have shared queues): A's queue might have been filled by activity from third-party processes.

In either case, the only way B now knows whether there is room on A's queue is... to check whether the queue has room! Nothing is coming back from A to indicate that the queue is being drained.
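To make reason 1 concrete, here is a toy model, not the sm BTL code. A's FIFO starts full of fragments that B returned, A drains one entry per progress step, and A sends nothing back to B for a returned fragment. The function name, the enum, and the one-entry-per-step drain rate are all assumptions made up for illustration.

```c
#include <stdbool.h>

/* Hypothetical names: the two retry triggers under discussion. */
typedef enum { RETRY_ON_FRAGMENT_BACK, RETRY_WHEN_ROOM } retry_policy_t;

/* Toy model: returns how many of B's pending fragments are still
 * undelivered after `steps` progress steps. */
int simulate_pending_left(retry_policy_t policy, int capacity,
                          int pending, int steps)
{
    int fifo_count = capacity;  /* A's FIFO is full of returned fragments */

    for (int s = 0; s < steps; s++) {
        /* A consumes one returned fragment (assumed drain rate). */
        if (fifo_count > 0) {
            fifo_count--;
        }

        /* A owes B nothing for a returned fragment, so in this model
         * B never sees a fragment come back. */
        int fragments_back_to_b = 0;

        bool retry = (policy == RETRY_WHEN_ROOM)
                         ? (fifo_count < capacity)
                         : (fragments_back_to_b > 0);

        /* B pushes pending fragments while A's FIFO has room. */
        while (retry && pending > 0 && fifo_count < capacity) {
            fifo_count++;
            pending--;
        }
    }
    return pending;
}
```

With a capacity of 4 and 3 pending fragments, the fragment-back trigger leaves all 3 pending no matter how many steps run, while the room check delivers them within 3 steps.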

As far as I remember the code, when we get a fragment back we add it back to the LIFO, and therefore it becomes the next available fragment for a send.

Yes, indeed, but I don't understand how this is relevant. The LIFOs (the private free lists where processes maintain unused fragments) don't really enter this discussion.

So, I'll try modifying MCA_BTL_SM_FIFO_WRITE. It'll start by checking if there are pending sends. If so, it'll retry them before performing the requested write. This should also help preserve ordering a little better. I'm guessing this will not hurt our message latency in any meaningful way, but I'll check this out.
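Roughly, the write path I have in mind looks like the sketch below. This is a standalone toy, not the real MCA_BTL_SM_FIFO_WRITE macro: the toy_* types and functions, the fixed capacities, and the use of plain ints as fragments are all hypothetical. It also assumes that a new write goes behind anything still pending, which is one way to get the ordering benefit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FIFO_SLOTS  4   /* hypothetical FIFO capacity */
#define PENDING_MAX 64  /* hypothetical bound on the pending-send list */

typedef struct {
    int slots[FIFO_SLOTS];
    size_t head, tail, count;
} toy_fifo_t;

typedef struct {
    int items[PENDING_MAX];
    size_t head, count;
} toy_pending_t;

/* Write one fragment if there is room; report whether it fit. */
bool toy_fifo_try_write(toy_fifo_t *f, int frag)
{
    if (f->count == FIFO_SLOTS) {
        return false;
    }
    f->slots[f->tail] = frag;
    f->tail = (f->tail + 1) % FIFO_SLOTS;
    f->count++;
    return true;
}

/* Receiver side: pop the oldest fragment, if any. */
bool toy_fifo_read(toy_fifo_t *f, int *frag)
{
    if (f->count == 0) {
        return false;
    }
    *frag = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_SLOTS;
    f->count--;
    return true;
}

/* Push pending fragments, oldest first, until the FIFO is full. */
void toy_retry_pending(toy_fifo_t *f, toy_pending_t *p)
{
    while (p->count > 0 && toy_fifo_try_write(f, p->items[p->head])) {
        p->head = (p->head + 1) % PENDING_MAX;
        p->count--;
    }
}

void toy_pending_append(toy_pending_t *p, int frag)
{
    assert(p->count < PENDING_MAX);
    p->items[(p->head + p->count) % PENDING_MAX] = frag;
    p->count++;
}

/* Proposed write path: retry pending sends first, then try the new
 * fragment. Returns true if `frag` went straight onto the FIFO and
 * false if it was queued as a pending send. */
bool toy_send(toy_fifo_t *f, toy_pending_t *p, int frag)
{
    toy_retry_pending(f, p);
    if (p->count == 0 && toy_fifo_try_write(f, frag)) {
        return true;
    }
    toy_pending_append(p, frag);  /* keep order behind older pending sends */
    return false;
}
```

In the common case the pending list is empty, so the extra work per write is a single count check; that is why I expect the latency cost to be small, though it still needs measuring.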

Meanwhile, I wanted to check in with y'all for any guidance you might have.
