Re: [OMPI devel] trac ticket 1944 and pending sends

George Bosilca Wed, 24 Jun 2009 02:46:38 -0400

In other words, as long as a queue is peer based (peer, not peers), the management of the pending send list was doing what it was supposed to, and there was no possibility of deadlock. With the new code, as a third party can fill up a remote queue, getting a fragment back [as you stated] became a poor indicator for retry.

I don't see how the proposed solution will solve the issue without a significant overhead. As we only call MCA_BTL_SM_FIFO_WRITE once before the fragment gets into the pending list, reordering the fragments will not solve the issue. When the peer is overloaded, the fragments will end up in the pending list, and there is nothing to get them out of there except a message from the peer. In some cases, such a message might never be delivered, simply because the peer doesn't have any data to send us.

The other solution is to always check all pending lists. While this might work, it will certainly add undesirable overhead to the send path.

Your last patch was doing the right thing. Globally decreasing the size of the memory used by the MPI library is _the right_ way to go. Unfortunately, your patch only addresses this at the level of the shared memory file. Now, instead of using less memory we use even more, because we have to store that data somewhere ... in the fragments returned by the btl_sm_alloc function. These fragments are allocated on demand and by default there is no limit to the number of such fragments.

Here is a simple fix for both problems. Enforce a reasonable limit on the number of fragments in the BTL free list (1K should be more than enough), and make sure the fifo has a size equal to p * number_of_allowed_fragments_in_the_free_list, where p is the number of local processes. While this solution will certainly increase the size of the mapped file again, it will do so by a small margin compared with what is happening today in the code. On top of that, it will solve the deadlock problem, by removing the inability to return a fragment. In addition, the PML is capable of handling such situations, so we're getting back to a deadlock-free sm BTL.
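
To make the sizing argument concrete, here is a minimal C sketch of the rule above. It is not code from the sm BTL: frag_pool_t, frag_alloc, frag_return, fifo_slots_needed and fifo_bytes are hypothetical names, and the slot size passed to fifo_bytes is an assumption. The point is only that once each process's free list is capped, at most p * cap fragments can ever be in flight toward one FIFO, so a FIFO with that many slots can always accept a returned fragment.

```c
#include <assert.h>
#include <stddef.h>

/* The "1K" per-process cap suggested above. */
#define MAX_FRAGS_PER_PROC 1024

/* Hypothetical bounded free list: only the counters matter for this sketch. */
typedef struct {
    size_t in_use;  /* fragments currently handed out */
    size_t limit;   /* hard cap on fragments          */
} frag_pool_t;

/* Hand out one fragment; return 0 on success, -1 once the cap is reached. */
int frag_alloc(frag_pool_t *pool)
{
    if (pool->in_use == pool->limit) {
        return -1;  /* caller must wait for a fragment to come back */
    }
    pool->in_use++;
    return 0;
}

/* Give one fragment back to the pool. */
void frag_return(frag_pool_t *pool)
{
    assert(pool->in_use > 0);
    pool->in_use--;
}

/* Slots a FIFO needs so that every fragment owned by every local
 * process could sit in it at the same time. */
size_t fifo_slots_needed(size_t local_procs, size_t max_frags_per_proc)
{
    return local_procs * max_frags_per_proc;
}

/* Footprint of one such FIFO, assuming slot_bytes bytes per slot
 * (the slot size is an assumption, not taken from the sm BTL). */
size_t fifo_bytes(size_t local_procs, size_t max_frags_per_proc,
                  size_t slot_bytes)
{
    return fifo_slots_needed(local_procs, max_frags_per_proc) * slot_bytes;
}
```

Under these assumptions, 8 local processes with 8-byte slots would need fifo_bytes(8, MAX_FRAGS_PER_PROC, 8) = 65536 bytes per FIFO.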


  george.


On Jun 23, 2009, at 11:04, Eugene Loh wrote:

The sm BTL used to have two mechanisms for dealing with congested FIFOs. One was to grow the FIFOs. Another was to queue pending sends locally (on the sender's side). I think the grow-FIFO mechanism was typically invoked and the pending-send mechanism used only under extreme circumstances (no more memory).
With the sm makeover of 1.3.2, we dropped the ability to grow FIFOs. The code added complexity and there seemed to be no need to have two mechanisms to deal with congested FIFOs. In ticket 1944, however, we see that repeated collectives can produce hangs, and this seems to be due to the pending-send code not adequately dealing with congested FIFOs.
Today, when a process tries to write to a remote FIFO and fails, it queues the write as a pending send. The only condition under which it retries pending sends is when it gets a fragment back from a remote process.
I think the logic must have been that the FIFO got congested because we issued too many sends. Getting a fragment back indicates that the remote process has made progress digesting those sends. In ticket 1944, we see that a FIFO can also get congested from too many returning fragments. Further, with shared FIFOs, a FIFO could become congested due to the activity of a third-party process.
In sum, getting a fragment back from a remote process is a poor indicator that it's time to retry pending sends.
Maybe the real way to know when to retry pending sends is just to check if there's room on the FIFO.
So, I'll try modifying MCA_BTL_SM_FIFO_WRITE. It'll start by checking if there are pending sends. If so, it'll retry them before performing the requested write. This should also help preserve ordering a little better. I'm guessing this will not hurt our message latency in any meaningful way, but I'll check this out.
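
The write path described in the paragraph above can be sketched in C roughly as follows. This is a toy model, not the real MCA_BTL_SM_FIFO_WRITE macro: the toy_* names, the ring-buffer FIFO, integer fragment ids and both capacities are made up for illustration. It shows a write that first retries queued sends, oldest first, and only then attempts the new one, queuing it behind the others if anything is still pending.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FIFO_SLOTS   4   /* hypothetical capacity of the remote FIFO      */
#define PENDING_MAX 64   /* hypothetical bound on the local pending list  */

/* Toy stand-in for a shared-memory FIFO: a fixed ring of fragment ids. */
typedef struct {
    int    slot[FIFO_SLOTS];
    size_t head, count;
} toy_fifo_t;

/* Sender-side queue of writes that found the FIFO full. */
typedef struct {
    int    frag[PENDING_MAX];
    size_t head, count;
} toy_pending_t;

/* Append one fragment; fails (returns false) when the FIFO is full. */
bool toy_fifo_push(toy_fifo_t *f, int frag)
{
    if (f->count == FIFO_SLOTS) {
        return false;
    }
    f->slot[(f->head + f->count) % FIFO_SLOTS] = frag;
    f->count++;
    return true;
}

/* Receiver side: remove the oldest fragment, if any. */
bool toy_fifo_pop(toy_fifo_t *f, int *frag)
{
    if (f->count == 0) {
        return false;
    }
    *frag = f->slot[f->head];
    f->head = (f->head + 1) % FIFO_SLOTS;
    f->count--;
    return true;
}

/* Retry queued sends, oldest first, until the FIFO refuses one. */
void toy_retry_pending(toy_fifo_t *f, toy_pending_t *p)
{
    while (p->count > 0 && toy_fifo_push(f, p->frag[p->head])) {
        p->head = (p->head + 1) % PENDING_MAX;
        p->count--;
    }
}

/* Write path: drain pending sends first, then try the requested write.
 * Returns true if frag went straight into the FIFO, false if it was
 * queued locally. A new fragment never overtakes an older pending one. */
bool toy_fifo_write(toy_fifo_t *f, toy_pending_t *p, int frag)
{
    toy_retry_pending(f, p);
    if (p->count == 0 && toy_fifo_push(f, frag)) {
        return true;
    }
    assert(p->count < PENDING_MAX);
    p->frag[(p->head + p->count) % PENDING_MAX] = frag;
    p->count++;
    return false;
}
```

Note that in this sketch the pending list is only drained when the sender calls toy_fifo_write again for the same FIFO; a sender with nothing new to write never retries.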
Meanwhile, I wanted to check in with y'all for any guidance you might have.
