Or go to what I proposed and USE A LINKED LIST! (As I said before,
not an original idea, but one I think has merit.) Then you don't have
to size the fifo, because there isn't a fifo. Limit the number of
send fragments any one proc can allocate, and the only place memory
can grow without bound is the OB1 unexpected list. Then use
SEND_COMPLETE instead of SEND_NORMAL in the collectives without
barrier semantics (bcast, reduce, gather, scatter), and you
effectively limit how far ahead any one proc can get to something we
can handle, with no performance hit.
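Roughly the shape I have in mind, as a minimal sketch (plain C11
atomics; names like frag_t/inbox_t are made up, and real shared
memory would need offsets rather than raw pointers):

    #include <stdatomic.h>
    #include <stddef.h>

    typedef struct frag {
        struct frag *next;          /* link lives in the fragment itself */
        /* ... header/payload ... */
    } frag_t;

    typedef struct {
        _Atomic(frag_t *) head;     /* receiver-owned list of inbound frags */
    } inbox_t;

    /* Sender side: a push can never fail, so there is nothing to
       size and no pending-send queue is needed. */
    static void inbox_push(inbox_t *in, frag_t *f)
    {
        frag_t *old = atomic_load_explicit(&in->head, memory_order_relaxed);
        do {
            f->next = old;
        } while (!atomic_compare_exchange_weak_explicit(
                     &in->head, &old, f,
                     memory_order_release, memory_order_relaxed));
    }

    /* Receiver side: detach the whole list with one atomic exchange,
       then walk it at leisure (it comes back in LIFO order). */
    static frag_t *inbox_drain(inbox_t *in)
    {
        return atomic_exchange_explicit(&in->head, NULL, memory_order_acquire);
    }

Since a push always succeeds, the only thing left to bound is how many
fragments a sender can have outstanding, which is exactly the per-proc
send-fragment limit above.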
Brian
On Jun 24, 2009, at 12:46 AM, George Bosilca wrote:
In other words, as long as a queue is peer-based (one peer, not all
peers), the management of the pending-send list was doing what it was
supposed to, and there was no possibility of deadlock. With the new
code, since a third party can fill up a remote queue, getting a
fragment back [as you stated] became a poor indicator for retry.
I don't see how the proposed solution will solve the issue without
significant overhead. As we only call MCA_BTL_SM_FIFO_WRITE once
before the fragment gets onto the pending list, reordering the
fragments will not solve the issue. When the peer is overloaded, the
fragments will end up in the pending list, and there is nothing to
get them out of there except a message from the peer. In some cases,
such a message might never be delivered, simply because the peer
doesn't have any data to send us.
The other solution is to always check all pending lists. While this
might work, it will certainly add undesirable overhead to the send
path.
Your last patch was doing the right thing. Globally decreasing the
amount of memory used by the MPI library is _the right_ way to go.
Unfortunately, your patch only addresses this at the level of the
shared-memory file. Now, instead of using less memory we use even
more, because we have to store that data somewhere ... in the
fragments returned by the btl_sm_alloc function. These fragments are
allocated on demand, and by default there is no limit on the number
of such fragments.
Here is a simple fix for both problems. Enforce a reasonable limit
on the number of fragments in the BTL free list (1K should be more
than enough), and make sure the fifo has a size equal to p *
number_of_allowed_fragments_in_the_free_list, where p is the number
of local processes. While this solution will certainly increase the
size of the mapped file again, it will do so by a small margin
compared with what happens in the code today. That is without even
mentioning that it solves the deadlock problem, by removing the
possibility that a fragment cannot be returned. In addition, the PML
is capable of handling such situations, so we get back to a
deadlock-free sm BTL.
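To make the sizing concrete, a quick sketch of the rule (names are
hypothetical; the real accounting belongs in the sm BTL setup code):

    /* With at most MAX_FRAGS fragments per sender and p local
       processes, a receiving FIFO of p * MAX_FRAGS slots can never
       be full when a write is attempted, so MCA_BTL_SM_FIFO_WRITE
       cannot fail and no fragment return can ever be lost. */
    #define MAX_FRAGS 1024          /* "1K should be more than enough" */

    static size_t fifo_slots(int num_local_procs)
    {
        return (size_t) num_local_procs * MAX_FRAGS;
    }

    /* e.g. 16 local procs -> 16384 slots per FIFO; at 8 bytes per
       slot that is 128KB per receiving FIFO, or about 2MB of mapped
       file for all 16 FIFOs together. */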
george.
On Jun 23, 2009, at 11:04, Eugene Loh wrote:
The sm BTL used to have two mechanisms for dealing with congested
FIFOs. One was to grow the FIFOs. Another was to queue pending
sends locally (on the sender's side). I think the grow-FIFO
mechanism was typically invoked and the pending-send mechanism used
only under extreme circumstances (no more memory).
With the sm makeover of 1.3.2, we dropped the ability to grow
FIFOs. The code added complexity and there seemed to be no need to
have two mechanisms to deal with congested FIFOs. In ticket 1944,
however, we see that repeated collectives can produce hangs, and
this seems to be due to the pending-send code not adequately
dealing with congested FIFOs.
Today, when a process tries to write to a remote FIFO and fails, it
queues the write as a pending send. The only condition under which
it retries pending sends is when it gets a fragment back from a
remote process.
I think the logic must have been that the FIFO got congested
because we issued too many sends. Getting a fragment back
indicates that the remote process has made progress digesting those
sends. In ticket 1944, we see that a FIFO can also get congested
from too many returning fragments. Further, with shared FIFOs, a
FIFO could become congested due to the activity of a third-party
process.
In sum, getting a fragment back from a remote process is a poor
indicator that it's time to retry pending sends.
Maybe the real way to know when to retry pending sends is just to
check if there's room on the FIFO.
So, I'll try modifying MCA_BTL_SM_FIFO_WRITE. It'll start by
checking if there are pending sends. If so, it'll retry them
before performing the requested write. This should also help
preserve ordering a little better. I'm guessing this will not hurt
our message latency in any meaningful way, but I'll check this out.
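Here is roughly the shape of the change, as a self-contained sketch
(all names are simplified stand-ins, not the real sm BTL internals):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct frag { struct frag *next; /* ... */ } frag_t;

    typedef struct {                /* stand-in for the sm FIFO */
        frag_t *slot[8];
        size_t  head, tail;
    } fifo_t;

    typedef struct {                /* sender-local pending-send list */
        frag_t *head, *tail;
    } pending_t;

    static bool fifo_write(fifo_t *f, frag_t *frag)
    {
        if (f->tail - f->head == 8) return false;   /* FIFO congested */
        f->slot[f->tail++ % 8] = frag;
        return true;
    }

    static void pending_append(pending_t *q, frag_t *frag)
    {
        frag->next = NULL;
        if (q->tail) q->tail->next = frag; else q->head = frag;
        q->tail = frag;
    }

    /* Proposed behavior: drain older pending sends first, then try
       the new write; queue the new fragment only if the FIFO is
       still full.  Returns true if frag made it onto the FIFO. */
    static bool fifo_write_with_retry(fifo_t *f, pending_t *q, frag_t *frag)
    {
        while (q->head && fifo_write(f, q->head)) {
            q->head = q->head->next;            /* retry in order */
            if (NULL == q->head) q->tail = NULL;
        }
        if (NULL == q->head && fifo_write(f, frag))
            return true;
        pending_append(q, frag);                /* still congested */
        return false;
    }

Draining the pending list before the new write is what keeps the
ordering property: a fragment can never pass one that was queued
before it.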
Meanwhile, I wanted to check in with y'all for any guidance you
might have.
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel