I'm afraid this solution doesn't pass the acid test: our reproducers still lock up if we set the number of fragments to 1K and the FIFO size to p times that. In other words, adding

    -mca btl_sm_free_list_max 1024 -mca btl_sm_fifo_size p*1024

where p = ppn (processes per node) still causes our reproducers to hang. Sorry... sigh.
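For concreteness, with p = 16 such a run would look something like the following (the reproducer binary name is just a placeholder, not the actual test):

    mpirun -np 16 -mca btl_sm_free_list_max 1024 \
                  -mca btl_sm_fifo_size 16384 ./reproducer

i.e. btl_sm_fifo_size = 16 * 1024 = 16384.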
> From: George Bosilca <bosi...@eecs.utk.edu>
> Date: June 24, 2009 12:46:28 AM MDT
> To: Open MPI Developers <de...@open-mpi.org>
> Subject: Re: [OMPI devel] trac ticket 1944 and pending sends
> Reply-To: Open MPI Developers <de...@open-mpi.org>
>
> In other words, as long as a queue is peer based (peer, not peers), the management of the pending send list was doing what it was supposed to do, and there was no possibility of deadlock. With the new code, since a third party can fill up a remote queue, getting a fragment back [as you stated] became a poor indicator for retry.
>
> I don't see how the proposed solution will solve the issue without significant overhead. As we only call MCA_BTL_SM_FIFO_WRITE once before the fragment gets into the pending list, reordering the fragments will not solve the issue. When the peer is overloaded, the fragments will end up in the pending list, and there is nothing to get them out of there except a message from the peer. In some cases, such a message might never be delivered, simply because the peer doesn't have any data to send us.
>
> The other solution is to always check all pending lists. While this might work, it will certainly add undesirable overhead to the send path.
>
> Your last patch was doing the right thing. Globally decreasing the amount of memory used by the MPI library is _the right_ way to go. Unfortunately, your patch only addresses this at the level of the shared memory file. Now, instead of using less memory we use even more, because we have to store that data somewhere ... in the fragments returned by the btl_sm_alloc function. These fragments are allocated on demand, and by default there is no limit on the number of such fragments.
>
> Here is a simple fix for both problems. Enforce a reasonable limit on the number of fragments in the BTL free list (1K should be more than enough), and make sure the fifo has a size equal to p * number_of_allowed_fragments_in_the_free_list, where p is the number of local processes. While this solution will certainly increase the size of the mapped file again, it will do so by a small margin compared with what happens in the code today. That is without even mentioning that it solves the deadlock problem, by removing the inability to return a fragment. In addition, the PML is capable of handling such situations, so we're getting back to a deadlock-free sm BTL.
>
>   george.
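As I read it, the sizing rule above amounts to the following; this is only a sketch with illustrative names, not the actual sm BTL code:

    /* Sketch of the sizing rule from the quoted mail above.
     * The identifiers are illustrative, not real Open MPI symbols. */
    #include <stdio.h>

    static unsigned required_fifo_size(unsigned p, unsigned free_list_max)
    {
        /* Each of the p local peers can hold at most free_list_max
         * fragments, and every in-flight fragment occupies at most one
         * slot in my FIFO, so a FIFO with p * free_list_max slots should
         * never be full at the moment a peer tries to write. */
        return p * free_list_max;
    }

    int main(void)
    {
        printf("%u\n", required_fifo_size(16, 1024));  /* prints 16384 */
        return 0;
    }

That is the invariant our tests above were meant to exercise (and, unfortunately, still defeat).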
> On Jun 23, 2009, at 11:04, Eugene Loh wrote:
>
>> The sm BTL used to have two mechanisms for dealing with congested FIFOs. One was to grow the FIFOs. The other was to queue pending sends locally (on the sender's side). I think the grow-FIFO mechanism was typically invoked and the pending-send mechanism used only under extreme circumstances (no more memory).
>>
>> With the sm makeover of 1.3.2, we dropped the ability to grow FIFOs. The code added complexity and there seemed to be no need for two mechanisms to deal with congested FIFOs. In ticket 1944, however, we see that repeated collectives can produce hangs, and this seems to be due to the pending-send code not adequately dealing with congested FIFOs.
>>
>> Today, when a process tries to write to a remote FIFO and fails, it queues the write as a pending send. The only condition under which it retries pending sends is when it gets a fragment back from a remote process.
>>
>> I think the logic must have been that the FIFO got congested because we issued too many sends. Getting a fragment back indicates that the remote process has made progress digesting those sends. In ticket 1944, however, we see that a FIFO can also get congested from too many returning fragments. Further, with shared FIFOs, a FIFO could become congested due to the activity of a third-party process.
>>
>> In sum, getting a fragment back from a remote process is a poor indicator that it's time to retry pending sends.
>>
>> Maybe the real way to know when to retry pending sends is just to check whether there's room on the FIFO.
>>
>> So, I'll try modifying MCA_BTL_SM_FIFO_WRITE. It'll start by checking if there are pending sends. If so, it'll retry them before performing the requested write. This should also help preserve ordering a little better. I'm guessing this will not hurt our message latency in any meaningful way, but I'll check that out.
>>
>> Meanwhile, I wanted to check in with y'all for any guidance you might have.
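If I follow Eugene's plan correctly, the modified write path would look roughly like the compile-only sketch below. Every name here (sm_fifo_try_write, sm_pending_*) is a placeholder for whatever the real macros and lists are called, not the actual code in the tree:

    /* Rough sketch of the retry-before-write idea from Eugene's mail.
     * All identifiers are placeholders, not real Open MPI symbols. */
    typedef struct frag frag_t;   /* an sm BTL fragment       */
    typedef struct fifo fifo_t;   /* a peer's receive FIFO    */

    /* Assumed primitives. */
    int     sm_fifo_try_write(fifo_t *fifo, frag_t *frag);  /* 0 on success */
    int     sm_pending_empty(int peer);
    frag_t *sm_pending_pop(int peer);
    void    sm_pending_push_front(int peer, frag_t *frag);
    void    sm_pending_push_back(int peer, frag_t *frag);

    /* Proposed behaviour: before writing a new fragment, first drain any
     * fragments already queued for this peer, so pending sends are retried
     * whenever there is room in the FIFO and ordering is preserved. */
    void sm_fifo_write(fifo_t *fifo, int peer, frag_t *frag)
    {
        while (!sm_pending_empty(peer)) {
            frag_t *old = sm_pending_pop(peer);
            if (sm_fifo_try_write(fifo, old) != 0) {
                /* Still no room: put the old fragment back at the head and
                 * queue the new one behind it, keeping the original order. */
                sm_pending_push_front(peer, old);
                sm_pending_push_back(peer, frag);
                return;
            }
        }
        if (sm_fifo_try_write(fifo, frag) != 0)
            sm_pending_push_back(peer, frag);
    }

Whether the extra check on every write costs anything measurable in latency is exactly the question Eugene says he will look into.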