Or go to what I proposed and USE A LINKED LIST! (As I said before,
not an original idea, but one I think has merit.) Then you don't have
to size the fifo, because there isn't a fifo. Limit the number of
send fragments any one proc can allocate, and the only place memory
can grow without bound is the OB1 unexpected list. Then use
SEND_COMPLETE instead of SEND_NORMAL in the collectives without
barrier semantics (bcast, reduce, gather, scatter), and you
effectively limit how far ahead any one proc can get to something we
can handle, with no performance hit.
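Roughly the shape I have in mind, as a minimal sketch (plain C11
atomics; names like frag_t/inbox_t are made up, and real shared
memory would need offsets rather than raw pointers):

    #include <stdatomic.h>
    #include <stddef.h>

    typedef struct frag {
        struct frag *next;          /* link lives in the fragment itself */
        /* ... header/payload ... */
    } frag_t;

    typedef struct {
        _Atomic(frag_t *) head;     /* receiver-owned list of inbound frags */
    } inbox_t;

    /* Sender side: a push can never fail, so there is nothing to
       size and no pending-send queue is needed. */
    static void inbox_push(inbox_t *in, frag_t *f)
    {
        frag_t *old = atomic_load_explicit(&in->head, memory_order_relaxed);
        do {
            f->next = old;
        } while (!atomic_compare_exchange_weak_explicit(
                     &in->head, &old, f,
                     memory_order_release, memory_order_relaxed));
    }

    /* Receiver side: detach the whole list with one atomic exchange,
       then walk it at leisure (it comes back in LIFO order). */
    static frag_t *inbox_drain(inbox_t *in)
    {
        return atomic_exchange_explicit(&in->head, NULL, memory_order_acquire);
    }

Since a push always succeeds, the only thing left to bound is how many
fragments a sender can have outstanding, which is exactly the per-proc
send-fragment limit above.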
Brian
On Jun 24, 2009, at 12:46 AM, George Bosilca wrote:
In other words, as long as a queue is peer-based (one peer, not all
peers), the management of the pending-send list was doing what it was
supposed to, and there was no possibility of deadlock. With the new
code, since a third party can fill up a remote queue, getting a
fragment back [as you stated] became a poor indicator for retry.
I don't see how the proposed solution will solve the issue without
significant overhead. As we only call MCA_BTL_SM_FIFO_WRITE once
before the fragment gets onto the pending list, reordering the
fragments will not solve the issue. When the peer is overloaded, the
fragments will end up in the pending list, and there is nothing to
get them out of there except a message from the peer. In some cases,
such a message might never be delivered, simply because the peer
doesn't have any data to send us.
The other solution is to always check all pending lists. While this
might work, it will certainly add undesirable overhead to the send
path.
Your last patch was doing the right thing. Globally decreasing the
amount of memory used by the MPI library is _the right_ way to go.
Unfortunately, your patch only addresses this at the level of the
shared-memory file. Now, instead of using less memory we use even
more, because we have to store that data somewhere ... in the
fragments returned by the btl_sm_alloc function. These fragments are
allocated on demand, and by default there is no limit on the number
of such fragments.
Here is a simple fix for both problems. Enforce a reasonable limit
on the number of fragments in the BTL free list (1K should be more
than enough), and make sure the fifo has a size equal to p *
number_of_allowed_fragments_in_the_free_list, where p is the number
of local processes. While this solution will certainly increase the
size of the mapped file again, it will do so by a small margin
compared with what happens in the code today. That is without even
mentioning that it solves the deadlock problem, by removing the
possibility that a fragment cannot be returned. In addition, the PML
is capable of handling such situations, so we get back to a
deadlock-free sm BTL.
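To make the sizing concrete, a quick sketch of the rule (names are
hypothetical; the real accounting belongs in the sm BTL setup code):

    /* With at most MAX_FRAGS fragments per sender and p local
       processes, a receiving FIFO of p * MAX_FRAGS slots can never
       be full when a write is attempted, so MCA_BTL_SM_FIFO_WRITE
       cannot fail and no fragment return can ever be lost. */
    #define MAX_FRAGS 1024          /* "1K should be more than enough" */

    static size_t fifo_slots(int num_local_procs)
    {
        return (size_t) num_local_procs * MAX_FRAGS;
    }

    /* e.g. 16 local procs -> 16384 slots per FIFO; at 8 bytes per
       slot that is 128KB per receiving FIFO, or about 2MB of mapped
       file for all 16 FIFOs together. */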
george.
On Jun 23, 2009, at 11:04, Eugene Loh wrote:
The sm BTL used to have two mechanisms for dealing with congested
FIFOs. One was to grow the FIFOs. Another was to queue pending
sends locally (on the sender's side). I think the grow-FIFO
mechanism was typically invoked and the pending-send mechanism used
only under extreme circumstances (no more memory).
With the sm makeover of 1.3.2, we dropped the ability to grow
FIFOs. The code added complexity and there seemed to be no need to
have two mechanisms to deal with congested FIFOs. In ticket 1944,
however, we see that repeated collectives can produce hangs, and
this seems to be due to the pending-send code not adequately
dealing with congested FIFOs.
Today, when a process tries to write to a remote FIFO and fails, it
queues the write as a pending send. The only condition under which
it retries pending sends is when it gets a fragment back from a
remote process.
I think the logic must have been that the FIFO got congested
because we issued too many sends. Getting a fragment back
indicates that the remote process has made progress digesting those
sends. In ticket 1944, we see that a FIFO can also get congested
from too many returning fragments. Further, with shared FIFOs, a
FIFO could become congested due to the activity of a third-party
process.
In sum, getting a fragment back from a remote process is a poor
indicator that it's time to retry pending sends.
Maybe the real way to know when to retry pending sends is just to
check if there's room on the FIFO.
So, I'll try modifying MCA_BTL_SM_FIFO_WRITE. It'll start by
checking if there are pending sends. If so, it'll retry them
before performing the requested write. This should also help
preserve ordering a little better. I'm guessing this will not hurt
our message latency in any meaningful way, but I'll check this out.
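Here is roughly the shape of the change, as a self-contained sketch
(all names are simplified stand-ins, not the real sm BTL internals):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct frag { struct frag *next; /* ... */ } frag_t;

    typedef struct {                /* stand-in for the sm FIFO */
        frag_t *slot[8];
        size_t  head, tail;
    } fifo_t;

    typedef struct {                /* sender-local pending-send list */
        frag_t *head, *tail;
    } pending_t;

    static bool fifo_write(fifo_t *f, frag_t *frag)
    {
        if (f->tail - f->head == 8) return false;   /* FIFO congested */
        f->slot[f->tail++ % 8] = frag;
        return true;
    }

    static void pending_append(pending_t *q, frag_t *frag)
    {
        frag->next = NULL;
        if (q->tail) q->tail->next = frag; else q->head = frag;
        q->tail = frag;
    }

    /* Proposed behavior: drain older pending sends first, then try
       the new write; queue the new fragment only if the FIFO is
       still full.  Returns true if frag made it onto the FIFO. */
    static bool fifo_write_with_retry(fifo_t *f, pending_t *q, frag_t *frag)
    {
        while (q->head && fifo_write(f, q->head)) {
            q->head = q->head->next;            /* retry in order */
            if (NULL == q->head) q->tail = NULL;
        }
        if (NULL == q->head && fifo_write(f, frag))
            return true;
        pending_append(q, frag);                /* still congested */
        return false;
    }

Draining the pending list before the new write is what keeps the
ordering property: a fragment can never pass one that was queued
before it.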
Meanwhile, I wanted to check in with y'all for any guidance you
might have.
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel