George Bosilca wrote:
In other words, as long as a queue is peer based (per peer, not shared among peers),
the management of the pending send list was doing what it was
supposed to do, and there was no possibility of deadlock.
I disagree. It is true that I can fill up a remote FIFO with sends. In
such a case, when the remote process receives a fragment and returns it
to me, I have an indication that the remote FIFO has cleared up a little
bit and I can retry a pending send. But even for dedicated FIFOs, this
is not the only possibility. It is also possible that I have filled the
remote FIFO up with fragments I was returning to the remote process. In
that case, the remote process can drain its FIFO without my getting
anything back. And, that's what was happening in my test case.
Specifically, my test4.c was np=2; there were no third-party processes
to interfere. The remote FIFO was congested, I posted a pending send,
the remote FIFO cleared up, and I never knew to retry a pending send.
Broken.
With the new code, since a third party can fill up a remote queue,
getting a fragment back [as you stated] becomes a poor indicator of
when to retry.
I agree that shared queues add another dimension to the problem.
I don't see how the proposed solution will solve the issue without
significant overhead. Since we only call MCA_BTL_SM_FIFO_WRITE once
before the fragment goes onto the pending list, reordering the
fragments will not solve the issue. When the peer is overloaded, the
fragments will end up on the pending list, and there is nothing to
get them out of there except a message from the peer. In some cases,
such a message might never be delivered, simply because the peer
doesn't have any data to send us.
The other solution is to always check all pending lists. While this
might work, it will certainly add undesirable overhead to the send path.
The approach I was working on was twofold (a rough sketch follows the two items):
1) Even if I'm only sending messages, occasionally I should check my
in-bound FIFO for returning fragments. That's mom-and-apple-pie stuff,
right? I *NEED* something like this if I'm going to support unilateral
sends (a process generating lots of sends with no corresponding
receives). The overhead can be managed since we poll the FIFO only
"occasionally".
2) I need to retry pending sends more aggressively -- even if there are
dedicated FIFOs and certainly if there are shared FIFOs. I don't need
to check *all* pending queues. I can keep a counter of all pending
sends. If there are none (typical case), this should be a quick check.
If there are some, I can do the more expensive work of finding which
queue has the pending sends.
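Roughly, in C, the two ideas look something like this. This is only a
sketch of what I have in mind -- SM_POLL_INTERVAL, sm_send_count,
sm_pending_send_count, and the two helper functions are made-up names,
not existing sm BTL symbols:

  #include <stdint.h>

  #define SM_POLL_INTERVAL 64   /* how often to peek at our in-bound FIFO */

  void poll_inbound_fifo_for_returned_fragments(void);  /* hypothetical */
  void retry_pending_sends_for_congested_fifos(void);   /* hypothetical */

  static uint32_t sm_send_count;          /* sends since we last polled */
  static uint32_t sm_pending_send_count;  /* total queued pending sends */

  void sm_send_path_progress(void)
  {
      /* (1) Occasionally drain our own in-bound FIFO so returned
       *     fragments get reclaimed even if we never post receives. */
      if (++sm_send_count >= SM_POLL_INTERVAL) {
          sm_send_count = 0;
          poll_inbound_fifo_for_returned_fragments();
      }

      /* (2) Retry pending sends aggressively.  The common case is one
       *     integer compare; only when the counter is non-zero do we
       *     pay to find which per-peer queue is backed up. */
      if (sm_pending_send_count > 0) {
          retry_pending_sends_for_congested_fifos();
      }
  }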
Your last patch was doing the right thing. Globally decreasing the
amount of memory used by the MPI library is _the right_ way to go.
Unfortunately, your patch only addresses this at the level of the
shared memory file. Now, instead of using less memory, we use even
more, because we have to store that data somewhere ... in the
fragments returned by the btl_sm_alloc function. These fragments are
allocated on demand, and by default there is no limit to the number
of such fragments.
Here is a simple fix for both problems. Enforce a reasonable limit on
the number of fragments in the BTL free list (1K should be more than
enough), and make sure the fifo has a size equal to p *
number_of_allowed_fragments_in_the_free_list, where p is the number
of local processes. While this solution will certainly increase the
size of the mapped file again, it will do so by a small margin
compared with what happens in the code today. That is without even
mentioning that it will solve the deadlock problem, by removing the
possibility of being unable to return a fragment. In addition, the
PML is capable of handling such situations, so we get back to a
deadlock-free sm BTL.
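Concretely, the sizing rule is something like the following (a minimal
sketch only; the variable names are illustrative, not actual MCA
parameters):

  #include <stdio.h>

  int main(void)
  {
      const unsigned frag_limit    = 1024;  /* cap on fragments in the BTL free list */
      const unsigned n_local_procs = 8;     /* "p": processes on this node */

      /* The idea: with at most frag_limit fragments outstanding per
       * sender, a receiver FIFO of depth p * frag_limit cannot fill up
       * in a way that prevents a fragment from being returned. */
      const unsigned fifo_depth = n_local_procs * frag_limit;

      printf("per-receiver FIFO depth: %u entries\n", fifo_depth);
      return 0;
  }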
I'm open to this approach. How do you come up with your "reasonable
limit on the number of fragments"? E.g., should it depend on the number
of peers? 1K sounds generous for np=2, but less so for np=512.
I don't see how the overall memory consumption will be reduced. We push
the problem from the shared memory area to the BTL's pending sends and
now to the PML's pending sends. The fact remains that if the
application is stuffing a lot of messages into the system, either MPI
has to buffer them or the application sees less progress. The only
exception is if OMPI is not reclaiming returned fragments, but we need
to fix that problem anyhow.
Still, I like the solution because it pushes this problem up to the PML
(which is not my responsibility!). It makes sense to manage all these
issues in one place (like the PML) rather than in multiple places.
Further, it appears that the PML is doing The Right Thing today
(retrying pending sends aggressively and calling the progress engine
when sends stall).
I'll play around with your proposal. I like it.
On Jun 23, 2009, at 11:04, Eugene Loh wrote:
The sm BTL used to have two mechanisms for dealing with congested
FIFOs. One was to grow the FIFOs. Another was to queue pending
sends locally (on the sender's side). I think the grow-FIFO
mechanism was typically invoked and the pending-send mechanism used
only under extreme circumstances (no more memory).
With the sm makeover of 1.3.2, we dropped the ability to grow
FIFOs. The code added complexity and there seemed to be no need to
have two mechanisms to deal with congested FIFOs. In ticket 1944,
however, we see that repeated collectives can produce hangs, and
this seems to be due to the pending-send code not adequately dealing
with congested FIFOs.
Today, when a process tries to write to a remote FIFO and fails, it
queues the write as a pending send. The only condition under which
it retries pending sends is when it gets a fragment back from a
remote process.
I think the logic must have been that the FIFO got congested because
we issued too many sends. Getting a fragment back indicates that
the remote process has made progress digesting those sends. In
ticket 1944, we see that a FIFO can also get congested from too many
returning fragments. Further, with shared FIFOs, a FIFO could
become congested due to the activity of a third-party process.
In sum, getting a fragment back from a remote process is a poor
indicator that it's time to retry pending sends.
Maybe the real way to know when to retry pending sends is just to
check if there's room on the FIFO.
So, I'll try modifying MCA_BTL_SM_FIFO_WRITE. It'll start by
checking if there are pending sends. If so, it'll retry them before
performing the requested write. This should also help preserve
ordering a little better. I'm guessing this will not hurt our
message latency in any meaningful way, but I'll check this out.
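Something along these lines -- a rough sketch only, not the real
macro; sm_fifo_t and the helper functions below are hypothetical
stand-ins for whatever the sm BTL actually uses:

  #include <stddef.h>

  typedef struct sm_fifo sm_fifo_t;

  int   fifo_write_nb(sm_fifo_t *fifo, void *frag);  /* 0 on success, nonzero if FIFO is full */
  void *pending_send_head(sm_fifo_t *fifo);          /* oldest pending frag, or NULL */
  void  pending_send_pop(sm_fifo_t *fifo);
  int   queue_pending_send(sm_fifo_t *fifo, void *frag);

  int sm_fifo_write_with_retry(sm_fifo_t *fifo, void *frag)
  {
      void *queued;

      /* Drain older pending sends first; this also preserves ordering. */
      while (NULL != (queued = pending_send_head(fifo))) {
          if (0 != fifo_write_nb(fifo, queued)) {
              /* Still congested: the new fragment joins the queue too. */
              return queue_pending_send(fifo, frag);
          }
          pending_send_pop(fifo);
      }

      /* No backlog (or it just drained): try the requested write. */
      if (0 != fifo_write_nb(fifo, frag)) {
          return queue_pending_send(fifo, frag);
      }
      return 0;
  }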
Meanwhile, I wanted to check in with y'all for any guidance you
might have.