Eugene Loh wrote:
Brian W. Barrett wrote:
All -
Jeff, Eugene, and I had a long discussion this morning on the sm BTL
flow management issues and came to a couple of conclusions.
* Jeff, Eugene, and I are all convinced that Eugene's addition of
polling the receive queue to drain acks when sends start backing up
is required for deadlock avoidance.
* We're also convinced that George's proposal, while a good idea in
general, is not sufficient. The send path doesn't appear to progress
the BTL enough to avoid the deadlocks we're seeing with the sm BTL
today. Therefore, while I still recommend sizing the FIFO
appropriately and limiting the free list size, I don't think that
alone will solve all the problems.
* Finally, it took an hour, but we did determine one of the major
differences between 1.2.8 and 1.3.0 in terms of sm is how messages
are pulled off the FIFO. In 1.2.8 (and all earlier versions), we
return from btl_progress after a single message is received (ACK or
message) or the FIFO is empty. In 1.3.0 (pre-srq work Eugene did),
we changed to completely draining all queues before returning from
btl_progress. This has led to a situation where a single call to
btl_progress can make a large number of callbacks into the PML
(900,000 times in one of Eugene's test cases). The change was made to
resolve an issue Terry was having with performance of a benchmark.
We've decided that it would be advantageous to try something between
the two points and drain at most X messages from the queue, then
return, where X is 100 or so. This should cover the performance
issues Terry saw while still avoiding the huge number of messages
being added to the unexpected queue by a single call to MPI_Recv.
Since a recv that is matched on the unexpected queue
doesn't result in a call to opal_progress, this should help balance
the load a little bit better. Eugene's going to take a stab at
implementing this short term.
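For concreteness, here is a minimal, self-contained C sketch of that
bounded-drain idea. The types and names below are invented for
illustration; this is not the actual sm BTL code, just the shape of
the change being proposed.

#include <stdio.h>
#include <stddef.h>

#define DRAIN_LIMIT 100          /* the "X is 100 or so" from the discussion */
#define FIFO_SIZE   4096

typedef enum { FRAG_ACK, FRAG_DATA } frag_type_t;
typedef struct { frag_type_t type; } frag_t;

typedef struct {
    frag_t *slots[FIFO_SIZE];    /* illustrative ring buffer, not the real sm FIFO */
    size_t  head, tail;
} fifo_t;

static int fifo_push(fifo_t *f, frag_t *frag)
{
    size_t next = (f->tail + 1) % FIFO_SIZE;
    if (next == f->head) return -1;          /* full */
    f->slots[f->tail] = frag;
    f->tail = next;
    return 0;
}

static frag_t *fifo_pop(fifo_t *f)
{
    frag_t *frag;
    if (f->head == f->tail) return NULL;     /* empty */
    frag = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_SIZE;
    return frag;
}

/* Handle at most DRAIN_LIMIT fragments per call: ACKs would return send
 * fragments to the free list, data fragments would be handed up to the
 * PML (here we only count them). */
static int sm_progress(fifo_t *fifo, int *acks, int *deliveries)
{
    int handled = 0;
    frag_t *frag;

    while (handled < DRAIN_LIMIT && (frag = fifo_pop(fifo)) != NULL) {
        if (FRAG_ACK == frag->type) ++*acks;
        else                        ++*deliveries;
        ++handled;
    }
    return handled;
}

int main(void)
{
    static fifo_t fifo;                      /* zero-initialized: empty */
    static frag_t data = { FRAG_DATA }, ack = { FRAG_ACK };
    int acks = 0, deliveries = 0, i;

    for (i = 0; i < 1000; ++i)
        fifo_push(&fifo, (0 == i % 4) ? &ack : &data);

    /* Each call handles at most DRAIN_LIMIT fragments, so a flooded rank
     * keeps returning to its caller instead of pushing everything onto
     * the unexpected queue in one shot. */
    while (sm_progress(&fifo, &acks, &deliveries) > 0)
        printf("progress: acks=%d deliveries=%d\n", acks, deliveries);

    return 0;
}

With DRAIN_LIMIT set to 1 this collapses to the 1.2.8-style behavior,
and with it set very large it approaches the 1.3.0 drain-everything
behavior; 100 or so is the middle ground being suggested.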
I checked with Terry and we can't really recover the history here.
Perhaps draining ACKs is good enough. After the first message, we can
return.
OK, recovering history here, not sure it matters though. First, the
performance issue George and I discussed and fixed is documented in
the thread http://www.open-mpi.org/community/lists/devel/2008/06/4158.php
As was mentioned, this was only to retrieve ACK packets and should not
have any bearing on expanding the unexpected queue. The original
change was r18724 and did not add the line 432 mentioned below.
That's a one-line change. Just comment out line 432 ("goto
recheck_peer;") in
https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/sm/btl_sm_component.c#432 .
Line 432 was introduced by r19309 to fix ticket #1378. However,
something more is going on, because Eugene's experiments show that
removing this line doesn't help reduce the number of unexpecteds.
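For anyone without the source handy, here is a rough, hypothetical C
sketch of the control-flow shape under discussion. The real progress
function in btl_sm_component.c is far more involved; only the
recheck_peer label and the goto correspond to the line 432 being
talked about, everything else is an invented stand-in.

#include <stdio.h>

#define NUM_PEERS 2

static int pending[NUM_PEERS] = { 5, 3 };    /* fragments waiting per peer */

static int fifo_read(int peer)               /* stand-in: 1 if a fragment was read */
{
    if (pending[peer] > 0) { --pending[peer]; return 1; }
    return 0;
}

static int sm_progress_once(void)
{
    int peer, count = 0;

    for (peer = 0; peer < NUM_PEERS; ++peer) {
recheck_peer:
        if (!fifo_read(peer))
            continue;                        /* this peer's FIFO is empty */
        ++count;                             /* a real BTL would hand the fragment
                                                to the PML callback here */
        goto recheck_peer;                   /* the "line 432" behavior: drain this
                                                peer completely before moving on;
                                                comment this out to return after one
                                                fragment per peer */
    }
    return count;
}

int main(void)
{
    printf("fragments handled in one progress call: %d\n", sm_progress_once());
    return 0;
}

With the goto in place this toy example handles all 8 queued fragments
in one progress call; with the goto commented out it handles one
fragment per peer (2 total) and returns, which is the 1.2.8-style
behavior being compared against.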
Problem is, that doesn't "fix" things. That is, my deadlock avoidance
stuff (hg workspace on milliways that I sent out a pointer to) seems
to be enough to, well, avoid deadlock, but unexpected-message queues
are still growing like mad, I think, even when sm progress returns
after the first message fragment is received (X=1). I think it's
even true if the max free-list size is capped at something small. I
*think* (but am too tired to "know") that the issue is that we poll
the FIFO often anyhow. We have to for sends, to reclaim fragments. We
have to for receives, to pull out messages of interest. Maybe things
would be better if we had one FIFO for incoming fragments and another
for returning fragments. We could poll the latter only when we needed
another fragment for sending.
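A minimal sketch of that split-FIFO idea, with made-up names and
structures (this is not the sm BTL's actual layout): new data
fragments and returned send fragments arrive on separate queues, so
the send path can reclaim fragments without pulling fresh messages
into the unexpected queue.

#include <stdio.h>
#include <stddef.h>

#define QDEPTH 256

typedef struct {
    void  *slots[QDEPTH];
    size_t head, tail;
} fifo_t;

static int fifo_push(fifo_t *f, void *item)
{
    size_t next = (f->tail + 1) % QDEPTH;
    if (next == f->head) return -1;          /* full */
    f->slots[f->tail] = item;
    f->tail = next;
    return 0;
}

static void *fifo_pop(fifo_t *f)
{
    void *item;
    if (f->head == f->tail) return NULL;     /* empty */
    item = f->slots[f->head];
    f->head = (f->head + 1) % QDEPTH;
    return item;
}

typedef struct {
    fifo_t incoming;   /* peer -> us: new data fragments destined for the PML */
    fifo_t returned;   /* peer -> us: our own send fragments coming back      */
} endpoint_t;

/* Receive path: drains only the incoming FIFO. */
static void *recv_one(endpoint_t *ep)
{
    return fifo_pop(&ep->incoming);
}

/* Send path: reclaims a fragment from the return FIFO when the free list
 * runs dry, without pulling any new messages off the incoming FIFO. */
static void *reclaim_send_frag(endpoint_t *ep)
{
    return fifo_pop(&ep->returned);
}

int main(void)
{
    static endpoint_t ep;                    /* zero-initialized: both FIFOs empty */
    static int data = 1, frag = 2;

    fifo_push(&ep.incoming, &data);
    fifo_push(&ep.returned, &frag);

    printf("reclaimed=%p received=%p\n", reclaim_send_frag(&ep), recv_one(&ep));
    return 0;
}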
So is the issue Eugene is describing that one rank is flooding the
other with so many messages that the flooded victim cannot see the
FRAG_ACKs without first draining the real (flooding) messages from the FIFO?
It seems like either having separate FIFOs, as Eugene describes
above, or instituting some type of flow control (a limit on the number
of in-flight messages allowed) might help.
--td
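A minimal sketch of the flow-control option, assuming a simple
per-peer cap on un-ACKed fragments; the names and the limit here are
invented, not actual sm BTL parameters.

#include <stdio.h>
#include <stdbool.h>

#define MAX_INFLIGHT 32          /* invented cap, not an actual sm BTL parameter */

typedef struct {
    int inflight;                /* fragments sent to this peer but not yet ACKed */
} peer_state_t;

/* False means the sender must progress the BTL (and so drain ACKs and
 * incoming messages) before pushing another fragment at this peer. */
static bool can_send(const peer_state_t *peer)
{
    return peer->inflight < MAX_INFLIGHT;
}

static void on_send(peer_state_t *peer) { ++peer->inflight; }
static void on_ack(peer_state_t *peer)  { --peer->inflight; }

int main(void)
{
    peer_state_t peer = { 0 };
    int sent = 0, i;

    for (i = 0; i < 100; ++i) {
        if (!can_send(&peer)) {
            on_ack(&peer);       /* stand-in for "poll the FIFO and find an ACK" */
            continue;
        }
        on_send(&peer);
        ++sent;
    }

    printf("sent %d fragments while never exceeding %d in flight\n",
           sent, MAX_INFLIGHT);
    return 0;
}

The point is that a sender which hits the cap is forced to progress
the BTL, and therefore to see incoming ACKs and messages, before it
can keep flooding the peer.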
But I'm under pressure to shift my attention to other activities. So,
I think I'm going to abandon this effort. The flow control problem
seems thorny. I can think of fixes as fast as I can identify
flow-control problems, but the rate of new flow-control problems just
doesn't seem to abate. Meanwhile, my unexpected-work queue grows
unbounded. :^)
I think the combination of Eugene's deadlock avoidance fix and the
careful queue draining should make me comfortable enough to start
another round of testing; at the very least, it explains the
bottom-line issues.