All -
Jeff, Eugene, and I had a long discussion this morning on the sm BTL flow
management issues and came to a couple of conclusions.
* Jeff, Eugene, and I are all convinced that Eugene's addition of polling
the receive queue to drain acks when sends start backing up is required
for deadlock avoidance.
* We're also convinced that George's proposal, while a good idea in
general, is not sufficient on its own. The send path doesn't appear to
progress the BTL enough to avoid the deadlocks we're seeing with the sm
BTL today. Therefore, while I still recommend sizing the FIFO
appropriately and limiting the freelist size, I don't think that alone
will solve all the problems.
* Finally, it took an hour, but we did determine one of the major
differences between 1.2.8 and 1.3.0 in terms of sm is how messages were
pulled off the FIFO. In 1.2.8 (and all earlier versions), we returned from
btl_progress after a single message was received (ack or message) or when
the FIFO was empty. In 1.3.0 (pre-srq work Eugene did), we changed to
completely draining all queues before returning from btl_progress. This
has led to a situation where a single call to btl_progress can make a
large number of callbacks into the PML (900,000 in one of Eugene's
test cases). The change was made to resolve an issue Terry was having with
the performance of a benchmark. We've decided that it would be advantageous
to try something between the two points and drain X number of messages
from the queue, then return, where X is 100 or so at most. This should
cover the performance issues Terry saw, but still not cause the huge
number of messages added to the unexpected queue with a single call to
MPI_Recv. Since a recv that is matched on the unexpected queue doesn't
result in a call to opal_progress, this should help balance the load a
little bit better. Eugene's going to take a stab at implementing this
short term.
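
To make the bounded drain in the last point concrete, here is a rough
sketch in C. It is not the actual sm BTL code: toy_fifo_t,
toy_btl_progress, toy_pml_callback, and the MAX_DRAIN value of 100 are all
made-up names and numbers for illustration, and the real FIFO is a
shared-memory structure rather than a local array.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_FIFO_CAP 1024   /* hypothetical FIFO capacity */
#define MAX_DRAIN    100    /* the "X" above; the exact value is a guess */

/* Toy single-process FIFO standing in for the shared-memory one. */
typedef struct {
    int    items[TOY_FIFO_CAP];
    size_t head;   /* index of the next item to read */
    size_t tail;   /* index of the next free slot to write */
} toy_fifo_t;

/* Returns 1 on success, 0 if the FIFO is full. */
static int toy_fifo_push(toy_fifo_t *f, int v)
{
    if (f->tail - f->head == TOY_FIFO_CAP) {
        return 0;
    }
    f->items[f->tail % TOY_FIFO_CAP] = v;
    f->tail++;
    return 1;
}

/* Returns 1 and stores the item in *v, or 0 if the FIFO is empty. */
static int toy_fifo_pop(toy_fifo_t *f, int *v)
{
    if (f->head == f->tail) {
        return 0;
    }
    *v = f->items[f->head % TOY_FIFO_CAP];
    f->head++;
    return 1;
}

/* Stand-in for the callback into the PML; it only counts calls. */
static int callbacks_made = 0;

static void toy_pml_callback(int frag)
{
    (void)frag;
    callbacks_made++;
}

/*
 * Bounded progress: handle at most max_drain items, then return,
 * even if more items are still queued. Returns the number handled.
 */
static int toy_btl_progress(toy_fifo_t *f, int max_drain)
{
    int handled = 0;
    int frag;

    assert(max_drain > 0);
    while (handled < max_drain && toy_fifo_pop(f, &frag)) {
        toy_pml_callback(frag);
        handled++;
    }
    return handled;
}
```

With 250 items queued, three calls to toy_btl_progress(&fifo, MAX_DRAIN)
handle 100, 100, and 50 items. Setting max_drain to 1 roughly models the
1.2.8 behavior, and a very large value roughly models 1.3.0.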
I think the combination of Eugene's deadlock avoidance fix and the
careful queue draining should make me comfortable enough to start another
round of testing, and it at least explains the bottom-line issues.
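
For anyone who wasn't in the discussion, the deadlock avoidance idea can
be sketched with a toy model in C. This is only an illustration under
stated assumptions, not Eugene's patch: toy_endpoint_t, toy_poll_acks, and
toy_send are hypothetical names, the limit of 4 fragments is arbitrary,
and the model assumes the peer acks every send immediately and that the
resource that runs out is a fragment returned by an ack.

```c
#include <assert.h>

#define TOY_FREE_FRAGS 4   /* hypothetical limit on outstanding sends */

/* Toy sender state. A fragment is only reusable once its ack is drained. */
typedef struct {
    int free_frags;     /* fragments currently available for sending */
    int pending_acks;   /* acks waiting, unread, in our receive queue */
} toy_endpoint_t;

/* Poll the receive queue: every drained ack returns one fragment. */
static int toy_poll_acks(toy_endpoint_t *ep)
{
    int drained = ep->pending_acks;

    ep->free_frags += drained;
    ep->pending_acks = 0;
    return drained;
}

/*
 * Try to send one fragment. Returns 1 on success, 0 if the send is stuck.
 * With drain_on_backup set, a backed-up send first polls the receive
 * queue for acks instead of giving up right away.
 */
static int toy_send(toy_endpoint_t *ep, int drain_on_backup)
{
    assert(ep->free_frags >= 0);

    if (ep->free_frags == 0 && drain_on_backup) {
        toy_poll_acks(ep);
    }
    if (ep->free_frags == 0) {
        return 0;   /* nothing to send with; caller would have to retry */
    }
    ep->free_frags--;
    ep->pending_acks++;   /* toy assumption: the peer acks right away */
    return 1;
}
```

In this model, a sender that never polls gets stuck on its fifth send even
though four acks are already waiting for it, while a sender that drains
acks when it backs up keeps going indefinitely.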
Brian