Brian W. Barrett wrote:
All -
Jeff, Eugene, and I had a long discussion this morning on the sm BTL
flow management issues and came to a couple of conclusions.
* Jeff, Eugene, and I are all convinced that Eugene's addition of
polling the receive queue to drain acks when sends start backing up is
required for deadlock avoidance.
* We're also convinced that George's proposal, while a good idea in
general, is not sufficient. The send path doesn't appear to
progress the BTL sufficiently to avoid the deadlocks we're seeing with
the sm BTL today. Therefore, while I still recommend sizing the fifo
appropriately and limiting the freelist size, I think it's not
sufficient to solve all problems.
* Finally, it took an hour, but we did determine one of the major
differences between 1.2.8 and 1.3.0 in terms of sm is how messages
were pulled off the FIFO. In 1.2.8 (and all earlier versions), we
returned from btl_progress after a single message was received (ack or
message) or the fifo was empty. In 1.3.0 (pre-srq work Eugene did),
we changed to completely draining all queues before returning from
btl_progress. This has led to a situation where a single call to
btl_progress can make a large number of callbacks into the PML
(900,000 times in one of Eugene's test cases). The change was made to
resolve an issue Terry was having with performance of a benchmark.
We've decided that it would be advantageous to try something between
the two points and drain up to X messages from the queue, then
return, where X is at most 100 or so. This should cover the
performance issues Terry saw, but still not cause the huge number of
messages added to the unexpected queue with a single call to
MPI_Recv. Since a recv that is matched on the unexpected queue
doesn't result in a call to opal_progress, this should help balance
the load a little bit better. Eugene's going to take a stab at
implementing this short term.
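
A minimal sketch of the bounded-drain idea from the last bullet, in plain C. It is not the sm BTL code: fifo_t, fifo_push, fifo_pop, bounded_progress and count_delivery are hypothetical names made up for illustration, and a small ring buffer stands in for the real FIFO. The point is only that the progress routine stops after max_drain deliveries even if more messages are queued.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ring buffer standing in for the sm FIFO (not Open MPI code). */
#define FIFO_SLOTS 1024

typedef struct {
    int    items[FIFO_SLOTS];
    size_t head;   /* next slot to read  */
    size_t tail;   /* next slot to write */
} fifo_t;

/* Returns 1 on success, 0 if the FIFO is full. */
int fifo_push(fifo_t *f, int item)
{
    size_t next = (f->tail + 1) % FIFO_SLOTS;
    if (next == f->head) {
        return 0;
    }
    f->items[f->tail] = item;
    f->tail = next;
    return 1;
}

/* Returns 1 and stores an item in *out, or 0 if the FIFO is empty. */
int fifo_pop(fifo_t *f, int *out)
{
    if (f->head == f->tail) {
        return 0;
    }
    *out = f->items[f->head];
    f->head = (f->head + 1) % FIFO_SLOTS;
    return 1;
}

/* Callback type standing in for the upcall into the upper layer. */
typedef void (*deliver_fn)(int item, void *ctx);

/* Example callback: counts deliveries in the int that ctx points to. */
void count_delivery(int item, void *ctx)
{
    (void)item;
    ++*(int *)ctx;
}

/*
 * Deliver at most max_drain items, then return so the caller can make
 * progress elsewhere. The limit is checked before popping, so no item
 * is removed from the FIFO without being delivered.
 * Returns the number of items delivered in this call.
 */
int bounded_progress(fifo_t *f, int max_drain, deliver_fn deliver, void *ctx)
{
    int n = 0;
    int item;

    assert(max_drain > 0);
    while (n < max_drain && fifo_pop(f, &item)) {
        deliver(item, ctx);
        n++;
    }
    return n;
}
```

With max_drain set to 1 this behaves like the 1.2.8 description above (one message per call); with a very large max_drain it approaches the 1.3.0 drain-everything behavior.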
I think the combination of Eugene's deadlock avoidance fix and the
careful queue draining should make me comfortable enough to start
another round of testing, and this at least explains the bottom-line issues.
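
For reference, the deadlock-avoidance idea (polling the receive side for acks when sends start backing up) might look roughly like the toy model below. This is a sketch under heavy assumptions, not Eugene's change: endpoint_t, drain_acks and try_send are hypothetical names, and the free list and ack queue are reduced to plain counters. It assumes each ack returns exactly one fragment to the sender's free list.

```c
#include <assert.h>

/* Hypothetical sender-side state; none of these names come from Open MPI. */
typedef struct {
    int free_frags;    /* fragments available on the bounded free list  */
    int pending_acks;  /* acks waiting in our receive queue, unread     */
} endpoint_t;

/*
 * Read at most max_acks acks. In this model each ack returns one
 * fragment to the free list. Returns the number of acks processed.
 */
int drain_acks(endpoint_t *ep, int max_acks)
{
    int n = 0;

    assert(max_acks > 0);
    while (n < max_acks && ep->pending_acks > 0) {
        ep->pending_acks--;
        ep->free_frags++;
        n++;
    }
    return n;
}

/*
 * Try to send one fragment.
 * Returns 1 if a fragment was sent, 0 if the caller must retry later.
 */
int try_send(endpoint_t *ep, int max_acks)
{
    if (ep->free_frags == 0) {
        /* Sends are backing up. Poll the receive side before giving up,
         * because the acks that would free fragments are waiting there. */
        drain_acks(ep, max_acks);
    }
    if (ep->free_frags == 0) {
        return 0;  /* still nothing free: peer has not acked yet */
    }
    ep->free_frags--;
    return 1;
}
```

Without the drain_acks call, two peers that have each exhausted their free lists and only ever call the send path would both wait forever on acks that neither side reads.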
Brian
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
IMHO, one should never process an unbounded number of elements from any
FIFO/socket/CQ/etc. because doing so risks starving other channels (some
of which might not exist yet at the time the work-without-bound code is
written). So, I think Brian's proposal (drain <= X; for 1 < X < inf) is
the correct approach, regardless of any of the other present concerns
w.r.t. the sm BTL.
In my own non-MPI experience, I have found that selection of such an X
is usually not a big deal - just find a value large enough to
effectively hide the cost of "entry" (analogy: if you hold a mutex the
critical section should be dominated by the work "inside", not the cost
of the lock/unlock operations). Once X is big enough that "entry" is
nominally free, then the type of performance issues I suspect Terry was
seeing will fade away. Beyond that point, further increases in X bring
rapidly diminishing returns in my experience, and risk starving some
other code path.
Crude heuristic: start at X=2 and keep doubling it until the performance of
the benchmark that concerned Terry at X and at X*2 is within a standard
deviation (difference is "in the noise"), or within some other tolerance of
one's choice. Then, of course, use the lower value, X (not X*2).
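
The doubling heuristic above can be sketched in C as follows. Everything here is illustrative: pick_drain_limit, measure_fn and toy_cost are hypothetical names, the tolerance stands in for "one standard deviation", and toy_cost is a made-up cost model (a fixed entry cost amortized over X, plus constant per-message work) rather than a real benchmark. The max_x cap is an added safeguard so the search terminates even if the measurements never settle.

```c
/* Callback that returns a benchmark measurement (lower is better) for a
 * given drain limit x. In practice this would run the real benchmark. */
typedef double (*measure_fn)(int x, void *ctx);

/*
 * Start at X=2 and keep doubling until the measurements at X and 2X
 * differ by no more than tolerance, then return the lower value X.
 * Stops doubling once 2X would exceed max_x.
 */
int pick_drain_limit(measure_fn measure, void *ctx, double tolerance, int max_x)
{
    int x = 2;
    double t_x = measure(x, ctx);

    while (x * 2 <= max_x) {
        double t_2x = measure(x * 2, ctx);
        double diff = t_x - t_2x;

        if (diff < 0) {
            diff = -diff;
        }
        if (diff <= tolerance) {
            return x;      /* X and 2X are "in the noise": keep X */
        }
        x *= 2;
        t_x = t_2x;        /* reuse the 2X measurement as the new X */
    }
    return x;
}

/* Made-up cost model: entry cost of 100 amortized over x messages,
 * plus 1 unit of work per message. Purely for demonstration. */
double toy_cost(int x, void *ctx)
{
    (void)ctx;
    return 100.0 / x + 1.0;
}
```

With toy_cost, a tolerance of 1.0 and max_x of 1024, pick_drain_limit settles at X=64, which shows the diminishing returns: each doubling halves the amortized entry cost until the change is smaller than the tolerance.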
-Paul
P.S. If there are other code paths that process elements without bound,
they probably deserve some scrutiny while this idea is fresh on people's
minds.
--
Paul H. Hargrove phhargr...@lbl.gov
Future Technologies Group Tel: +1-510-495-2352
HPC Research Department Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory