Re: [OMPI devel] sm BTL flow management

Paul H. Hargrove Thu, 25 Jun 2009 16:47:06 -0400

Brian W. Barrett wrote:

All -
Jeff, Eugene, and I had a long discussion this morning on the sm BTL flow management issues and came to a couple of conclusions.
* Jeff, Eugene, and I are all convinced that Eugene's addition of polling the receive queue to drain acks when sends start backing up is required for deadlock avoidance.
* We're also convinced that George's proposal, while a good idea in general, is not sufficient. The send path doesn't appear to sufficiently progress the btl to avoid the deadlocks we're seeing with the SM btl today. Therefore, while I still recommend sizing the fifo appropriately and limiting the freelist size, I think it's not sufficient to solve all problems.
* Finally, it took an hour, but we did determine one of the major differences between 1.2.8 and 1.3.0 in terms of sm is how messages were pulled off the FIFO. In 1.2.8 (and all earlier versions), we return from btl_progress after a single message is received (ack or message) or the fifo was empty. In 1.3.0 (pre-srq work Eugene did), we changed to completely draining all queues before returning from btl_progress. This has led to a situation where a single call to btl_progress can make a large number of callbacks into the PML (900,000 times in one of Eugene's test cases). The change was made to resolve an issue Terry was having with performance of a benchmark. We've decided that it would be advantageous to try something between the two points and drain X number of messages from the queue, then return, where X is 100 or so at most. This should cover the performance issues Terry saw, but still not cause the huge number of messages added to the unexpected queue with a single call to MPI_Recv. Since a recv that is matched on the unexpected queue doesn't result in a call to opal_progress, this should help balance the load a little bit better. Eugene's going to take a stab at implementing this short term.
I think the combination of Eugene's deadlock avoidance fix and the careful queue draining should make me comfortable enough to start another round of testing, and it at least explains the bottom-line issues.
Brian
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

IMHO, one should never process an unbounded number of elements from any FIFO/socket/CQ/etc. because doing so risks starving other channels (some of which might not exist yet at the time the work-without-bound code is written). So, I think Brian's proposal (drain <= X; for 1 < X < inf) is the correct approach, regardless of any of the other present concerns w.r.t. the sm btl.
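
To make the "drain <= X" idea concrete, here is a minimal sketch in C. It is not the sm BTL code: toy_fifo_t, toy_fifo_push, toy_fifo_pop, bounded_progress, and count_cb are hypothetical names invented for this illustration, and the FIFO is a plain single-threaded ring buffer. The only point is that the progress loop stops after max_drain elements even when more are queued.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_CAP 1024   /* hypothetical capacity, chosen only for this sketch */

/* Toy single-threaded ring buffer standing in for one receive FIFO. */
typedef struct {
    int    items[FIFO_CAP];
    size_t head;   /* count of elements popped so far */
    size_t tail;   /* count of elements pushed so far */
} toy_fifo_t;

/* Returns 1 on success, 0 if the FIFO is full. */
int toy_fifo_push(toy_fifo_t *f, int v)
{
    if (f->tail - f->head == FIFO_CAP) {
        return 0;
    }
    f->items[f->tail % FIFO_CAP] = v;
    f->tail++;
    return 1;
}

/* Returns 1 and stores the element in *out, or 0 if the FIFO is empty. */
int toy_fifo_pop(toy_fifo_t *f, int *out)
{
    if (f->head == f->tail) {
        return 0;
    }
    *out = f->items[f->head % FIFO_CAP];
    f->head++;
    return 1;
}

/* Example callback: just counts deliveries. */
size_t delivered = 0;
void count_cb(int v)
{
    (void)v;
    delivered++;
}

/*
 * Hand at most max_drain elements to cb, then return even if the FIFO
 * is not empty, so the caller can service other channels before coming
 * back. Returns the number of elements processed on this call.
 */
size_t bounded_progress(toy_fifo_t *f, size_t max_drain, void (*cb)(int))
{
    size_t n = 0;
    int v;

    assert(max_drain > 0);
    while (n < max_drain && toy_fifo_pop(f, &v)) {
        cb(v);
        n++;
    }
    return n;
}
```

With 250 queued elements and max_drain = 100, three successive calls process 100, 100, and 50 elements; the caller regains control between calls instead of after all 250.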

In my own non-MPI experience, I have found that selection of such an X is usually not a big deal - just find a value large enough to effectively hide the cost of "entry" (analogy: if you hold a mutex the critical section should be dominated by the work "inside", not the cost of the lock/unlock operations). Once X is big enough that "entry" is nominally free, then the type of performance issues I suspect Terry was seeing will fade away. Beyond that point, further increases in X bring rapidly diminishing returns in my experience, and risk starving some other code path.

Crude heuristic: start at X=2 and keep doubling it until the performance of the benchmark that concerned Terry is within a standard deviation (difference is "in the noise") at X and X*2 (or within some other tolerance of one's choice). Then, of course, use the lower value, X (not X*2).
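
The doubling search could be sketched roughly as follows. Everything here is an assumption made for illustration: pick_drain_limit and toy_bench are hypothetical names, the tolerance is passed in directly rather than derived from a measured standard deviation, x_max is an added safety cap, and toy_bench is a synthetic cost model (a fixed "entry" cost amortized over X elements), not a real benchmark.

```c
#include <stddef.h>

/*
 * Synthetic cost model, purely illustrative: a fixed "entry" cost of
 * 100 units amortized over x elements, plus 10 units of real work.
 */
double toy_bench(size_t x)
{
    return 100.0 / (double)x + 10.0;
}

/*
 * Start at X = 2 and double until bench(X) and bench(2X) differ by no
 * more than tolerance, then return the lower value X. If no such X is
 * found before 2X would exceed x_max, return the largest X reached.
 */
size_t pick_drain_limit(double (*bench)(size_t), double tolerance,
                        size_t x_max)
{
    size_t x = 2;
    double t_x = bench(x);

    while (x * 2 <= x_max) {
        double t_2x = bench(x * 2);
        double diff = t_x - t_2x;

        if (diff < 0.0) {
            diff = -diff;
        }
        if (diff <= tolerance) {
            return x;          /* X and 2X are "in the noise" */
        }
        x *= 2;
        t_x = t_2x;            /* reuse the measurement for the next step */
    }
    return x;
}
```

In a real run, bench would time the actual benchmark several times at each X, and the tolerance would come from the observed run-to-run variation.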


-Paul

P.S. If there are other code paths that process elements without bound, they probably deserve some scrutiny while this idea is fresh in people's minds.


--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900

Lawrence Berkeley National Laboratory
