Eugene Loh wrote:
Brian W. Barrett wrote:
All -
Jeff, Eugene, and I had a long discussion this morning on the sm BTL
flow management issues and came to a couple of conclusions.
* Jeff, Eugene, and I are all convinced that Eugene's addition of
polling the receive queue to drain acks when sends start backing up
is required for deadlock avoidance.
* We're also convinced that George's proposal, while a good idea in
general, is not sufficient. The send path doesn't appear to progress
the BTL enough to avoid the deadlocks we're seeing with the sm BTL
today. Therefore, while I still recommend sizing the FIFO
appropriately and limiting the free list size, I don't think that
alone will solve all the problems.
* Finally, it took an hour, but we did determine one of the major
differences between 1.2.8 and 1.3.0 in terms of sm is how messages
are pulled off the FIFO. In 1.2.8 (and all earlier versions), we
return from btl_progress after a single message is received (ACK or
message) or the FIFO is empty. In 1.3.0 (pre-srq work Eugene did),
we changed to completely draining all queues before returning from
btl_progress. This has led to a situation where a single call to
btl_progress can make a large number of callbacks into the PML
(900,000 times in one of Eugene's test cases). The change was made to
resolve an issue Terry was having with performance of a benchmark.
We've decided that it would be advantageous to try something between
the two points and drain at most X messages from the queue, then
return, where X is 100 or so. This should cover the performance
issues Terry saw while still avoiding the huge number of messages
being added to the unexpected queue by a single call to MPI_Recv.
Since a recv that is matched on the unexpected queue
doesn't result in a call to opal_progress, this should help balance
the load a little bit better. Eugene's going to take a stab at
implementing this short term.
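For concreteness, here is a minimal, self-contained C sketch of that
bounded-drain idea. The types and names below are invented for
illustration; this is not the actual sm BTL code, just the shape of
the change being proposed.

#include <stdio.h>
#include <stddef.h>

#define DRAIN_LIMIT 100          /* the "X is 100 or so" from the discussion */
#define FIFO_SIZE   4096

typedef enum { FRAG_ACK, FRAG_DATA } frag_type_t;
typedef struct { frag_type_t type; } frag_t;

typedef struct {
    frag_t *slots[FIFO_SIZE];    /* illustrative ring buffer, not the real sm FIFO */
    size_t  head, tail;
} fifo_t;

static int fifo_push(fifo_t *f, frag_t *frag)
{
    size_t next = (f->tail + 1) % FIFO_SIZE;
    if (next == f->head) return -1;          /* full */
    f->slots[f->tail] = frag;
    f->tail = next;
    return 0;
}

static frag_t *fifo_pop(fifo_t *f)
{
    frag_t *frag;
    if (f->head == f->tail) return NULL;     /* empty */
    frag = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_SIZE;
    return frag;
}

/* Handle at most DRAIN_LIMIT fragments per call: ACKs would return send
 * fragments to the free list, data fragments would be handed up to the
 * PML (here we only count them). */
static int sm_progress(fifo_t *fifo, int *acks, int *deliveries)
{
    int handled = 0;
    frag_t *frag;

    while (handled < DRAIN_LIMIT && (frag = fifo_pop(fifo)) != NULL) {
        if (FRAG_ACK == frag->type) ++*acks;
        else                        ++*deliveries;
        ++handled;
    }
    return handled;
}

int main(void)
{
    static fifo_t fifo;                      /* zero-initialized: empty */
    static frag_t data = { FRAG_DATA }, ack = { FRAG_ACK };
    int acks = 0, deliveries = 0, i;

    for (i = 0; i < 1000; ++i)
        fifo_push(&fifo, (0 == i % 4) ? &ack : &data);

    /* Each call handles at most DRAIN_LIMIT fragments, so a flooded rank
     * keeps returning to its caller instead of pushing everything onto
     * the unexpected queue in one shot. */
    while (sm_progress(&fifo, &acks, &deliveries) > 0)
        printf("progress: acks=%d deliveries=%d\n", acks, deliveries);

    return 0;
}

With DRAIN_LIMIT set to 1 this collapses to the 1.2.8-style behavior,
and with it set very large it approaches the 1.3.0 drain-everything
behavior; 100 or so is the middle ground being suggested.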
I checked with Terry and we can't really recover the history here.
Perhaps draining ACKs is good enough. After the first message, we can
return.
OK, recovering history here, not sure it matters though. First, the
performance issue George and I discussed and fixed is documented in
the thread http://www.open-mpi.org/community/lists/devel/2008/06/4158.php
As was mentioned, this was only to retrieve ACK packets and should not
have any bearing on expanding the unexpected queue. The original
change was r18724 and did not add the line 432 mentioned below.
That's a one-line change. Just comment out line 432 ("goto
recheck_peer;") in
https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/sm/btl_sm_component.c#432 .
Line 432 was introduced by r19309 to fix ticket #1378. However,
something more is going on, because Eugene's experiments show that
removing this line doesn't help reduce the number of unexpecteds.
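For anyone without the source handy, here is a rough, hypothetical C
sketch of the control-flow shape under discussion. The real progress
function in btl_sm_component.c is far more involved; only the
recheck_peer label and the goto correspond to the line 432 being
talked about, everything else is an invented stand-in.

#include <stdio.h>

#define NUM_PEERS 2

static int pending[NUM_PEERS] = { 5, 3 };    /* fragments waiting per peer */

static int fifo_read(int peer)               /* stand-in: 1 if a fragment was read */
{
    if (pending[peer] > 0) { --pending[peer]; return 1; }
    return 0;
}

static int sm_progress_once(void)
{
    int peer, count = 0;

    for (peer = 0; peer < NUM_PEERS; ++peer) {
recheck_peer:
        if (!fifo_read(peer))
            continue;                        /* this peer's FIFO is empty */
        ++count;                             /* a real BTL would hand the fragment
                                                to the PML callback here */
        goto recheck_peer;                   /* the "line 432" behavior: drain this
                                                peer completely before moving on;
                                                comment this out to return after one
                                                fragment per peer */
    }
    return count;
}

int main(void)
{
    printf("fragments handled in one progress call: %d\n", sm_progress_once());
    return 0;
}

With the goto in place this toy example handles all 8 queued fragments
in one progress call; with the goto commented out it handles one
fragment per peer (2 total) and returns, which is the 1.2.8-style
behavior being compared against.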
Problem is, that doesn't "fix" things. That is, my deadlock avoidance
stuff (hg workspace on milliways that I sent out a pointer to) seems
to be enough to, well, avoid deadlock, but unexpected-message queues
are still growing like mad, I think, even when sm progress returns
after the first message fragment is received (X=1). I think it's
even true if the max free-list size is capped at something small. I
*think* (but am too tired to "know") that the issue is that we poll
the FIFO often anyhow. We have to for sends, to reclaim fragments. We
have to for receives, to pull out messages of interest. Maybe things
would be better if we had one FIFO for incoming fragments and another
for returning fragments. We could poll the latter only when we needed
another fragment for sending.
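A minimal sketch of that split-FIFO idea, with made-up names and
structures (this is not the sm BTL's actual layout): new data
fragments and returned send fragments arrive on separate queues, so
the send path can reclaim fragments without pulling fresh messages
into the unexpected queue.

#include <stdio.h>
#include <stddef.h>

#define QDEPTH 256

typedef struct {
    void  *slots[QDEPTH];
    size_t head, tail;
} fifo_t;

static int fifo_push(fifo_t *f, void *item)
{
    size_t next = (f->tail + 1) % QDEPTH;
    if (next == f->head) return -1;          /* full */
    f->slots[f->tail] = item;
    f->tail = next;
    return 0;
}

static void *fifo_pop(fifo_t *f)
{
    void *item;
    if (f->head == f->tail) return NULL;     /* empty */
    item = f->slots[f->head];
    f->head = (f->head + 1) % QDEPTH;
    return item;
}

typedef struct {
    fifo_t incoming;   /* peer -> us: new data fragments destined for the PML */
    fifo_t returned;   /* peer -> us: our own send fragments coming back      */
} endpoint_t;

/* Receive path: drains only the incoming FIFO. */
static void *recv_one(endpoint_t *ep)
{
    return fifo_pop(&ep->incoming);
}

/* Send path: reclaims a fragment from the return FIFO when the free list
 * runs dry, without pulling any new messages off the incoming FIFO. */
static void *reclaim_send_frag(endpoint_t *ep)
{
    return fifo_pop(&ep->returned);
}

int main(void)
{
    static endpoint_t ep;                    /* zero-initialized: both FIFOs empty */
    static int data = 1, frag = 2;

    fifo_push(&ep.incoming, &data);
    fifo_push(&ep.returned, &frag);

    printf("reclaimed=%p received=%p\n", reclaim_send_frag(&ep), recv_one(&ep));
    return 0;
}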
So is the issue Eugene is describing that one rank is flooding the
other with so many messages that the flooded victim cannot see the
FRAG_ACKs without first draining the real (flooding) messages from the FIFO?
It seems like either having separate FIFOs, as Eugene describes
above, or instituting some type of flow control (a limit on the number
of in-flight messages allowed) might help.
--td
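A minimal sketch of the flow-control option, assuming a simple
per-peer cap on un-ACKed fragments; the names and the limit here are
invented, not actual sm BTL parameters.

#include <stdio.h>
#include <stdbool.h>

#define MAX_INFLIGHT 32          /* invented cap, not an actual sm BTL parameter */

typedef struct {
    int inflight;                /* fragments sent to this peer but not yet ACKed */
} peer_state_t;

/* False means the sender must progress the BTL (and so drain ACKs and
 * incoming messages) before pushing another fragment at this peer. */
static bool can_send(const peer_state_t *peer)
{
    return peer->inflight < MAX_INFLIGHT;
}

static void on_send(peer_state_t *peer) { ++peer->inflight; }
static void on_ack(peer_state_t *peer)  { --peer->inflight; }

int main(void)
{
    peer_state_t peer = { 0 };
    int sent = 0, i;

    for (i = 0; i < 100; ++i) {
        if (!can_send(&peer)) {
            on_ack(&peer);       /* stand-in for "poll the FIFO and find an ACK" */
            continue;
        }
        on_send(&peer);
        ++sent;
    }

    printf("sent %d fragments while never exceeding %d in flight\n",
           sent, MAX_INFLIGHT);
    return 0;
}

The point is that a sender which hits the cap is forced to progress
the BTL, and therefore to see incoming ACKs and messages, before it
can keep flooding the peer.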
But I'm under pressure to shift my attention to other activities. So,
I think I'm going to abandon this effort. The flow control problem
seems thorny. I can think of fixes as fast as I can identify
flow-control problems, but the rate of new flow-control problems just
doesn't seem to abate. Meanwhile, my unexpected-work queue grows
unbounded. :^)
I think the combination of Eugene's deadlock avoidance fix and the
careful queue draining should make me comfortable enough to start
another round of testing; at the very least, it explains the
bottom-line issues.