Re: [OMPI devel] sm BTL flow management

George Bosilca Fri, 26 Jun 2009 11:59:50 -0400

As Terry described, and based on the patch attached to the ticket on trac, the extra goto slipped into the commit by mistake. It belongs to a totally different patch for shared memory I'm working on. I'll remove it.


  george.


On Jun 26, 2009, at 06:52 , Terry Dontje wrote:

Eugene Loh wrote:
Brian W. Barrett wrote:
All -
Jeff, Eugene, and I had a long discussion this morning on the sm BTL flow management issues and came to a couple of conclusions.
* Jeff, Eugene, and I are all convinced that Eugene's addition of polling the receive queue to drain acks when sends start backing up is required for deadlock avoidance.
* We're also convinced that George's proposal, while a good idea in general, is not sufficient. The send path doesn't appear to sufficiently progress the btl to avoid the deadlocks we're seeing with the SM btl today. Therefore, while I still recommend sizing the fifo appropriately and limiting the freelist size, I think it's not sufficient to solve all problems.
* Finally, it took an hour, but we did determine that one of the major differences between 1.2.8 and 1.3.0 in terms of sm is how messages are pulled off the FIFO. In 1.2.8 (and all earlier versions), we returned from btl_progress after a single message was received (ack or message) or the fifo was empty. In 1.3.0 (pre-srq work Eugene did), we changed to completely draining all queues before returning from btl_progress. This has led to a situation where a single call to btl_progress can make a large number of callbacks into the PML (900,000 times in one of Eugene's test cases). The change was made to resolve an issue Terry was having with performance of a benchmark. We've decided that it would be advantageous to try something between the two points and drain X number of messages from the queue, then return, where X is 100 or so at most. This should cover the performance issues Terry saw, but still not cause the huge number of messages added to the unexpected queue with a single call to MPI_Recv. Since a recv that is matched on the unexpected queue doesn't result in a call to opal_progress, this should help balance the load a little bit better. Eugene's going to take a stab at implementing this short term.
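The "drain at most X messages, then return" idea in the last bullet can be sketched roughly as follows. This is only an illustration under stated assumptions: fifo_t, fifo_push, fifo_pop, and progress_bounded are hypothetical names built on a simple ring buffer, not the actual sm BTL data structures or the btl_progress implementation.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_CAP 1024

/* Hypothetical ring-buffer FIFO; not the real sm BTL structure. */
typedef struct {
    int items[FIFO_CAP];
    size_t head;   /* index of the next item to read */
    size_t count;  /* number of queued items */
} fifo_t;

/* Append one item; returns 1 on success, 0 if the FIFO is full. */
static int fifo_push(fifo_t *f, int item)
{
    if (f->count == FIFO_CAP) {
        return 0;
    }
    f->items[(f->head + f->count) % FIFO_CAP] = item;
    f->count++;
    return 1;
}

/* Remove the oldest item; returns 1 on success, 0 if the FIFO is empty. */
static int fifo_pop(fifo_t *f, int *item)
{
    if (f->count == 0) {
        return 0;
    }
    *item = f->items[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return 1;
}

/*
 * Drain at most max_drain items, invoking cb (if non-NULL) for each.
 * Returns the number of items handled. max_drain == 1 mimics the
 * 1.2.8 behavior described above; a very large max_drain mimics the
 * 1.3.0 "drain everything" behavior.
 */
static size_t progress_bounded(fifo_t *f, size_t max_drain, void (*cb)(int))
{
    size_t handled = 0;
    int item;

    assert(max_drain > 0);
    while (handled < max_drain && fifo_pop(f, &item)) {
        if (cb != NULL) {
            cb(item);
        }
        handled++;
    }
    return handled;
}
```

With 250 queued items and a cap of 100, successive calls would handle 100, 100, and then 50 items, so no single call makes more than 100 callbacks.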
I checked with Terry and we can't really recover the history here. Perhaps draining ACKs is good enough. After the first message, we can return.
Ok, recovering history here, not sure it matters though. First, the performance issue George and I discussed and fixed is documented in the thread http://www.open-mpi.org/community/lists/devel/2008/06/4158.php . As was mentioned, this was only to retrieve ack packets and should not have any bearing on expanding the unexpected queue. The original change was r18724 and did not add line 432 mentioned below.
That's a one-line change. Just comment out line 432 ("goto recheck_peer;") in https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/sm/btl_sm_component.c#432 .
Line 432 was introduced by r19309 to fix ticket #1378. However, something more is at hand, since Eugene's experiment shows that removing this line doesn't help reduce the number of unexpecteds.
Problem is, that doesn't "fix" things. That is, my deadlock avoidance stuff (hg workspace on milliways that I sent out a pointer to) seems to be enough to, well, avoid deadlock, but unexpected-message queues are still growing like mad, I think. Even when sm progress returns after the first message fragment is received. (X=1.) I think it's even true if the max free-list size is capped at something small. I *think* (but am too tired to "know") that the issue is we poll the FIFO often anyhow. We have to for sends to reclaim fragments. We have to for receives, to pull out messages of interest. Maybe things would be better if we had one FIFO for incoming fragments and another for returning fragments. We could poll the latter only when we needed another fragment for sending.
So is the issue Eugene is describing that one rank is flooding the other with so many messages that the flooded victim cannot see the FRAG_ACKs without draining the real (flooding) messages from the FIFO first?
It seems like either having separate FIFOs, as Eugene describes above, or instituting some type of flow control (number of inflight messages allowed) might help.
--td
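The two suggestions above (a separate channel for returned fragments, and a cap on in-flight messages) can be combined in a rough sketch like the one below. Everything here is an assumption made for illustration: peer_t, peer_post_ack, peer_reclaim, and peer_try_send are hypothetical names, and a plain counter stands in for a dedicated ack-return FIFO. It is not how the sm BTL is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-peer sender state; all names are illustrative. */
typedef struct {
    size_t max_inflight;  /* cap on unacknowledged fragments */
    size_t inflight;      /* fragments sent but not yet returned */
    size_t pending_acks;  /* stands in for a dedicated ack-return FIFO */
} peer_t;

/* Receiver side: hand one fragment back through the ack channel. */
static void peer_post_ack(peer_t *p)
{
    p->pending_acks++;
}

/* Sender side: reclaim every returned fragment. Only the ack channel
 * is polled, so no data messages have to be drained to see the acks. */
static size_t peer_reclaim(peer_t *p)
{
    size_t n = p->pending_acks;

    assert(n <= p->inflight);
    p->inflight -= n;
    p->pending_acks = 0;
    return n;
}

/* Try to send one fragment. Returns 1 on success, or 0 if the
 * in-flight cap is reached and the caller must back off. */
static int peer_try_send(peer_t *p)
{
    if (p->inflight >= p->max_inflight) {
        peer_reclaim(p);  /* poll for acks only when out of credit */
        if (p->inflight >= p->max_inflight) {
            return 0;
        }
    }
    p->inflight++;
    return 1;
}
```

With max_inflight set to 2, a third send is refused until the receiver returns a fragment, which bounds how far a flooding sender can get ahead of the receiver.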
But I'm under pressure to shift my attention to other activities. So, I think I'm going to abandon this effort. The flow control problem seems thorny. I can think of fixes as fast as I can identify flow-control problems, but the rate of new flow-control problems just doesn't seem to abate. Meanwhile, my unexpected-work queue grows unbounded. :^)
I think the combination of Eugene's deadlock avoidance fix and the careful queue draining should make me comfortable enough to start another round of testing, but at least this explains the bottom-line issues.
_______________________________________________
devel mailing list
[email protected]
http://www.open-mpi.org/mailman/listinfo.cgi/devel