On May 19, 2008, at 4:44 PM, Steve Wise wrote:
1. Posting more at the low watermark can lead to DoS-like behavior
when you have a fast sender and a slow receiver. This is exactly the
resource-exhaustion kind of behavior that a high-quality MPI
implementation is supposed to avoid -- we really should throttle the
sender somehow.
2. Resending ad infinitum simply eats up more bandwidth and takes
network resources (e.g., switch resources) away from other,
legitimate traffic -- particularly if the receiver doesn't dip into
the MPI layer for many hours. So yes, it *works*, but it's definitely
sub-optimal.
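(For reference, the "resend ad infinitum" behavior on the IB side is
just the RNR retry count chosen when the QP is moved to RTS -- a
rough sketch, not the actual OMPI code, and the other attribute
values are placeholders:)

    #include <string.h>
    #include <infiniband/verbs.h>

    /* Sketch: transition an RC QP to RTS with infinite RNR retries. */
    static int move_to_rts(struct ibv_qp *qp)
    {
        struct ibv_qp_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.qp_state      = IBV_QPS_RTS;
        attr.timeout       = 14;  /* local ACK timeout (placeholder) */
        attr.retry_cnt     = 7;
        attr.rnr_retry     = 7;   /* 7 == retry forever on RNR NAK   */
        attr.sq_psn        = 0;
        attr.max_rd_atomic = 1;

        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_TIMEOUT |
                             IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
                             IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC);
    }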
The SRQ low water mark is simply an API method that lets applications
try to never hit the "we're totally out of recv bufs" problem. That's
a tool that I think is needed for SRQ users no matter what flow
control method you use to try and avoid Jeff's #1 item above.
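For illustration, a minimal sketch of the verbs-level mechanism I
mean (assuming libibverbs; LOW_WATER and the repost policy are
placeholders, not values from this discussion):

    /* Sketch only: arm the SRQ low watermark and react to it. */
    #include <infiniband/verbs.h>

    #define LOW_WATER 16

    static int arm_srq_limit(struct ibv_srq *srq)
    {
        struct ibv_srq_attr attr = { .srq_limit = LOW_WATER };

        /* Generates IBV_EVENT_SRQ_LIMIT_REACHED once the number of
         * posted recvs drops below srq_limit; it is one-shot, so it
         * must be re-armed after every event. */
        return ibv_modify_srq(srq, &attr, IBV_SRQ_LIMIT);
    }

    static void handle_async_event(struct ibv_context *ctx,
                                   struct ibv_srq *srq)
    {
        struct ibv_async_event ev;

        if (ibv_get_async_event(ctx, &ev))
            return;
        if (ev.event_type == IBV_EVENT_SRQ_LIMIT_REACHED) {
            /* post another batch of recv buffers here, then re-arm */
            arm_srq_limit(srq);
        }
        ibv_ack_async_event(&ev);
    }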
If you had these buffers available, why didn't you post them when the
QP was created / this sender was added?
This mechanism *might* make sense if there was a sensible approach to
know when to remove the "additional" buffers posted to an SRQ due to
bursty traffic. But how do you know when that is?
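(To be concrete about "post them when the QP was created": something
along these lines at setup time -- sketch only, it assumes the
buffers are already registered, and the helper name is made up:)

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Pre-post a fixed pool of recv buffers to the SRQ at setup time
     * rather than trickling them in later. */
    static int prepost_recvs(struct ibv_srq *srq, void **bufs,
                             struct ibv_mr **mrs, uint32_t len, int n)
    {
        for (int i = 0; i < n; i++) {
            struct ibv_sge sge = {
                .addr   = (uintptr_t) bufs[i],
                .length = len,
                .lkey   = mrs[i]->lkey,
            };
            struct ibv_recv_wr wr = {
                .wr_id   = (uint64_t) i,
                .sg_list = &sge,
                .num_sge = 1,
            };
            struct ibv_recv_wr *bad;

            if (ibv_post_srq_recv(srq, &wr, &bad))
                return -1;
        }
        return 0;
    }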
And if you don't like the RNR retry/TCP retransmission approach,
which is bad for reason #2 (and because TCP will eventually give up
and reset the connection), then I think there needs to be some
OMPI-layer protocol to stop senders that are abusing the SRQ pool for
whatever reason (a sender that's too fast, a sleeping-beauty receiver
that never enters the OMPI layer, whatever).
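To show the shape of the idea, here's a toy sender-side gate -- this
is NOT an existing OMPI protocol, just a sketch, and the names
(send_frag, on_ctrl_msg, struct frag) are made up:

    #include <stdbool.h>

    struct frag { struct frag *next; /* ... payload ... */ };

    static bool peer_stopped;          /* set by a STOP control msg */
    static struct frag *deferred_head; /* held back until RESUME    */

    /* Called when a STOP or RESUME control message arrives. */
    void on_ctrl_msg(bool stop)
    {
        peer_stopped = stop;
        if (!stop) {
            /* drain deferred_head through the normal send path here */
        }
    }

    /* Normal send path: defer instead of send while the peer says STOP. */
    int send_frag(struct frag *f)
    {
        if (peer_stopped) {
            f->next = deferred_head;
            deferred_head = f;
            return 0;       /* deferred, not sent */
        }
        /* ... hand f to the real send path here ... */
        return 0;
    }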
Such a protocol implies a progress thread. If/when we add a progress
thread, it will likely be for progressing long messages. Myricom and
MVAPICH have shown that rapidly firing progress threads are
problematic for performance. But even if you have that progress
thread *only* wake up on the low watermark for the SRQ, you have two
problems:
- there still could be many inbound messages that will overflow the
SRQ and/or even more could be inbound by the time your STOP message
gets to everyone (gets even worse as the MPI job scales up in total
number of processes)
- in the case of a very large MPI job, sending the STOP message has
obvious scalability problems (have to send it to everyone, which
requires its own set of send buffers and WQEs/CQEs)
--
Jeff Squyres
Cisco Systems