On May 19, 2008, at 4:44 PM, Steve Wise wrote:
1. Posting more at the low watermark can lead to DoS-like behavior
when you have a fast sender and a slow receiver. This is exactly the
resource-exhaustion kind of behavior that a high-quality MPI
implementation is supposed to avoid -- we really should throttle the
sender somehow.
2. Resending ad infinitum simply eats up more bandwidth and takes
network resources (e.g., switch resources) away from other,
legitimate traffic -- particularly if the receiver doesn't dip into
the MPI layer for many hours. So yes, it *works*, but it's definitely
sub-optimal.
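(For reference, the "resend ad infinitum" behavior on the IB side is
just the RNR retry count chosen when the QP is moved to RTS -- a
rough sketch, not the actual OMPI code, and the other attribute
values are placeholders:)

    #include <string.h>
    #include <infiniband/verbs.h>

    /* Sketch: transition an RC QP to RTS with infinite RNR retries. */
    static int move_to_rts(struct ibv_qp *qp)
    {
        struct ibv_qp_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.qp_state      = IBV_QPS_RTS;
        attr.timeout       = 14;  /* local ACK timeout (placeholder) */
        attr.retry_cnt     = 7;
        attr.rnr_retry     = 7;   /* 7 == retry forever on RNR NAK   */
        attr.sq_psn        = 0;
        attr.max_rd_atomic = 1;

        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_TIMEOUT |
                             IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
                             IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC);
    }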
The SRQ low water mark is simply an API method that lets applications
try to never hit the "we're totally out of recv bufs" problem. That's
a tool that I think is needed for SRQ users no matter what flow
control method you use to try and avoid Jeff's #1 item above.
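For illustration, a minimal sketch of the verbs-level mechanism I
mean (assuming libibverbs; LOW_WATER and the repost policy are
placeholders, not values from this discussion):

    /* Sketch only: arm the SRQ low watermark and react to it. */
    #include <infiniband/verbs.h>

    #define LOW_WATER 16

    static int arm_srq_limit(struct ibv_srq *srq)
    {
        struct ibv_srq_attr attr = { .srq_limit = LOW_WATER };

        /* Generates IBV_EVENT_SRQ_LIMIT_REACHED once the number of
         * posted recvs drops below srq_limit; it is one-shot, so it
         * must be re-armed after every event. */
        return ibv_modify_srq(srq, &attr, IBV_SRQ_LIMIT);
    }

    static void handle_async_event(struct ibv_context *ctx,
                                   struct ibv_srq *srq)
    {
        struct ibv_async_event ev;

        if (ibv_get_async_event(ctx, &ev))
            return;
        if (ev.event_type == IBV_EVENT_SRQ_LIMIT_REACHED) {
            /* post another batch of recv buffers here, then re-arm */
            arm_srq_limit(srq);
        }
        ibv_ack_async_event(&ev);
    }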
If you had these buffers available, why didn't you post them when the
QP was created / this sender was added?
This mechanism *might* make sense if there was a sensible approach to
know when to remove the "additional" buffers posted to an SRQ due to
bursty traffic. But how do you know when that is?
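(To be concrete about "post them when the QP was created": something
along these lines at setup time -- sketch only, it assumes the
buffers are already registered, and the helper name is made up:)

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Pre-post a fixed pool of recv buffers to the SRQ at setup time
     * rather than trickling them in later. */
    static int prepost_recvs(struct ibv_srq *srq, void **bufs,
                             struct ibv_mr **mrs, uint32_t len, int n)
    {
        for (int i = 0; i < n; i++) {
            struct ibv_sge sge = {
                .addr   = (uintptr_t) bufs[i],
                .length = len,
                .lkey   = mrs[i]->lkey,
            };
            struct ibv_recv_wr wr = {
                .wr_id   = (uint64_t) i,
                .sg_list = &sge,
                .num_sge = 1,
            };
            struct ibv_recv_wr *bad;

            if (ibv_post_srq_recv(srq, &wr, &bad))
                return -1;
        }
        return 0;
    }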
And if you don't like the RNR retry/TCP retransmission approach,
which is bad for reason #2 (and because TCP will eventually give up
and reset the connection), then I think there needs to be some
OMPI-layer protocol to stop senders that are abusing the SRQ pool for
whatever reason (a sender that's too fast, a sleeping-beauty receiver
that never enters the OMPI layer, whatever).
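To show the shape of the idea, here's a toy sender-side gate -- this
is NOT an existing OMPI protocol, just a sketch, and the names
(send_frag, on_ctrl_msg, struct frag) are made up:

    #include <stdbool.h>

    struct frag { struct frag *next; /* ... payload ... */ };

    static bool peer_stopped;          /* set by a STOP control msg */
    static struct frag *deferred_head; /* held back until RESUME    */

    /* Called when a STOP or RESUME control message arrives. */
    void on_ctrl_msg(bool stop)
    {
        peer_stopped = stop;
        if (!stop) {
            /* drain deferred_head through the normal send path here */
        }
    }

    /* Normal send path: defer instead of send while the peer says STOP. */
    int send_frag(struct frag *f)
    {
        if (peer_stopped) {
            f->next = deferred_head;
            deferred_head = f;
            return 0;       /* deferred, not sent */
        }
        /* ... hand f to the real send path here ... */
        return 0;
    }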
Such a protocol implies a progress thread. If/when we add a progress
thread, it will likely be for progressing long messages. Myricom and
MVAPICH have shown that rapidly firing progress threads are
problematic for performance. But even if you have that progress
thread *only* wake up on the low watermark for the SRQ, you have two
problems:
- there still could be many inbound messages that will overflow the
SRQ and/or even more could be inbound by the time your STOP message
gets to everyone (gets even worse as the MPI job scales up in total
number of processes)
- in the case of a very large MPI job, sending the STOP message has
obvious scalability problems (have to send it to everyone, which
requires its own set of send buffers and WQEs/CQEs)
--
Jeff Squyres
Cisco Systems