Okay - I see that, but it's independent of the code I've patched. If an RMPP 
transfer is requested it won't be affected by this change; it will continue to 
use the existing algorithm, which appears to use a maximum 2-second timeout 
for the first try of each segment (mad_rmpp.c).

Reviewing the existing code, I think that if a segment is retried it will use 
the timeout value provided by the caller; in the patched version it would use 
the caller's value plus the randomized backoff. So the risk I see is that 
total_timeout could expire while we are still sending RMPP packets, if each 
segment experiences timeouts.
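To make the risk concrete, here is a rough sketch of the kind of per-retry 
computation I mean (the names and constants are mine for illustration, not the 
identifiers or arithmetic in the actual patch):

```c
#include <stdlib.h>

/* Hypothetical per-retry timeout: exponential backoff on the caller's
 * value plus a random delay, as in the patched behavior described above. */
static unsigned int retry_timeout_ms(unsigned int base_timeout_ms,
                                     unsigned int retry)
{
	/* Double the caller's timeout for each retry... */
	unsigned int backoff = base_timeout_ms << retry;

	/* ...and add up to 25% random jitter to spread out retransmits. */
	unsigned int jitter = rand() % (backoff / 4 + 1);

	return backoff + jitter;
}
```

If each of a message's segments goes through several of these retries, the sum 
of those values can easily blow past the caller's total_timeout.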

On the subject of ignoring the timeout value passed in by the caller - back in 
June we had talked about a model where the caller specifies a total time to 
wait, regardless of how many retries are involved. I still think that idea has 
some merit; it gives the developer some control over how long they will be 
waiting.
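As a sketch of that model (hypothetical helper, not a proposed API), the stack 
would derive the per-try timeout from the caller's overall deadline:

```c
/* Sketch of the "total time" model discussed in June: the caller gives
 * one overall deadline and the stack splits it across the initial send
 * plus `retries` retransmits, so no retry sequence can exceed it.
 * All names here are hypothetical. */
static unsigned int per_try_timeout_ms(unsigned int total_timeout_ms,
                                       unsigned int retries)
{
	return total_timeout_ms / (retries + 1);
}
```

The point is just that the caller's number becomes an upper bound on the total 
wait, however many retries the stack decides to make.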

-----Original Message-----
From: Hefty, Sean [mailto:[email protected]] 
Sent: Wednesday, November 03, 2010 12:03 PM
To: Mike Heinz; Hal Rosenstock
Cc: [email protected]; Todd Rimmer
Subject: RE: [PATCH v2] Add exponential backoff + random delay to MADs when 
retrying after timeout.

> Hal asked the same thing - and I'm confused because I thought
> that if receiving an RMPP response times out, the entire transaction is
> aborted.

RMPP still uses retries.  If the user specifies a timeout of 1 second, with 3 
retries, _each_ RMPP window will be retried up to 3 times, waiting for an ACK.  
Once an ACK is received, the next window can be retried up to 3 times, with a 
1-second timeout per ACK, and so on.  It looks like your patch increments the 
timeout, and the increment is maintained across windows.
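In other words, the worst-case wait scales with the number of windows.  As a 
back-of-the-envelope sketch (a hypothetical helper, not actual mad_rmpp.c 
code):

```c
/* Worst-case wait for an RMPP transfer when every send times out:
 * each window gets an initial send plus `retries` retransmits, and
 * each of those waits up to timeout_ms for an ACK. */
static unsigned int worst_case_wait_ms(unsigned int windows,
                                       unsigned int retries,
                                       unsigned int timeout_ms)
{
	return windows * (retries + 1) * timeout_ms;
}
```

So a 10-window transfer with 3 retries and a 1-second timeout could, in the 
worst case, take on the order of 40 seconds before failing.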

> I'm open to this, but do we really need TCP/IP level congestion
> control? How many nodes are likely to have more than a few SA
> queries outstanding at a time?

With large MPI job startup, we could have hundreds or thousands of SA queries 
issued from a single node.  Even if the number of requests per node is small, 
the intent is to have all nodes back off from flooding the SA.  So, I would 
say, yes, we want something like TCP congestion control.  A delayed response 
seems more likely to be the result of the SA being flooded with requests than 
of an actual packet being dropped.

This would also allow a node to delay sending any SA query after receiving a 
busy response to one.  Caching data can help here, but we still have to get 
the data from the SA in the first place, and we still need to handle errors, 
topology changes, QoS, etc.

- Sean
