Okay - I see that, but it's independent of the code I've patched. Basically, if an RMPP transaction is requested it won't be affected by this change; it will continue to use the existing algorithm, which appears to use a maximum 2-second timeout for the first try of each segment (mad_rmpp.c).
Reviewing the code: under the existing code, I think that if a segment is retried it will use the value that was provided by the caller; in the patched version it would be the caller's value plus the randomization algorithm. So, the risk I see is that total_timeout could expire while RMPP packets are still being sent, if each segment experiences timeouts.

On the subject of ignoring the timeout value passed in by the caller - back in June we had talked about a model where the caller specifies a total time to wait, regardless of how many retries are involved. I still think that idea has some merit; it still gives the developer some control over how long they will be waiting.

-----Original Message-----
From: Hefty, Sean [mailto:[email protected]]
Sent: Wednesday, November 03, 2010 12:03 PM
To: Mike Heinz; Hal Rosenstock
Cc: [email protected]; Todd Rimmer
Subject: RE: [PATCH v2] Add exponential backoff + random delay to MADs when retrying after timeout.

> Hal asked the same thing - and I'm confused because I thought
> that if receiving an RMPP response times out, the entire transaction is
> aborted.

RMPP still uses retries. If the user specifies a timeout of 1 second, with 3 retries, _each_ RMPP window will be retried up to 3 times, waiting for an ACK. Once an ACK is received, the next window can be retried up to 3 times, with a 1-second timeout per ACK, and so on. It looks like your patch increments the timeout, and the increment is maintained across windows.

> I'm open to this, but do we really need TCP/IP level congestion
> control? How many nodes are likely to have more than a few SA
> queries outstanding at a time?

With large MPI job startup, we could have hundreds or thousands of SA queries issued from a single node. Even if the number of requests per node is small, the intent is to have all nodes back off from flooding the SA. So, I would say yes, we want something like TCP congestion control.
A delay in a response seems more likely to be the result of the SA being flooded with requests than of an actual packet being dropped. This would also allow a node to delay sending any SA query after receiving a busy response to one. Caching data can help here, but we still have to get the data from the SA in the first place, plus still be able to handle errors, topology changes, QoS, etc.

- Sean
