> The problem with this approach is that if the same application or ULP is
> installed on many hundreds (or thousands) of nodes, all using the same
> retry interval, they could all end up retrying at roughly the same time,
> causing repeatable packet storms.  On a large cluster, these storms can
> effectively act as a denial of service attack.  To get around this, the
> retry timer should have a randomization component of a similar order of
> magnitude as the retries themselves.  Since retries are usually on the
> order of one second, the patch defines the randomization component as
> between zero and roughly 1/2 second (511 ms), although the upper limit
> can be tuned by changing a #define.
>
> The other standard method for preventing storms of retries is to
> implement an exponential backoff, such as is used in the Ethernet
> protocol.  However, because the user has also explicitly specified a
> timeout value, I chose to treat that value as a minimum delay, then I
> add an exponential value on top of that, defined as BASE*2^c, where 'c'
> is the number of retries already attempted, minus 1.
>
> Currently, the base value is defined as 511 ms (1/2 second), so that the
> retry interval is defined as:
>
>     (minimum timeout) + (511 << c) - (random value between 0 & 511)
>
> This causes the following retry times:
>
>     0: minimum timeout
>     1: minimum timeout + (random value between 0 & 511)
>     2: minimum timeout + 1 second - (random value between 0 & 511)
>     3: minimum timeout + 2 seconds - (random value between 0 & 511)
>     4: minimum timeout + 4 seconds - (random value between 0 & 511)
When you consider RMPP, the timeout/retry values specified by the user are
not straightforward in their meaning.  I haven't looked at this patch in
detail yet, but how do the timeout changes work with RMPP MADs?  Is the
timeout reset to the minimum after an ACK is received?

My personal preference at this time is to push more intelligence into the
timeout/retry algorithm used by the MAD layer, but restricted to SA
clients.  I'd like to see even more randomization in the retry time,
coupled with a TCP-like congestion windowing implementation when issuing
SA queries.  For example:

Never allow more than, say, 8 SA queries outstanding at a time.  If an SA
query times out, reduce the number of outstanding queries to 1 until we
get a response, then double the number of queries allowed to be
outstanding until we reach the max.  Have the MAD layer calculate the SA
query timeout based on the actual SA response time, with randomization
based on that.  The user-specified timeout value can basically be ignored.

The only reason I'm suggesting we restrict the algorithm to SA queries is
to avoid storing per-endpoint information.  That may be better handled by
the CM (since CM responses are sends).

Given all this, I think it would be okay to accept the patch to drop busy
responses from the SA until this framework is in place, which wouldn't be
until 2.6.38 or 39.

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
