Hal,

At the bottom of this is a slight rewrite of my previous email (and a tweak to the patch) to address your concerns and make things clearer. Other items are answered inline.
> What experience/confidence is there in this (specific) randomization
> policy? On what (how large) IB cluster sizes has this policy been tried?
> Is this specific policy modeled from other policies in use elsewhere?

To explicitly discuss this: the old Infinicon stack added 1 second on each successive retry, but didn't randomize. I modeled this algorithm after the Ethernet model, but I chose the terms to be on the same order of magnitude as we typically use for MAD timeouts. I can't claim to have any special experience showing this particular policy is best, except to say that the principles are sound.

> Also, is this randomized timeout used on RMPP packets if this parameter
> is not 0?

If the module parameter is non-zero, then yes, it will coerce all timeouts for all MAD requests to be randomized. Keep in mind that this code doesn't change how packets are processed when they time out; it just changes how the timeout is calculated.

>> Finally, I've added a module parameter to coerce all mad work requests
>> to use this feature if desired.

> On one hand, I don't want to introduce unneeded parameters/complexity,
> but I'm wondering whether more granularity is useful on which requests
> (classes?) this applies to. For example, should SM requests be
> randomized? This feature is primarily an SA thing, although BUSY can be
> used for other management classes, but its use is mainly GS related.

First, I think we should separate this from the BUSY handling issue - not because they aren't connected, but because every time I start focusing on these things I promptly get yanked onto something else. Hopefully we can focus on just the randomization aspect and bring it to a satisfactory agreement first; then I'll re-submit the BUSY handling patch based on that.

That said, there's been some argument over whether the best place for choosing the retry policy is in ib_mad or in the individual ULPs and apps.
The intent of the module parameter is to provide relief on larger clusters while waiting for the authors of other components to modify their models. I also think randomizing on retry is just as applicable for SM requests as for SA - if requests are timing out, then the SA/SM is getting overloaded, regardless of the type of request.

-----------------------------
Design notes:

This patch builds upon a discussion we had earlier this year about adding a backoff function when retrying MAD sends after a timeout.

The current behavior is to retry MAD requests at a fixed interval, specified by the caller, and no more than the number of times specified by the caller. The problem with this approach is that if the same application or ULP is installed on many hundreds (or thousands) of nodes, all using the same retry interval, they can all end up retrying at roughly the same time, causing repeatable packet storms. On a large cluster, these storms can effectively act as a denial-of-service attack.

To get around this, the retry timer should have a randomization component of a similar order of magnitude as the retries themselves. Since retries are usually on the order of one second, the patch defines the randomization component as between zero and roughly 1/2 second (511 ms), although the upper limit can be tuned by changing a #define.

The other standard method for preventing storms of retries is to implement an exponential backoff, such as is used in the Ethernet protocol. However, because the user has also explicitly specified a timeout value, I chose to treat that value as a minimum delay, then add an exponential value on top of it, defined as BASE*2^c, where 'c' is the number of retries already attempted, minus 1.
Currently, the base value is defined as 511 ms (1/2 second), so the retry interval is:

    (minimum timeout) + (511 << c) - (random value between 0 & 511)

This produces the following retry times:

    0: minimum timeout
    1: minimum timeout + (random value between 0 & 511)
    2: minimum timeout + 1 second  - (random value between 0 & 511)
    3: minimum timeout + 2 seconds - (random value between 0 & 511)
    4: minimum timeout + 4 seconds - (random value between 0 & 511)
    .
    .
    .
    c: minimum timeout + (1/2 second)*2^c - (random value between 0 & 511)

(For comparison, the old Silverstorm/Infinicon stack waited 1 second * the number of retries.)

------------------------
Implementation:

This patch does NOT implement the ABI/API changes that would be needed to take advantage of the new features, but it lays the groundwork for doing so. In addition, it provides a new module parameter that allows the administrator to coerce existing code into using the new capability:

    parm: randomized_wait: When true, use a randomized backoff algorithm to control retries for timeouts. (int)

Note that this parameter will not force retries if the caller specified 0 retries.

Next, I've added a new field called "randomized_wait" to the ib_mad_send_buf structure. If this field is set, each time the WR times out, the timeout for the next retry is set to:

    send_wr->timeout_ms + (511 << send_wr->retries) - (random32() & 511)

In other words, on the first retry, the randomization code will add between 0 and 1/2 second to the timeout. On the second retry, it will add between 0.5 and 1.0 seconds; on the 3rd, between 1.5 and 2 seconds; on the 4th, between 3.5 and 4; et cetera.

In addition, a new field, total_timeout, has been added to ib_mad_send_wr_private. My plan is that if the caller specifies randomized retries, total_timeout will be set to send_wr->timeout, and send_wr->timeout will be set to the base (default) timeout as its initial value.
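To make the arithmetic concrete, here is a minimal user-space sketch of the timeout calculation described above. It is not the kernel patch itself: BACKOFF_BASE_MS, mad_random(), and next_retry_timeout_ms() are illustrative names, and rand() stands in for the kernel's random32().

```c
#include <stdint.h>
#include <stdlib.h>

/* Mirrors the 511 ms base #define described above.  Because 511 is
 * 0x1ff (all ones), it doubles as the jitter mask below. */
#define BACKOFF_BASE_MS 511u

/* Stand-in for the kernel's random32(). */
static uint32_t mad_random(void)
{
	return (uint32_t)rand();
}

/* retries = number of retries already attempted (0 when computing the
 * delay for the first retry).  Returns the next retry delay in ms:
 *
 *     timeout_ms + (511 << retries) - (random value in [0, 511])
 */
static uint32_t next_retry_timeout_ms(uint32_t timeout_ms, unsigned int retries)
{
	uint32_t backoff = BACKOFF_BASE_MS << retries;
	uint32_t jitter  = mad_random() & BACKOFF_BASE_MS;	/* 0..511 */

	return timeout_ms + backoff - jitter;
}
```

With a 1000 ms caller timeout this yields 1000-1511 ms before the first retry and 1511-2022 ms before the second, matching the table above.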
If randomized_wait is set, total_timeout will instead be set to (send_wr->timeout * send_wr->max_retries). In either case, retries cannot exceed this total time, even if that means a lower number of retry attempts.
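A sketch of the total_timeout cap just described, under the same caveat: the struct below is a simplified stand-in, not the real ib_mad_send_wr_private, and init_total_timeout()/may_retry() are hypothetical helper names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the relevant ib_mad_send_wr_private fields. */
struct mad_send_wr {
	uint32_t timeout_ms;	/* caller-supplied per-try timeout */
	uint32_t max_retries;	/* caller-supplied retry limit */
	uint32_t total_timeout;	/* overall retry budget, in ms */
	uint32_t elapsed_ms;	/* time already spent waiting */
};

/* With randomized_wait set, the overall budget is the caller's timeout
 * multiplied by the retry count. */
static void init_total_timeout(struct mad_send_wr *wr)
{
	wr->total_timeout = wr->timeout_ms * wr->max_retries;
}

/* A retry is allowed only if its delay still fits in the budget, even
 * if fewer than max_retries attempts have been made so far. */
static bool may_retry(const struct mad_send_wr *wr, uint32_t next_delay_ms)
{
	return wr->elapsed_ms + next_delay_ms <= wr->total_timeout;
}
```

For example, with timeout_ms = 1000 and max_retries = 3, the budget is 3000 ms; once 2500 ms have elapsed, a 400 ms retry still fits but a 600 ms one does not.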
randomized_mad_timeout.patch
