Hal,

At the bottom of this is a slight rewrite of my previous email (and a tweak to 
the patch) to address your concerns and to make things more clear. Other items 
are answered inline.

>What experience/confidence is there in this (specific) randomization 
>policy ? On what (how large) IB cluster sizes has this policy been tried 
>? Is this specific policy modeled from other policies in use elsewhere ?

To explicitly discuss this: The old Infinicon stack added 1 second on each 
successive retry, but didn't randomize. I modeled this algorithm after the 
Ethernet model but I chose the terms to be on the same order of magnitude as we 
typically use for MAD timeouts. I can't claim to have any special experience 
showing this particular policy is best except to say that the principles are 
sound.

>Also, is this randomized timeout used on RMPP packets if this parameter 
> is not 0 ?

If the module parameter is non-zero then yes, it will coerce all timeouts for 
all MAD requests to be randomized. Keep in mind that this code doesn't change 
how packets are processed when they time out; it just changes how the timeout 
is calculated.

>> Finally, I've added a module parameter to coerce all mad work requests to 
>> use this feature if desired.

>On one hand, I don't want to introduce unneeded parameters/complexity 
>but I'm wondering whether more granularity is useful on which requests 
>(classes ?) this applies to. For example, should SM requests be 
>randomized ? This feature is primarily an SA thing although busy can be 
>used for other management classes but its use is mainly GS related.

First, I think we should separate this from the BUSY handling issue - not 
because they aren't connected but because every time I start focusing on these 
things I promptly get yanked onto something else. Hopefully we can focus on 
just the randomization aspect and bring it to a satisfactory agreement first, 
then I'll re-submit the BUSY handling patch based on that. 

That said, there's been some argument over whether the best place for choosing 
the retry policy is in ib_mad or in the individual ulps and apps. The intent of 
the module parameter is to provide relief on larger clusters while waiting for 
the authors of other components to modify their models. I do also think 
randomizing on retry is just as applicable for SM requests as for SA - if 
requests are timing out, then the SA/SM is getting overloaded, regardless of 
the type of request.


-----------------------------
Design notes:

This patch builds upon a discussion we had earlier this year on adding a
backoff function when retrying MAD sends after a timeout.

The current behavior is to retry MAD requests at a fixed interval, specified by
the caller, and no more than the number of times specified by the caller.

The problem with this approach is that if the same application or ulp is
installed on many hundreds (or thousands) of nodes, all using the same retry
interval, they could all end up retrying at roughly the same time, causing
repeatable packet storms. On a large cluster, these storms can effectively act
as a denial-of-service attack. To get around this, the retry timer should have
a randomization component on the same order of magnitude as the retry interval
itself. Since retries are usually on the order of one second, the patch
defines the randomization component as between zero and roughly 1/2 second
(511 ms), although the upper limit can be tuned by changing a #define.

The other standard method for preventing retry storms is an exponential
backoff, such as is used in the Ethernet protocol. However, because the user
has also explicitly specified a timeout value, I chose to treat that value as
a minimum delay, then add an exponential value on top of it, defined as
BASE*2^c, where 'c' is the number of retries already attempted (i.e., the
retry number minus one).

Currently, the base value is defined as 511 ms (roughly 1/2 second), so that
the retry interval is defined as:

(minimum timeout) + (511 ms << c) - (random value between 0 & 511)

This causes the following retry times:

0:      minimum timeout
1:      minimum timeout + (random value between 0 & 511)
2:      minimum timeout + 1 second - (random value between 0 & 511)
3:      minimum timeout + 2 seconds - (random value between 0 & 511)
4:      minimum timeout + 4 seconds - (random value between 0 & 511)
.
.
.
n:      minimum timeout + (1/2 second)*2^(n-1) - (random value between 0 & 511)

(For comparison, the old Silverstorm/Infinicon stack waited 1 second *
the number of retries.)

------------------------

Implementation:

This patch does NOT implement the ABI/API changes that would be needed to take
advantage of the new features, but it lays the groundwork for doing so. In
addition, it provides a new module parameter that allows the administrator to
coerce existing code into using the new capability:

parm: randomized_wait: When true, use a randomized backoff algorithm to control
retries for timeouts. (int)

Note that this parameter will not force retries if the caller specified 0
retries.

Next, I've added a new field called "randomized_wait" to the ib_mad_send_buf
structure. If this field is set, each time the WR times out, the timeout
for the next retry is set to

send_wr->timeout_ms + (511 << send_wr->retries) - (random32() & 511)

In other words, on the first retry, the randomization code will add between 0
and 1/2 second to the timeout. On the second retry, it will add between 0.5 and
1.0 seconds to the timeout, on the 3rd, between 1.5 and 2 seconds, on the 4th,
between 3.5 and 4 seconds, et cetera. In addition, a new field, total_timeout,
has been added to ib_mad_send_wr_private. My plan is that if the caller 
explicitly specifies randomized retries, total_timeout will be set to 
send_wr->timeout, and send_wr->timeout will be set to the base (default) 
timeout as its initial value. If instead the randomization is coerced via the 
randomized_wait module parameter, total_timeout will be set to
(send_wr->timeout * send_wr->max_retries).

In either case, retries cannot continue past this total time, even if that 
means fewer retry attempts than the caller requested.



Attachment: randomized_mad_timeout.patch
Description: randomized_mad_timeout.patch
