On 10/26/2010 2:33 PM, Mike Heinz wrote:
Resending. Didn't get any reply after the last posting.

-----Original Message-----
From: Mike Heinz
Sent: Monday, October 11, 2010 11:34 AM
To: '[email protected]'
Subject: [PATCH] Add exponential backoff + random delay to MADs when retrying 
after timeout.

This patch builds upon a discussion we had earlier this year on adding a 
backoff function when retrying MAD sends after a timeout.

Agreed that the fixed-timer retry strategy is b0rken for some scenarios (e.g. BUSY handling). I think a summary of this should be added to the patch description as background/motivation. You had previously written:

"The current behavior is to simply return the BUSY to the client or ULP, which is either treated as a permanent error or causes an immediate retry. This can be a big problem with, for example, ipoib which sets retries to 15 and (as I understand it) immediately retries to connect when getting an error response from the SA. Other ulps have similar settings. Without some kind of delay, starting up ipoib on a large fabric (at boot time, for example) can cause a real packet storm.

By treating BUSY replies identically to timeouts, this patch at least introduces a delay between attempts. In the case of the ULPs, the delay is typically 4 seconds.

This approach encourages applications to adjust their timeouts appropriately by treating BUSY responses as non-events and forcing the applications to wait for their request to time out.

Depending on the application developers to take BUSY responses into account seems to be asking for trouble - it allows one rogue app to bring the SA to its knees, for example. By enforcing this timeout model in the kernel, we guarantee that there will be at least some delay between each message when the SA is reporting a busy status."

Maybe some shorter version of the above should be part of this and/or your subsequent busy handling patch.

This patch does NOT implement the ABI/API changes that would be needed to take 
advantage of the new features, but it lays the groundwork for doing so. In 
addition, it provides a new module parameter that allows the administrator to 
coerce existing code into using the new capability.

First, I've added a new field called "randomized_wait" to the ib_mad_send_buf structure. If 
this field is set, each time the WR times out, the timeout for the next retry is set to 
(send_wr->timeout_ms + (511 << send_wr->retries) - (random32() & 511)). In other words, on the 
first retry, the randomization code adds roughly between 0.5 and 1 second to the timeout; on the 
second retry, between 1.5 and 2 seconds; on the third, between 3.5 and 4 seconds; and so on, with 
the pre-randomization increment doubling on each retry.

What experience/confidence is there in this (specific) randomization policy? On what (how large) IB cluster sizes has this policy been tried? Is this specific policy modeled on other policies in use elsewhere?

Also, is this randomized timeout used on RMPP packets if this parameter is not 0?

In addition, a new field, total_timeout, has been added to ib_mad_send_wr_private 
and is initialized to (send_wr->timeout * send_wr->max_retries). Retries cannot 
exceed this total time, even if that means fewer retry attempts.

Why is the total timeout more important than the number of retries in terms of terminating the transaction?

Shouldn't there be another parameter controlling whether this total timeout limits the retries, or whether the number of retries should be the limiting factor? I'm not sure about reducing the number of retries on a large fabric. I think the typical number used is 3, so this would be at most 2, depending on the per-retry timeout. I think the default policy should be to preserve the number of retries rather than the total timeout.

Finally, I've added a module parameter to coerce all mad work requests to use 
this feature if desired.

On one hand, I don't want to introduce unneeded parameters/complexity, but I'm wondering whether more granularity would be useful as to which requests (classes?) this applies to. For example, should SM requests be randomized? This feature is primarily an SA thing; although BUSY can be used for other management classes, its use is mainly GS related.

parm:           randomized_wait:When true, use a randomized backoff algorithm to control retries for timeouts. (int)

--------

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c 
index ef1304f..3b03f1c 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -42,6 +42,11 @@
  #include "smi.h"
  #include "agent.h"

+#include <linux/random.h>
+
+#define MAD_MIN_TIMEOUT_MS 511
+#define MAD_RAND_TIMEOUT_MS 511
+
  MODULE_LICENSE("Dual BSD/GPL");
  MODULE_DESCRIPTION("kernel IB MAD API");
  MODULE_AUTHOR("Hal Rosenstock");

@@ -55,6 +60,10 @@ MODULE_PARM_DESC(send_queue_size, "Size of send queue in number of work requests");
  module_param_named(recv_queue_size, mad_recvq_size, int, 0444);
  MODULE_PARM_DESC(recv_queue_size, "Size of receive queue in number of work requests");

+int mad_randomized_wait = 0;
+module_param_named(randomized_wait, mad_randomized_wait, int, 0444);
+MODULE_PARM_DESC(randomized_wait, "When true, use a randomized backoff algorithm to control retries for timeouts.");
+
  static struct kmem_cache *ib_mad_cache;

  static struct list_head ib_mad_port_list;

@@ -1102,11 +1111,18 @@ int ib_post_send_mad(struct ib_mad_send_buf *send_buf,
                }

                mad_send_wr->tid = ((struct ib_mad_hdr *) send_buf->mad)->tid;
+
+               mad_send_wr->randomized_wait = mad_randomized_wait ||
+                       send_buf->randomized_wait;
+               mad_send_wr->total_timeout =
+                       msecs_to_jiffies(send_buf->timeout_ms) * send_buf->retries;
+
                /* Timeout will be updated after send completes */
                mad_send_wr->timeout = msecs_to_jiffies(send_buf->timeout_ms);
+
                mad_send_wr->max_retries = send_buf->retries;
                mad_send_wr->retries_left = send_buf->retries;
+
                send_buf->retries = 0;
+
                /* Reference for work request to QP + response */
                mad_send_wr->refcount = 1 + (mad_send_wr->timeout > 0);
                mad_send_wr->status = IB_WC_SUCCESS;
@@ -1803,6 +1819,7 @@ static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv,

        /* Complete corresponding request */
        if (ib_response_mad(mad_recv_wc->recv_buf.mad)) {
+
                spin_lock_irqsave(&mad_agent_priv->lock, flags);
                mad_send_wr = ib_find_send_mad(mad_agent_priv, mad_recv_wc);
                if (!mad_send_wr) {
@@ -1811,6 +1828,7 @@ static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv,
                        deref_mad_agent(mad_agent_priv);
                        return;
                }
+

Nit: no need for extra lines in this and above.

                ib_mark_mad_done(mad_send_wr);
                spin_unlock_irqrestore(&mad_agent_priv->lock, flags);

@@ -2429,14 +2447,33 @@ static int retry_send(struct ib_mad_send_wr_private *mad_send_wr)
 {
        int ret;

-       if (!mad_send_wr->retries_left)
+       if (!mad_send_wr->retries_left || (mad_send_wr->total_timeout == 0))
                return -ETIMEDOUT;

        mad_send_wr->retries_left--;
        mad_send_wr->send_buf.retries++;

-       mad_send_wr->timeout = msecs_to_jiffies(mad_send_wr->send_buf.timeout_ms);
+       if (mad_send_wr->randomized_wait) {
+               mad_send_wr->timeout =
+                       msecs_to_jiffies(mad_send_wr->send_buf.timeout_ms +
+                               (MAD_MIN_TIMEOUT_MS << mad_send_wr->send_buf.retries) -
+                               (random32() & MAD_RAND_TIMEOUT_MS));
+               if (mad_send_wr->timeout > mad_send_wr->total_timeout) {
+                       mad_send_wr->timeout = mad_send_wr->total_timeout;
+                       mad_send_wr->total_timeout = 0;
+               } else {
+                       mad_send_wr->total_timeout -= mad_send_wr->timeout;
+               }
+       } else {
+               mad_send_wr->timeout =
+                       msecs_to_jiffies(mad_send_wr->send_buf.timeout_ms);
+       }

+       printk(KERN_DEBUG PFX "Retrying send %p: retries: %u, retries_left: %u, timeout: %lu, total_timeout: %lu\n",
+               mad_send_wr,
+               mad_send_wr->send_buf.retries,
+               mad_send_wr->retries_left,
+               mad_send_wr->timeout,
+               mad_send_wr->total_timeout);
+
        if (mad_send_wr->mad_agent_priv->agent.rmpp_version) {
                ret = ib_retry_rmpp(mad_send_wr);
                switch (ret) {
diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h
index 9430ab4..01fb7ed 100644
--- a/drivers/infiniband/core/mad_priv.h
+++ b/drivers/infiniband/core/mad_priv.h
@@ -132,8 +132,10 @@ struct ib_mad_send_wr_private {
        struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG];
        __be64 tid;
        unsigned long timeout;
+       unsigned long total_timeout;
        int max_retries;
        int retries_left;
+       int randomized_wait;
        int retry;
        int refcount;
        enum ib_wc_status status;
diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h
index d3b9401..c3d6efb 100644
--- a/include/rdma/ib_mad.h
+++ b/include/rdma/ib_mad.h
@@ -77,6 +77,15 @@

  #define IB_MGMT_MAX_METHODS                   128

+/* MAD Status field bit masks */
+#define IB_MGMT_MAD_STATUS_SUCCESS                      0x0000
+#define IB_MGMT_MAD_STATUS_BUSY                         0x0001
+#define IB_MGMT_MAD_STATUS_REDIRECT_REQD                0x0002
+#define IB_MGMT_MAD_STATUS_BAD_VERSION                  0x0004
+#define IB_MGMT_MAD_STATUS_UNSUPPORTED_METHOD           0x0008
+#define IB_MGMT_MAD_STATUS_UNSUPPORTED_METHOD_ATTRIB    0x000c
+#define IB_MGMT_MAD_STATUS_INVALID_ATTRIB_VALUE         0x001c
+

Another nit: although this does no harm, these don't look needed/used in this patch, so I don't think they should be part of it; move them to your subsequent BUSY handling patch.

-- Hal

  /* RMPP information */
  #define IB_MGMT_RMPP_VERSION                  1

@@ -246,6 +255,7 @@ struct ib_mad_send_buf {
        int                     seg_count;
        int                     seg_size;
        int                     timeout_ms;
+       int                     randomized_wait;
        int                     retries;
  };

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
