This patch builds upon feedback received earlier this year to add a "treat BUSY
as timeout" feature to ib_mad. It does NOT implement the ABI/API changes that
would be needed in user space to take advantage of the new feature, but it lays
the groundwork for doing so. In addition, it provides a new module parameter
that allows the administrator to coerce existing code into using the new
capability.

The patch builds upon the randomization/backoff patch I sent earlier today to
add a random factor to timeouts to prevent synchronized storms of MAD queries.
I chose to build upon the existing timeout handling because it seemed the best
way to add the functionality without
Initially, I had tried to completely separate BUSY retries from timeout
handling, but that seemed difficult due to the way the timeout code is
structured. As a result, true timeouts and busy handling still use the same
timeout values, but I was still able to address the idea of randomizing the
retry timeout if desired.

By default, the behavior of ib_mad with respect to BUSY responses is unchanged.
If, however, a send work request is provided that has the new "wait_on_busy"
parameter set, ib_mad will ignore BUSY responses to that WR, allowing it to
time out and retry as if no response had been received.

Finally, I've added a module parameter to coerce all MAD work requests to use
this new feature:

parm: treat_busy_as_timeout:When true, treat BUSY responses as if they were timeouts. (int)

As I mentioned in the past, this change solves a problem we see in the real
world all the time (the SM being pounded by "unintelligent" queries), so I
strongly hope this addresses your concerns.
----
diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 3b03f1c..9e5e566 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -60,6 +60,10 @@ MODULE_PARM_DESC(send_queue_size, "Size of send queue in number of work requests
module_param_named(recv_queue_size, mad_recvq_size, int, 0444);
MODULE_PARM_DESC(recv_queue_size, "Size of receive queue in number of work requests");
+int mad_wait_on_busy = 0;
+module_param_named(treat_busy_as_timeout, mad_wait_on_busy, int, 0444);
+MODULE_PARM_DESC(treat_busy_as_timeout, "When true, treat BUSY responses as if they were timeouts.");
+
int mad_randomized_wait = 0;
module_param_named(randomized_wait, mad_randomized_wait, int, 0444);
MODULE_PARM_DESC(randomized_wait, "When true, use a randomized backoff algorithm to control retries for timeouts.");
@@ -1120,6 +1124,7 @@ int ib_post_send_mad(struct ib_mad_send_buf *send_buf,
mad_send_wr->max_retries = send_buf->retries;
mad_send_wr->retries_left = send_buf->retries;
+ mad_send_wr->wait_on_busy = send_buf->wait_on_busy || mad_wait_on_busy;
send_buf->retries = 0;
@@ -1819,6 +1824,8 @@ static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv,
/* Complete corresponding request */
if (ib_response_mad(mad_recv_wc->recv_buf.mad)) {
+ u16 busy = __be16_to_cpu(mad_recv_wc->recv_buf.mad->mad_hdr.status) &
+ IB_MGMT_MAD_STATUS_BUSY;
spin_lock_irqsave(&mad_agent_priv->lock, flags);
mad_send_wr = ib_find_send_mad(mad_agent_priv, mad_recv_wc);
@@ -1829,6 +1836,17 @@ static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv,
return;
}
+ printk(KERN_DEBUG PFX "Completing recv %p: busy = %d, retries_left = %d, wait_on_busy = %d\n",
+ mad_send_wr, busy, mad_send_wr->retries_left, mad_send_wr->wait_on_busy);
+ if (busy && mad_send_wr->retries_left && mad_send_wr->wait_on_busy) {
+ /* Just let the query timeout and have it requeued later */
+ spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
+ ib_free_recv_mad(mad_recv_wc);
+ deref_mad_agent(mad_agent_priv);
+ printk(KERN_INFO PFX "SA/SM responded MAD_STATUS_BUSY. Allowing request to time out.\n");
+ return;
+ }
+
ib_mark_mad_done(mad_send_wr);
spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h
index 01fb7ed..1d0629e 100644
--- a/drivers/infiniband/core/mad_priv.h
+++ b/drivers/infiniband/core/mad_priv.h
@@ -135,6 +135,7 @@ struct ib_mad_send_wr_private {
unsigned long total_timeout;
int max_retries;
int retries_left;
+ int wait_on_busy;
int randomized_wait;
int retry;
int refcount;
diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h
index c3d6efb..3da55c3 100644
--- a/include/rdma/ib_mad.h
+++ b/include/rdma/ib_mad.h
@@ -255,6 +255,7 @@ struct ib_mad_send_buf {
int seg_count;
int seg_size;
int timeout_ms;
+ int wait_on_busy;
int randomized_wait;
int retries;
};