Sunkyoung Shin wrote:
During a failover test, we found that iscsi over iser reconnected to the
iscsi target after 100 seconds due to the default max timeout (8sec) and
retry number (15). The max timeout was adjustable with the module
parameter, max_timeout, of ib_cm.ko, but the retry number wasn't. Can we
add the retry number as a module parameter of rdma_cm.ko? I added the
patch below based on the ofed version, OFED-1.2-20070626-0917.

I understand that you want the QP timeout/retries to be smaller, not the CM timeout/retries, so there may be some confusion here. The following rdma-cm code snippet from cma_connect_ib() might help resolve it:
...
        req.qp_num = id_priv->qp_num;
        req.qp_type = IB_QPT_RC;
        req.starting_psn = id_priv->seq_num;
        req.responder_resources = conn_param->responder_resources;
        req.initiator_depth = conn_param->initiator_depth;
        req.flow_control = conn_param->flow_control;
        req.retry_count = conn_param->retry_count;
        req.rnr_retry_count = conn_param->rnr_retry_count;
        req.remote_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT;
        req.local_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT;
        req.max_cm_retries = CMA_MAX_CM_RETRIES;
        req.srq = id_priv->srq ? 1 : 0;

        ret = ib_send_cm_req(id_priv->cm_id.ib, &req);
...

The user is in total control of the QP retry count through the rdma-cm connection param structure (conn_param->retry_count above). req.max_cm_retries has nothing to do with the QP timeout.

The RC QP timeout is derived by the IB CM internally (on ofed, through the module param which you have changed), and neither the rdma-cm nor its consumer has direct control over it.

This follows the spirit of the IB spec: the SM/SA is the one to calculate and return to the host a param named "this path packet life time", and the IB CM combines the packet life time with something called the "hca ack delay". Currently the IB CM just uses 2 * path.packet_life_time as an estimate for the timeout, that is, the packet life time plus the hca ack delay; see cm_init_av_by_path() in core/cm.c.

Note that the actual timeout is T = 4.096us * 2^t, where t is the value plugged into the QP. Since adding one to the exponent doubles the time, setting t = path.packet_life_time + 1 does what I described above.

When I examined this in the past, I think openSM always returned
path.packet_life_time = 18, and the same held for some vendor SMs. Taking 4.096us as about 2^2 us, this means that the timeout is 2^(2+18+1) = 2^21us, or about 2 seconds.

The number of retries set by the iser initiator is seven (see iser_route_handler()), so seven times two gives 14 seconds. That suggests the 100 seconds you report for the initiator to reconnect may point to a different problem.
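
The arithmetic above can be checked with a small standalone sketch. This is not the IB CM code; the function names are made up for illustration, and it assumes the retry budget is simply retry_count back-to-back timeouts, as in the estimate above.

```c
#include <stdint.h>

/* RC QP timeout in seconds for exponent t: T = 4.096us * 2^t. */
static double rc_ack_timeout_sec(uint8_t t)
{
    return 4.096e-6 * (double)(UINT64_C(1) << t);
}

/* Exponent as described above: path.packet_life_time + 1
 * (hypothetical helper, not the kernel function). */
static uint8_t cm_timeout_exponent(uint8_t packet_life_time)
{
    return (uint8_t)(packet_life_time + 1);
}

/* Rough budget before the QP gives up, assuming retry_count
 * timeouts expire one after another. */
static double rc_retry_budget_sec(uint8_t packet_life_time, uint8_t retry_count)
{
    uint8_t t = cm_timeout_exponent(packet_life_time);
    return (double)retry_count * rc_ack_timeout_sec(t);
}
```

With packet_life_time = 18 and retry_count = 7, rc_retry_budget_sec() comes out at about 15 seconds using the exact 4.096us base (14 with the rounded 2-second figure). Either way it is far below 100 seconds.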

Or.


