Rui Machado wrote:
Hi,

when setting the timeout field in a struct ibv_qp_attr, this value
corresponds to the Local ACK Timeout, which according to the InfiniBand
spec defines the transport timer via the formula:
4.096 us * 2^(Local ACK Timeout). Is this right?
And is there a value for this timeout that is considered good practice?

This value depends on your fabric size, on the HCA you have, and on some
other factors.
Also, in a client-server setup, if this timeout is set to a big value
(like 30), then when the server dies, the client will take that amount
of time to realize the failure. Is this correct?

Yes. After (at least) the calculated timeout multiplied by retry_cnt,
the sender's QP will get a retry-exceeded error
(if a Send Request was posted and no response arrived from the receiver).

Hmm... and is there no workaround for this situation? I mean, if the
server dies, isn't there any way for the sender/client to realize
this? If the timeout is too large, this can be cumbersome.

I tried reducing the timeout, and indeed the client realizes faster
when the server exits. But another problem arises: without exiting the
server, on the client side I get a retry-exceeded error when polling
for a recently posted send; this happens after some hours.
You don't really need to set a timeout of hours; I believe a few
seconds should be enough for almost any of today's clusters.


Thank you for the help.
You are welcome
:)

Dotan
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

