Rui Machado wrote:
Hi,
when setting the timeout in a struct ibv_qp_attr, this value
corresponds to the Local ACK Timeout, which according to the InfiniBand
spec defines the transport timer via the formula:
4.096 usec * 2^(Local ACK Timeout). Is this right?
And is there a value for this timeout that is considered "good practice"?
This value depends on your fabric size, on the HCA you have (and some more
factors)..
Also, in a client-server setup, if this timeout is set to a "big
value" (like 30) and the server dies, the client will take that
amount of time to realize the failure. Is this correct?
Yes, after (at least) the calculated timeout * retry_count, the
sender QP will get a retry exceeded error
(if a send request was posted without any response from the receiver).
Hmm... and is there no workaround for this situation? I
mean, if the server dies, isn't there any way for
the sender/client to realize it sooner? If the timeout is too large, this
can be cumbersome.
I tried reducing the timeout, and indeed the client realizes faster
when the server exits, but another problem arises: without exiting the
server,
on the client side I get a retry exceeded error when polling for a
recently posted send - this after some hours.
You don't really need to set a timeout of hours; I believe that a few
seconds should be enough for
almost any of today's clusters...
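For reference, here is a hedged sketch of where these knobs live in the verbs API: timeout and retry_cnt are set in the RTR-to-RTS transition of an RC QP via ibv_modify_qp(). The values below are only examples in the "few seconds total" range, not tuned recommendations, and the QP is assumed to already be in RTR with a matching send PSN.

```c
#include <infiniband/verbs.h>

/* Sketch: move an RC QP (already in RTR) to RTS with a modest
 * Local ACK Timeout. Worst-case failure detection is roughly
 * (4.096 usec * 2^timeout) * retry_cnt. */
static int set_qp_timeout(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr = {
        .qp_state      = IBV_QPS_RTS,
        .timeout       = 14, /* 4.096 usec * 2^14 ~= 67 ms per attempt */
        .retry_cnt     = 7,  /* retries before "retry exceeded" (7 = max) */
        .rnr_retry     = 7,  /* RNR NAK retries (7 = infinite) */
        .sq_psn        = 0,  /* example PSN; must match the remote RTR setup */
        .max_rd_atomic = 1,
    };

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                         IBV_QP_MAX_QP_RD_ATOMIC);
}
```

With timeout=14 and retry_cnt=7 this gives roughly half a second before the sender's CQ reports a retry exceeded completion, which matches the "a few seconds should be enough" guidance above.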
Thank you for the help.
You are welcome
:)
Dotan
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general