On Tue, 14 Jun 2005, Hal Rosenstock wrote:
On Mon, 2005-06-13 at 18:33, James Lentini wrote:
On Mon, 13 Jun 2005, Hal Rosenstock wrote:
halr> On Wed, 2005-06-08 at 17:53, James Lentini wrote:
halr> > On Wed, 8 Jun 2005, Hal Rosenstock wrote:
halr> >
halr> > halr> On Wed, 2005-06-08 at 11:44, James Lentini wrote:
halr> > halr> > We interpreted the above to mean "give the connection protocol as
halr> > halr> > much time as it needs to establish a connection, but don't mask
halr> > halr> > errors (no path to the remote node, etc.)". For that reason we
halr> > halr> > changed the variable name to DAT_TIMEOUT_MAX.
halr> > halr>
halr> > halr> But if the REQ is lost, the timeout is really, really long (longer
halr> > halr> than most will wait for an error).
halr> >
halr> > If a user doesn't want to wait DAT_TIMEOUT_MAX time, it can pass a
halr> > smaller amount of time to dat_ep_connect. Does this satisfy your
halr> > requirements?
halr>
halr> Is it intended that the only way out is via user intervention (e.g.
halr> ctl-C) ? If one connection attempt (REQ) is made and it is lost, then
halr> there is no chance of it completing and the user needs to intervene.
Why does the user need to intervene? Did I misunderstand the CM
API?
When dapl_ep_connect() is called with a timeout value of
DAT_TIMEOUT_MAX, DAPL passes ib_send_cm_req the value 0x1F in the
ib_cm_req_param structure's remote_cm_response_timeout field. My
understanding was that this is the maximum timeout and that once it
expires the CM will inform the user that the REQ timed out.
Yes but it is a long time (4.096 * 2 ^ 31 usec ~ 8796 sec ~ 146.60 min
(if my calcs are correct)). This is longer than (most) users would wait.
They would usually hit ctl-C before this timeout is reached.
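
The arithmetic above can be checked with a small sketch. It assumes the CM
timeout field is a 5-bit exponent n that encodes 4.096 usec * 2^n, as the
calculation above implies. The function name cm_timeout_usec is hypothetical
and is not part of the CM API.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert an encoded CM timeout exponent into microseconds.
 * Assumption: the encoded timeout is 4.096 usec * 2^exponent, so the
 * math is done in nanoseconds (4.096 usec = 4096 ns) to stay in integers.
 * The result is truncated to whole microseconds.
 */
uint64_t cm_timeout_usec(unsigned exponent)
{
    assert(exponent <= 0x1F);              /* 5-bit field */
    return (4096ULL << exponent) / 1000ULL;
}
```

With exponent 0x1F this gives 8796093022 usec, which is about 8796 sec or
roughly 146.6 min, matching the figures quoted above.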
Understood. As long as it is not infinite we've made a step in the
right direction. I like your ideas below on how to improve this
further.
halr> If that is the intended behavior, we are there. (This (lost REQ)
halr> can occur even when the timeout is not infinite.)
We didn't intend for the active side to wait forever if a REQ was
lost.
The active side has no way of knowing that the REQ was lost (other than
timeout/retry) and when the timeout is long, this is effectively the
case.
This behavior is ok. The DAT consumer should choose a timeout value that
makes sense, it doesn't need to use DAT_TIMEOUT_MAX (and probably
shouldn't in most cases). We should update our dapltest program to use
a smaller value (like 1 min).
halr> An alternative (as Sean suggested) is to continually retry (at a
halr> periodicity below the supplied timeout) until the time period specified
halr> expires. That seems to be better (at least to me and Sean) in terms of
halr> handling the lost REQ case. As retries is not part of the API for
halr> connect, I would presume the implementor is free to do what they want under
halr> the covers of dapl_ib_connect.
You're correct.
The current implementation is:
1. address resolution phase for some amount of time
followed by:
2. dapl_ib_connect timeout * 5 (since there are 4 retries)
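
As a rough illustration of the two phases above, the worst-case wait can be
written as a one-line calculation. The function and parameter names are
hypothetical; only the "timeout * 5 (4 retries)" relationship comes from the
description above.

```c
#include <stdint.h>

/*
 * Worst-case time before the active side sees a timeout:
 * address resolution time plus one timeout period for the first REQ
 * and one for each retry.
 */
uint64_t worst_case_connect_wait_usec(uint64_t addr_resolution_usec,
                                      uint64_t per_attempt_usec,
                                      unsigned retries)
{
    return addr_resolution_usec + per_attempt_usec * ((uint64_t)retries + 1);
}
```

For example, a 60 sec timeout with 4 retries and 2 sec of address resolution
gives 302 sec in total, not 60 sec.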
Sounds like I need to understand the difference between the
ib_cm_req_param's retry_count and max_cm_retries fields. We set the
former to 0 and the latter to 4.
A better algorithm would be to divide the requested timeout across some
number of retries, with the number of retries varying based on the total
timeout requested.
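
One possible shape for such an algorithm, as a sketch only: start from a fixed
per-attempt timeout, shrink it if the requested total is shorter than one
attempt, and grow it if the total would need more attempts than the retry
count allows. All names and the constants CM_BASE_EXPONENT and
CM_MAX_ATTEMPTS are assumptions made for this illustration; they are not taken
from DAPL or the CM. The per-attempt timeout is assumed to be encoded as
4.096 usec * 2^exponent.

```c
#include <stdint.h>

#define CM_MAX_EXPONENT  31u  /* 5-bit timeout field (0x1F) */
#define CM_BASE_EXPONENT 20u  /* hypothetical default: about 4.3 sec per attempt */
#define CM_MAX_ATTEMPTS  16u  /* hypothetical cap: first REQ plus 15 retries */

struct cm_retry_plan {
    unsigned exponent;  /* per-attempt timeout = 4.096 usec * 2^exponent */
    unsigned retries;   /* retries after the first REQ */
};

/* Per-attempt timeout in nanoseconds (4.096 usec = 4096 ns). */
uint64_t cm_attempt_ns(unsigned exponent)
{
    return 4096ULL << exponent;
}

/* Split a total timeout (in usec) into a per-attempt timeout and retry count. */
struct cm_retry_plan plan_cm_retries(uint64_t total_usec)
{
    struct cm_retry_plan plan;
    unsigned exponent = CM_BASE_EXPONENT;
    uint64_t total_ns, attempts;

    /* Clamp so the conversion to nanoseconds cannot overflow. */
    if (total_usec > UINT64_MAX / 1000ULL)
        total_usec = UINT64_MAX / 1000ULL;
    total_ns = total_usec * 1000ULL;

    /* Short totals: shrink the per-attempt timeout until one attempt fits. */
    while (exponent > 0 && cm_attempt_ns(exponent) > total_ns)
        exponent--;

    /* Long totals: lengthen each attempt until the attempt count fits the cap. */
    attempts = total_ns / cm_attempt_ns(exponent);
    while (attempts > CM_MAX_ATTEMPTS && exponent < CM_MAX_EXPONENT) {
        exponent++;
        attempts = total_ns / cm_attempt_ns(exponent);
    }

    if (attempts < 1)
        attempts = 1;
    if (attempts > CM_MAX_ATTEMPTS)
        attempts = CM_MAX_ATTEMPTS;

    plan.exponent = exponent;
    plan.retries = (unsigned)(attempts - 1);
    return plan;
}
```

With these constants, a 60 sec request yields exponent 20 with 12 retries
(13 attempts of about 4.3 sec each), so a lost REQ is retried within a few
seconds while the total stays under the requested timeout. Address resolution
time is not accounted for here and would need to be subtracted from the total
first.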
I agree that would be better. As you point out, we should also account
for the address resolution time. I know that no one is working on
this. Are you interested?
-- Hal