Re: API and terms: idle-time-out and heartbeat intervals.

Robbie Gemmell Wed, 28 Sep 2016 14:54:06 -0700

On 28 September 2016 at 21:36, Alan Conway <[email protected]> wrote:
> On Wed, 2016-09-28 at 13:26 -0700, Justin Ross wrote:
>> IMO, the overall picture is simpler, and easier to explain to third
>> parties, if we go the way Ken suggested.  When a remote peer sends you
>> an idle timeout value, it is an expression of an (actual, not simply
>> "advertised") guarantee - "I will expire the connection after X time
>> without receiving a frame from you".
>
> It is easy to explain either way, once you are clear what the semantics
> are. Here's the doc from my current Go API:
>
> In ConnectionSettings (this is querying what the remote set):
>         // Heartbeat is the maximum delay between sending frames that the
>         // remote peer has requested of us. If the interval expires, an
>         // empty "heartbeat" frame will be sent automatically to keep the
>         // connection open.
>
>
>>
>> We could also legitimately go the direction you suggest.  *But* its
>> name is "idle timeout".  We can't easily change the name.  I think we
>> should take the spec text that goes with the name, and the behavior of
>> our components, firmly in the direction Ken suggests.
>
> The spec text does not support this interpretation - it's poorly
> written I will grant you, but the parts that are clear are, well,
> clear:
>
> "At connection open each peer communicates the **maximum
> period between activity (frames)** on the connection that it desires
> from its partner."
>
> I don't see how there's any way to read that other than that this is
> the required frame interval for the frame sender. It is *not* the
> threshold for closing the connection, and the spec states:
>
> "To avoid spurious timeouts, the value in idle-time-out SHOULD be half
> the peer’s actual timeout threshold."
>
> Again I see no way to read that where you can conclude that
>     idle-time-out == connection close threshold.
>
> The use of the English words "idle timeout" is exceedingly sloppy and
> gives rise to confusion, but the use of the formal parameter name
> "idle-time-out" is not.
>


I agree that the spec as written defines the advertised idle-timeout
as the maximum time between sending frames, and separately the
connection close threshold as being a time after which a peer will
close the connection. Unfortunately its use of the word SHOULD there
means it does not require those values to be different. It clearly
recommends that they differ, and I think folks can see that doing
otherwise wouldn't be likely to result in the most reliable of
mechanisms for avoiding spurious timeouts, but it doesn't require it,
hence the current proton behaviour. That isn't to say the current
behaviour needs to be as conservative as it is.


>>
>> Off topic: why is this on the dev list?
>>
>>
>> On Wed, Sep 28, 2016 at 12:36 PM, Alan Conway <[email protected]>
>> wrote:
>>
>> >
>> > On Wed, 2016-09-28 at 10:13 -0400, Ken Giusti wrote:
>> > >
>> > > I've had a hand in the way Proton/C interprets the meaning of
>> > > 'idle-timeout' and I've never liked the solution.  I think
>> > > Proton/C's behavior is not 'pessimistic' as much as it is
>> > > 'conservative' for the sake of interoperability.  This
>> > > unfortunately ends up with needless idle frame chattiness when
>> > > both ends are Proton-based.
>> > >
>> > > ----- Original Message -----
>> > > >
>> > > >
>> > > > From: "Rob Godfrey" <[email protected]>
>> > > > To: "qpid" <[email protected]>
>> > > > Sent: Wednesday, September 28, 2016 6:19:05 AM
>> > > > Subject: Re: API and terms: idle-time-out and heartbeat
>> > > > intervals.
>> > > >
>> > > > I agree that specifying that the communicated figure should be
>> > > > "half" the "actual" timeout was a mistake.
>> > > >
>> > > > What the spec should have tried to communicate is that the
>> > > > sender should communicate a value somewhat less than the period
>> > > > it uses to determine that the connection has actually timed-out
>> > > > to allow for the receiver to process and emit a heartbeat frame.
>> > >
>> > >
>> > > Wouldn't it be much clearer to simply send the _actual_ idle
>> > > timeout value?
>> >
>> > My read is that is exactly what it does: It sends the max time that
>> > the *sender* of frames may be idle. The receiver of frames SHOULD be
>> > more patient than that. The wording of the "discussion" around it
>> > and the choice of terms is a bit cloudy, but the text that describes
>> > idle-time-out seems clear enough: it is the max interval between
>> > sending frames. The frame receiver SHOULD wait longer than that
>> > before closing, and 2x seems a reasonable suggestion, but that's for
>> > the impl to decide.
>> >
>> > It's weird that it says "idle-time-out should be half the threshold"
>> > instead of "the threshold should be twice the idle-time-out" but
>> > it's logically equivalent.
>> >
>> >
>> > >
>> > > Having the spec suggest "communicating a value *somewhat less*"
>> >
>> > The wording is odd but the semantics are you communicate *exactly*
>> > the max frame delay you want and then you SHOULD set your connection
>> > close threshold to something bigger. The other end doesn't need to
>> > know how much bigger, they just need to know what rate to send
>> > frames.
>> >
>> > >
>> > > [emphasis mine] leaves the implementation open for interpretation
>> > > - which is exactly how we got into this mess in the first
>> > > place.  Developers are a smart bunch - they know that keep alive
>> > > traffic will have to be sent frequently enough to prevent idle
>> > > timeout.
>> > >
>> > >
>> > > >
>> > > >
>> > > >  Similarly the sender should ensure that a frame has been
>> > > > emitted well within the timeout period to allow for any
>> > > > communication / processing delay.
>> > >
>> > > Agreed - perfectly acceptable for the spec to point this out.
>> > >
>> > > >
>> > > >
>> > > >  In practice these "wiggle room" factors should not be
>> > > > determined by the application level timeout setting but by
>> > > > sensible calculations on transport delay variance / processing
>> > > > time, etc...  these calculations may differ between different
>> > > > use-cases / environments (for example in a low latency /
>> > > > real-time environment you may be able to make hard guarantees
>> > > > about the number of milliseconds that communication /
>> > > > processing delay will take... on the other hand if you are
>> > > > using an interpreted language with stop-the-world garbage
>> > > > collection you may not be able to say much better than the
>> > > > delay should be less than 30s or whatever).
>> > > >
>> > >
>> > > Yes - very important things to keep in mind when implementing
>> > > this.  But the spec shouldn't be making these suggestions for
>> > > different implementation options. The spec should be as concise
>> > > as possible about the mandated behavior, and leave the
>> > > implementation to the developers.
>> > >
>> > > >
>> > > >
>> > > > I think application level APIs should be in terms of the
>> > > > timeouts that will affect the application.  The AMQP library
>> > > > should be massaging those numbers in such a way that they can
>> > > > fulfil the application requirements.
>> > > >
>> > >
>> > > Agreed.  Now, is there _any_ way we can suggest an update to the
>> > > spec?  Perhaps an erratum, etc?
>> > >
>> > > >
>> > > >
>> > > > -- Rob
>> > > >
>> > > > On 28 September 2016 at 10:42, Robbie Gemmell <robbie.gemmell@g
>> > > > mail
>> > > > .com>
>> > > > wrote:
>> > > > >
>> > > > >
>> > > > > On 27 September 2016 at 22:24, Alan Conway <[email protected]
>> > > > > m>
>> > > > > wrote:
>> > > > > >
>> > > > > >
>> > > > > > On Tue, 2016-09-27 at 15:37 -0400, Alan Conway wrote:
>> > > > > > >
>> > > > > > >
>> > > > > > > I want to clarify and document the meaning of these terms
>> > > > > > > for our APIs, presently I can't find anywhere where they
>> > > > > > > are documented clearly.
>> > > > > > >
>> > > > > > > The AMQP spec says: "Each peer has its own (independent)
>> > > > > > > idle timeout. At connection open each peer communicates
>> > > > > > > the maximum period between activity (frames) on the
>> > > > > > > connection that it desires from its partner. The open
>> > > > > > > frame carries the idle-time-out field for this purpose.
>> > > > > > > To avoid spurious timeouts, the value in idle-time-out
>> > > > > > > SHOULD be half the peer’s actual timeout threshold."
>> > > > > > >
>> > > > > > > In other words: if I send you an "open" frame with
>> > > > > > > idle-time-out=N that means *you* should not wait for
>> > > > > > > longer than N milliseconds to send a frame to me. It does
>> > > > > > > not mean *I* will close the connection after N
>> > > > > > > milliseconds, I SHOULD be more patient and wait for N*2
>> > > > > > > ms to avoid closing prematurely due to minor timing
>> > > > > > > wobbles.
>> > > > > > >
>> > > > > > > I think the choice of name is slightly ambiguous but the
>> > > > > > > spec is clear on the semantics, so it's important to
>> > > > > > > document it to remove the ambiguity.
>> > > > > > >
>> > > > > > > Anybody disagree?
>> > > > > > >
>> > > > > >
>> > > > > > Sigh. Sadly proton-C interprets "idle-timeout" differently
>> > > > > > depending on which end of the connection you are on:
>> > > > > >
>> > > > > >       // as per the recommendation in the spec, advertise half
>> > > > > >       // our actual timeout to the remote
>> > > > > >       const pn_millis_t idle_timeout = transport->local_idle_timeout
>> > > > > >           ? (transport->local_idle_timeout/2)
>> > > > > >           : 0;
>> > > > > >
>> > > > > > So in proton, pn_set_idle_timeout does NOT mean set the
>> > > > > > AMQP idle-timeout value, it means set the local "receive
>> > > > > > timeout" value and send half that as the AMQP "send
>> > > > > > timeout" for the peer.
>> > > > > >
>> > > > > > I'm tempted to use a new term in the Go API: "heartbeat".
>> > > > > > To me that clearly means the "send timeout" (hearts beat,
>> > > > > > they don't listen for beats) so it coincides with the
>> > > > > > meaning of the AMQP "idle-timeout", but without the
>> > > > > > ambiguity that is exacerbated by proton interpreting it
>> > > > > > both ways.
>> > > > > >
>> > > > > >
>> > > > >
>> > > > > Proton may seem to behave differently on each end, but I
>> > > > > don't think it's necessarily a bad thing that it does, and
>> > > > > I think it is also largely just reflecting an annoying bit in
>> > > > > the spec around this where different behaviours are allowed
>> > > > > for, whereas it would be easier if it had less wiggle room.
>> > > > >
>> > > > > The transport setter/getter for the local timeout takes the
>> > > > > 'actual timeout' and then sends half of it as the advertised
>> > > > > value in the Open sent. This makes a certain amount of sense
>> > > > > since it ensures that appropriate behaviour is actually
>> > > > > satisfied, rather than expecting the user to ensure they only
>> > > > > give half the value they really want for their actual
>> > > > > timeout. The getter for the remote timeout value on the other
>> > > > > hand returns the advertised value from the Open that is
>> > > > > received. I expect it does that since it can't actually ever
>> > > > > return the remote's 'actual timeout' without making an
>> > > > > assumption, i.e. that they did in fact advertise half (or
>> > > > > less) of their actual timeout, which the spec only says that
>> > > > > they SHOULD do.
>> > > > >
>> > > > > Yes, the local setter taking the advertised value may have
>> > > > > been better for method consistency with the remote getter. On
>> > > > > the other hand, sending of necessary heartbeats is handled
>> > > > > directly by the transport during the tick process, so users
>> > > > > may not necessarily even use the getter themselves, and
>> > > > > proton uses that remote value internally by pessimistically
>> > > > > halving it to account for the case that folks on the other
>> > > > > end did not advertise half their actual timeout (since the
>> > > > > spec doesn't require that they do). Side note: proton could
>> > > > > arguably be less pessimistic here and go for say a percentage
>> > > > > much nearer the full advertised value, but then you'd
>> > > > > probably need to start gauging how close is too close.
>> > > > >
>> > > > > I think ensuring the documentation on the methods is clear
>> > > > > about what they do is sufficient here. I actually prefer
>> > > > > idle-timeout as a name rather than heartbeat due to the way
>> > > > > this all works. Since you only tell the other side [half]
>> > > > > your timeout, you don't actually have direct control over
>> > > > > when they send any needed empty frames to satisfy it (as the
>> > > > > above shows, we might send them more often than they require)
>> > > > > and 'heartbeat' might seem to imply that you do, and possibly
>> > > > > even that they need be sent at that period all the time even
>> > > > > despite regular traffic, which is not the case.
>> > > > >
>> > > > > Robbie
>> > > > >
>> > > > > -----------------------------------------------------------
>> > > > > ----
>> > > > > ------
>> > > > > To unsubscribe, e-mail: [email protected]
>> > > > > For additional commands, e-mail: [email protected]
>> > > > >
>> > > >
>> > > >
>> > > >
>> > >
>> >
>> >
>> >
>> >
>
>
>
