Re: API and terms: idle-time-out and heartbeat intervals.

Rob Godfrey Fri, 30 Sep 2016 06:40:53 -0700

I don't think it's a bug - it's a completely valid (if chatty) choice.

To my mind the semantics of the field are this:


The sender of the open frame is saying "if I do not receive any data
for this period of time, I reserve the right to assume that you are no
longer functioning correctly"
On receiving this information, the peer should decide on a strategy
that ensures to the best of its ability, that if it is functioning
normally it will be generating data sufficiently frequently that the
condition does not occur.  If this peer knows that it is susceptible
to delays outside its direct control (such as garbage collection
pauses) it should take account of this in how often it schedules
sending heartbeat frames.
At the same time, the sender of this value should give some leeway
before actually shutting down to account for unexpected transport
delays, or processing delays on its side.  Since it is unaware of the
nature of the allowances made at the peer, it may decide to be
particularly generous in the leeway it grants.

I think the spec mentioning a ratio (such as twice) is massively unhelpful.

In terms of the API - what is the applications intent in setting the
value?  Is the intent to ensure transport activity within a specific
period, or to set a hard limit on how long the library will wait
before generating some sort of timeout exception?

-- Rob


On 30 September 2016 at 14:25, Ken Giusti <kgiu...@redhat.com> wrote:
> CC'ing user's list
>
> Given the interpretations of the spec discussed below is the way Proton 
> handles this functionality inconsistent with itself?
>
> 1) Proton sends an open frame with _half_ the value of the timeout as set by 
> the application.
> 2) Proton interprets the idle-timeout in the received open as the actual time 
> out interval, and pessimistically ;) generates heartbeats as 1/2 that value.
>
> Is this a bug in the Proton implementation?
>
> -K
>
> ----- Original Message -----
>> From: "Alan Conway" <acon...@redhat.com>
>> To: dev@qpid.apache.org
>> Sent: Thursday, September 29, 2016 9:11:51 AM
>> Subject: Re: API and terms: idle-time-out and heartbeat intervals.
>>
>> On Wed, 2016-09-28 at 22:33 +0100, Robbie Gemmell wrote:
>> > The spec unfortunately says what it says, and even if there were
>> > scope
>> > to update it, I'm not sure there is a way to change it to address the
>> > main issue (which regardless of any other clarity issues, is I think
>> > just that it only says you SHOULD advertise half your 'actual
>> > timeout') that wouldn't result in an 'incompatible change' of
>> > behaviour.
>>
>> I look at this differently. We agree on the semantics of idle-time-out
>> defined by the spec. Now forget the spec wording, it is easy to explain
>> clearly what the semantics are. It is not "half your actual time out",
>> it is the *exact* frame rate you expect from your peer. The peer's
>> *only* task is to respect that frame rate, they do not need any more
>> information than that - they certainly don't need to guess at what your
>> margin for error is for keeping the connection open.
>>
>> The margin for error is an implementation detail that does *not* need
>> to be advertised. Double the frame rate is a reasonable general
>> recommendation, but the ideal margin depends on latency variability in
>> the system, you can imagine systems where it might be better to have a
>> larger or a smaller margin.
>>
>> The name 'idle-time-out' is not an ideal choice, 'heartbeat' or 'max-
>> frame-delay' might be better. But once we know what it means, the
>> semantics are perfectly clear.
>>
>> > Knowing the exact enforced timeout could certainly be clearer in some
>> > ways, but if the period you had to send after wasnt defined (e.g
>> > half)
>> > then it would essentially leave you with the same problem as now:
>> > deciding exactly how early you should send heartbeat frames to avoid
>> > a
>> > [more-precisely-known] timeout.
>>
>> The only important thing is to agree on the meaning of the advertised
>> value: is it the frame rate or the connection close threshold? The work
>>  of adjusting at one end or the other is equivalent.
>>
>> I think we agree it is currently defined as the frame rate, I see no
>> benefit in trying to change the meaning. (There would be no drawback
>> either if the spec was not already published and implemented by
>> multiple implementors, but it is)
>>
>> >
>> > With what the spec says we effectively have a lower-bound on the
>> > actual-timeout (if cases the peer didnt half their actual-timeout
>> > before advertising a number) and an upper bound (if they did) due to
>> > the use of SHOULD as opposed to MUST or some other more fixed
>> > definition of behaviour. So we currently try to send frames often
>> > enough to satisfy that lower bound, i.e the number we received. That
>> > essentially reduces it to the the same problem as if we were trying
>> > to
>> > satisfy a theoretical exactly-known actual-timeout (just using a
>> > smaller number that we dont want to use, because we followed the
>> > specs
>> > recommendation). Currently we do this by sending frames after half
>> > the
>> > advertised period, i.e. what we know is actually a quarter of the
>> > actual-timeout enforced by peers such as proton which are following
>> > the specs recommendations. We could choose to do it less often than
>> > that by reducing how conservative it is, e.g use 75+% of advertised
>> > value instead for example, which should have no impact in the cases
>> > where peers had followed the recommendation (as we would still be
>> > sending more than twice as quickly as needed to satisfy their
>> > actual-timeout), but would put us closer to spurious timeout ( if e.g
>> > there was a delay in delivering/processing the heartbeat) against any
>> > peers that didn't follow the recommendation.
>> >
>> > On 28 September 2016 at 21:26, Justin Ross <justin.r...@gmail.com>
>> > wrote:
>> > >
>> > > IMO, the overall picture is simpler, and easier to explain to third
>> > > parties, if we go the way Ken suggested.  When a remote peer sends
>> > > you an
>> > > idle timeout value, it is an expression of an (actual, not simply
>> > > "advertised") guarantee - "I will expire the connection after X
>> > > time
>> > > without receiving a frame from you".
>> > >
>> > > We could also legitimately go the direction you suggest.  *But* its
>> > > name is
>> > > "idle timeout".  We can't easily change the name.  I think we
>> > > should take
>> > > the spec text that goes with the name, and the behavior of our
>> > > components,
>> > > firmly in the direction Ken suggests.
>> > >
>> > > Off topic: why is this on the dev list?
>> > >
>> > >
>> > > On Wed, Sep 28, 2016 at 12:36 PM, Alan Conway <acon...@redhat.com>
>> > > wrote:
>> > >
>> > > >
>> > > > On Wed, 2016-09-28 at 10:13 -0400, Ken Giusti wrote:
>> > > > >
>> > > > > I've had a hand in the way Proton/C interprets the meaning of
>> > > > > 'idle-
>> > > > > timeout' and I've never liked the solution.  I think Proton/C's
>> > > > > behavior is not 'pessimistic' as much as it is 'conservative'
>> > > > > for the
>> > > > > sake of interoperability.  This, unfortunately ends up with a
>> > > > > needless idle frame chattiness when both ends are Proton-based.
>> > > > >
>> > > > > ----- Original Message -----
>> > > > > >
>> > > > > >
>> > > > > > From: "Rob Godfrey" <rob.j.godf...@gmail.com>
>> > > > > > To: "qpid" <dev@qpid.apache.org>
>> > > > > > Sent: Wednesday, September 28, 2016 6:19:05 AM
>> > > > > > Subject: Re: API and terms: idle-time-out and heartbeat
>> > > > > > intervals.
>> > > > > >
>> > > > > > I agree that specifying that the communicated figure should
>> > > > > > be
>> > > > > > "half"
>> > > > > > the "actual" timeout was a mistake.
>> > > > > >
>> > > > > > What the spec should have tried to communicate is that the
>> > > > > > sender
>> > > > > > should communicate a value somewhat less than the period it
>> > > > > > uses to
>> > > > > > determine that the connection has actually timed-out to allow
>> > > > > > for
>> > > > > > the
>> > > > > > receiver to process and emit a heartbeat frame.
>> > > > >
>> > > > >
>> > > > > Wouldn't it be much clearer to simply send the _actual_ idle
>> > > > > timeout
>> > > > > value?
>> > > >
>> > > > My read is that is exactly what it does: It sends the max time
>> > > > that the
>> > > > *sender* of frames may be idle. The receiver of frames SHOULD be
>> > > > more
>> > > > patient than that. The wording of the "discussion" around it and
>> > > > the
>> > > > choice of terms is a bit cloudy but, the text that describes
>> > > > idle-time-
>> > > > out seems clear enough: it is the max interval between sending
>> > > > frames.
>> > > > The frame receiver SHOULD wait longer that that before closing,
>> > > > and 2x
>> > > > seems a reasonable suggestion, but that's for the impl to decide.
>> > > >
>> > > > It's weird that it says "idle-time-out should be half the
>> > > > threshold"
>> > > > instead of "the threshold should be twice the idle-time-out" but
>> > > > it's
>> > > > logically equivalent.
>> > > >
>> > > >
>> > > > >
>> > > > > Having the spec suggest "communicating a value *somewhat less*"
>> > > >
>> > > > The wording is odd but the semantics are you communicate
>> > > > *exactly* the
>> > > > max frame delay you want and then you SHOULD set your connection
>> > > > close
>> > > > threshold to something bigger. The other end doesn't need to know
>> > > > how
>> > > > much bigger, they just need to know what rate to send frames.
>> > > >
>> > > > >
>> > > > > [emphasis mine] leaves the implementation open for
>> > > > > interpretation -
>> > > > > which is exactly how we got into this mess in the first
>> > > > > place.  Developers are a smart bunch - they know that keep
>> > > > > alive
>> > > > > traffic will have to be sent frequently enough to prevent idle
>> > > > > timeout.
>> > > > >
>> > > > >
>> > > > > >
>> > > > > >
>> > > > > >  Similarly the sender
>> > > > > > should ensure that a frame has been emitted well within the
>> > > > > > timeout
>> > > > > > period to allow for any communication / processing delay.
>> > > > >
>> > > > > Agreed - perfectly acceptable for the spec to point this out.
>> > > > >
>> > > > > >
>> > > > > >
>> > > > > >  In practice
>> > > > > > these "wiggle room" factors should not be determined by the
>> > > > > > application level timeout setting but by sensible
>> > > > > > calculations on
>> > > > > > transport delay variance / processing time, etc...  these
>> > > > > > calculation
>> > > > > > may differ between different use-cases / environments (for
>> > > > > > example
>> > > > > > in
>> > > > > > a low latency / real-time environment you may be able to make
>> > > > > > hard
>> > > > > > guarantees about the number of milliseconds that
>> > > > > > communication /
>> > > > > > processing delay will take... on the other hand if you are
>> > > > > > using an
>> > > > > > interpreted language with stop-the-world garbage collection
>> > > > > > you may
>> > > > > > not be able to say much better than the delay should be less
>> > > > > > than
>> > > > > > 30s
>> > > > > > or whatever).
>> > > > > >
>> > > > >
>> > > > > Yes - very important things to keep in mind when implementing
>> > > > > this.  But the spec shouldn't be making these suggestions for
>> > > > > different implementation options. The spec should be as concise
>> > > > > as
>> > > > > possible about the mandated behavior, and leave the
>> > > > > implementation to
>> > > > > the developers.
>> > > > >
>> > > > > >
>> > > > > >
>> > > > > > I think application level APIs should be in terms of the
>> > > > > > timeouts
>> > > > > > that
>> > > > > > will affect the application.  The AMQP library should be
>> > > > > > massaging
>> > > > > > those numbers in such a way that they can fulfil the
>> > > > > > application
>> > > > > > requirements.
>> > > > > >
>> > > > >
>> > > > > Agreed.  Now, is there _any_ way we can suggest an update to
>> > > > > the
>> > > > > spec?  Perhaps an errata, etc?
>> > > > >
>> > > > > >
>> > > > > >
>> > > > > > -- Rob
>> > > > > >
>> > > > > > On 28 September 2016 at 10:42, Robbie Gemmell <robbie.gemmell
>> > > > > > @gmail
>> > > > > > .com>
>> > > > > > wrote:
>> > > > > > >
>> > > > > > >
>> > > > > > > On 27 September 2016 at 22:24, Alan Conway <aconway@redhat.
>> > > > > > > com>
>> > > > > > > wrote:
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Tue, 2016-09-27 at 15:37 -0400, Alan Conway wrote:
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > I want to clarify and document the meaning of these
>> > > > > > > > > terms for
>> > > > > > > > > our
>> > > > > > > > > APIs,
>> > > > > > > > > presently I can't find anywhere where they are
>> > > > > > > > > documented
>> > > > > > > > > clearly.
>> > > > > > > > >
>> > > > > > > > > The AMQP spec says: "Each peer has its own
>> > > > > > > > > (independent) idle
>> > > > > > > > > timeout.
>> > > > > > > > > At connection open each peer communicates the maximum
>> > > > > > > > > period between activity (frames) on the connection that
>> > > > > > > > > it
>> > > > > > > > > desires
>> > > > > > > > > from
>> > > > > > > > > its partner.The open frame carries the idletime-out
>> > > > > > > > > field for this purpose. To avoid spurious timeouts, the
>> > > > > > > > > value
>> > > > > > > > > in
>> > > > > > > > > idle-
>> > > > > > > > > time-out SHOULD be half the peer’s
>> > > > > > > > > actual timeout threshold."
>> > > > > > > > >
>> > > > > > > > > In other words: if I send you an "open" frame with
>> > > > > > > > > idle-time-
>> > > > > > > > > out=N
>> > > > > > > > > that
>> > > > > > > > > means *you* should not wait for longer than N
>> > > > > > > > > milliseconds to
>> > > > > > > > > send a
>> > > > > > > > > frame to me. It does not mean *I* will close the
>> > > > > > > > > connection
>> > > > > > > > > after N
>> > > > > > > > > milliseconds, I SHOULD be more patient and wait for N*2
>> > > > > > > > > ms to
>> > > > > > > > > avoid
>> > > > > > > > > closing prematurely due to minor timing wobbles.
>> > > > > > > > >
>> > > > > > > > > I think the choice of name is slightly ambiguous but
>> > > > > > > > > the spec
>> > > > > > > > > is
>> > > > > > > > > clear
>> > > > > > > > > on the semantics, so it's important to document it to
>> > > > > > > > > remove
>> > > > > > > > > the
>> > > > > > > > > ambiguity.
>> > > > > > > > >
>> > > > > > > > > Anybody disagree?
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > > Sigh. Sadly proton-C interprets "idle-timeout"
>> > > > > > > > differently
>> > > > > > > > depending on
>> > > > > > > > which end of the connection you are on:
>> > > > > > > >
>> > > > > > > >       // as per the recommendation in the spec, advertise
>> > > > > > > > half
>> > > > > > > > our
>> > > > > > > >       // actual timeout to the remote
>> > > > > > > >       const pn_millis_t idle_timeout = transport-
>> > > > > > > > >
>> > > > > > > > > local_idle_timeout
>> > > > > > > >           ? (transport->local_idle_timeout/2)
>> > > > > > > >           : 0;
>> > > > > > > >
>> > > > > > > > So in proton, pn_set_idle_timeout does NOT mean set the
>> > > > > > > > AMQP
>> > > > > > > > idle-
>> > > > > > > > timeout value, it means set the local "receive timeout"
>> > > > > > > > value
>> > > > > > > > and send
>> > > > > > > > half that as the AMQP "send timeout" for the peer.
>> > > > > > > >
>> > > > > > > > I'm tempted to use a new term in the Go API: "heartbeat".
>> > > > > > > > To me
>> > > > > > > > that
>> > > > > > > > clearly means the "send timeout" (hearts beat, they don't
>> > > > > > > > listen for
>> > > > > > > > beats) so it coincides with the meaning of the AMQP
>> > > > > > > > "idle-
>> > > > > > > > timeout", but
>> > > > > > > > without the ambiguity that is exacerbated by proton
>> > > > > > > > interpreting it
>> > > > > > > > both ways.
>> > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > > > Proton may seem to behave differently on each end, but I
>> > > > > > > don't
>> > > > > > > think
>> > > > > > > its necessarily a bad thing that it does, and it is also I
>> > > > > > > think
>> > > > > > > largely just reflecting an annoying bit in the spec around
>> > > > > > > this
>> > > > > > > where
>> > > > > > > different behaviours are allowed for, whereas it would be
>> > > > > > > easier
>> > > > > > > if it
>> > > > > > > had less wiggle room.
>> > > > > > >
>> > > > > > > The transport setter/getter for the local timeout takes the
>> > > > > > > 'actual
>> > > > > > > timeout' and then sends half of it as the advertised value
>> > > > > > > in the
>> > > > > > > Open
>> > > > > > > sent. This makes a certain amount of sense since it ensures
>> > > > > > > that
>> > > > > > > appropriate behaviour is actually satisfied, rather than
>> > > > > > > expecting the
>> > > > > > > user to ensure they only give half the value they really
>> > > > > > > want for
>> > > > > > > their actual timeout. The getter for the remote timeout
>> > > > > > > value on
>> > > > > > > the
>> > > > > > > other hand returns the advertised value from the Open that
>> > > > > > > is
>> > > > > > > received. I expect it does that since it cant actually ever
>> > > > > > > return the
>> > > > > > > remotes 'actual timeout' without making an assumption, i.e
>> > > > > > > that
>> > > > > > > they
>> > > > > > > did in fact advertise half (or less) of their actual
>> > > > > > > timeout,
>> > > > > > > which
>> > > > > > > the spec only says that they SHOULD do.
>> > > > > > >
>> > > > > > > Yes the local setter taking the advertised value may have
>> > > > > > > been
>> > > > > > > better
>> > > > > > > for method consistency with the remote getter. On the other
>> > > > > > > hand,
>> > > > > > > sending of necessary heartbeats is handled directly by the
>> > > > > > > transport
>> > > > > > > during the tick process, so users may not necessarily even
>> > > > > > > use
>> > > > > > > the
>> > > > > > > getter themselves, and proton uses that remote value
>> > > > > > > internally
>> > > > > > > by
>> > > > > > > pessimistically halfing it to account for the case that
>> > > > > > > folks on
>> > > > > > > the
>> > > > > > > other end did not advertise half their actual timeout
>> > > > > > > (since the
>> > > > > > > spec
>> > > > > > > doesnt require that they do). Side note: proton could
>> > > > > > > arguably be
>> > > > > > > less
>> > > > > > > pessimistic here and go for say a percentage much nearer
>> > > > > > > the full
>> > > > > > > advertised value, but then you'd probably need to start
>> > > > > > > guaging
>> > > > > > > how
>> > > > > > > close is too close.
>> > > > > > >
>> > > > > > > I think ensuring the doccumentation on the methods is clear
>> > > > > > > what
>> > > > > > > they
>> > > > > > > do is sufficient enough here. I actually prefer idle-
>> > > > > > > timeout as
>> > > > > > > an
>> > > > > > > name rather than heartbeat due to the way this all works.
>> > > > > > > Since
>> > > > > > > you
>> > > > > > > only tell the other side [half] your timeout, you dont
>> > > > > > > actually
>> > > > > > > have
>> > > > > > > direct control over when they send any needed empty frames
>> > > > > > > to
>> > > > > > > satisfy
>> > > > > > > it (as the above shows, we might send them more often than
>> > > > > > > they
>> > > > > > > require) and 'heartbeat' might seem to imply that you do,
>> > > > > > > and
>> > > > > > > possibly
>> > > > > > > even that they need be sent at that period all the time
>> > > > > > > even
>> > > > > > > despite
>> > > > > > > regular traffic, which is not the case.
>> > > > > > >
>> > > > > > > Robbie
>> > > > > > >
>> > > > > > > ---------------------------------------------------------
>> > > > > > > ------
>> > > > > > > ------
>> > > > > > > To unsubscribe, e-mail: dev-unsubscr...@qpid.apache.org
>> > > > > > > For additional commands, e-mail: dev-h...@qpid.apache.org
>> > > > > > >
>> > > > > >
>> > > > > > -----------------------------------------------------------
>> > > > > > ------
>> > > > > > ----
>> > > > > > To unsubscribe, e-mail: dev-unsubscr...@qpid.apache.org
>> > > > > > For additional commands, e-mail: dev-h...@qpid.apache.org
>> > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > > >
>> > > > ---------------------------------------------------------------
>> > > > ------
>> > > > To unsubscribe, e-mail: dev-unsubscr...@qpid.apache.org
>> > > > For additional commands, e-mail: dev-h...@qpid.apache.org
>> > > >
>> > > >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscr...@qpid.apache.org
>> > For additional commands, e-mail: dev-h...@qpid.apache.org
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@qpid.apache.org
>> For additional commands, e-mail: dev-h...@qpid.apache.org
>>
>>
>
> --
> -K
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscr...@qpid.apache.org
> For additional commands, e-mail: users-h...@qpid.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@qpid.apache.org
For additional commands, e-mail: dev-h...@qpid.apache.org

Re: API and terms: idle-time-out and heartbeat intervals.

Reply via email to