Seems there's some dispute over the meaning of idle-time-out in the AMQP spec, not just how proton treats it. Let me try to convince you.
Forget proton for a moment, here's the spec text with my comments. Remember: the spec defines the AMQP *protocol*, the important text is where it talks specifically about protocol fields and constructs. The key problem is that "idle timeout" is ambiguous: Does it mean the time the frame receiver will wait before closing, or the max time the frame sender is allowed to wait before sending? I claim that while the English words "idle timeout" are used sloppily in the spec, the statements about the formal protocol value idle-time-out are clear: idle-time-out is the frame interval, not the connection close threshold. ==== [spec] Connections are subject to an idle timeout threshold. The timeout is triggered by a local peer when no frames are received after a threshold value is exceeded. The idle timeout is measured in milliseconds, and starts from the time the last frame is received. If the threshold is exceeded, then a peer SHOULD try to gracefully close the connection using a close frame with an error explaining why. If the remote peer does not respond gracefully within a threshold to this, then the peer MAY close the TCP socket. [ac] This paragraph is sloppy: it uses "idle timeout", "idle timeout threshold" and "threshold" without defining any of them. However it has not mentioned any formal AMQP constructs as yet, so I claim this is non-normative discussion. [spec] Each peer has its own (independent) idle timeout. At connection open each peer communicates the maximum period between activity (frames) on the connection that it desires from its partner. [ac] This is clear: the *communicated* value (on the open frame) is the maximum period *between frames*, it is *not* the time threshold before a connection is closed. [spec] The open frame carries the idle-time-out field for this purpose. To avoid spurious timeouts, the value in idle-time-out SHOULD be half the peer’s actual timeout threshold. [ac] Here there is a clear distinction between the formal idle-time-out (frame interval) and the informal "timeout threshold" at which the connection is closed. Since the idle-time-out SHOULD be half the threshold, clearly it *is not* the threshold. It would have been clearer to say "the connection close threshold should be twice the idle-time-out" [snip irrelevant stuff] [spec] If during operation a peer exceeds the remote peer’s idle timeout’s threshold, e.g., because it is heavily loaded, it SHOULD gracefully close the connection by using a close frame with an error explaining why [ac] Sloppy again, but does not make any statement about the formal meaning of idle-time-out. It talks about "thresholds" which refer to the close threshold. ==== With one exception (in paragraph 1) the word "threshold" is always used (sometimes in "idle timeout threshold") when referring to the connection close threshold. I agree that the first and last paragraphs are unclear, and the English words "idle timeout" are over-used in confusing ways. However I think all the text that refers directly to the formal idle-time-out value is clear: idle-time-out is the inter- frame interval, NOT the connection close threshold. On Wed, 2016-09-28 at 10:13 -0400, Ken Giusti wrote: > I've had a hand in the way Proton/C interprets the meaning of 'idle- > timeout' and I've never liked the solution. I think Proton/C's > behavior is not 'pessimistic' as much as it is 'conservative' for the > sake of interoperability. This, unfortunately ends up with a > needless idle frame chattiness when both ends are Proton-based. > > ----- Original Message ----- > > > > From: "Rob Godfrey" <[email protected]> > > To: "qpid" <[email protected]> > > Sent: Wednesday, September 28, 2016 6:19:05 AM > > Subject: Re: API and terms: idle-time-out and heartbeat intervals. > > > > I agree that specifying that the communicated figure should be > > "half" > > the "actual" timeout was a mistake. > > > > What the spec should have tried to communicate is that the sender > > should communicate a value somewhat less than the period it uses to > > determine that the connection has actually timed-out to allow for > > the > > receiver to process and emit a heartbeat frame. > > > Wouldn't it be much clearer to simply send the _actual_ idle timeout > value? Having the spec suggest "communicating a value *somewhat > less*" [emphasis mine] leaves the implementation open for > interpretation - which is exactly how we got into this mess in the > first place. Developers are a smart bunch - they know that keep > alive traffic will have to be sent frequently enough to prevent idle > timeout. > > > > > > Similarly the sender > > should ensure that a frame has been emitted well within the timeout > > period to allow for any communication / processing delay. > > Agreed - perfectly acceptable for the spec to point this out. > > > > > In practice > > these "wiggle room" factors should not be determined by the > > application level timeout setting but by sensible calculations on > > transport delay variance / processing time, etc... these > > calculation > > may differ between different use-cases / environments (for example > > in > > a low latency / real-time environment you may be able to make hard > > guarantees about the number of milliseconds that communication / > > processing delay will take... on the other hand if you are using an > > interpreted language with stop-the-world garbage collection you may > > not be able to say much better than the delay should be less than > > 30s > > or whatever). > > > > Yes - very important things to keep in mind when implementing > this. But the spec shouldn't be making these suggestions for > different implementation options. The spec should be as concise as > possible about the mandated behavior, and leave the implementation to > the developers. > > > > > I think application level APIs should be in terms of the timeouts > > that > > will affect the application. The AMQP library should be massaging > > those numbers in such a way that they can fulfil the application > > requirements. > > > > Agreed. Now, is there _any_ way we can suggest an update to the > spec? Perhaps an errata, etc? > > > > > -- Rob > > > > On 28 September 2016 at 10:42, Robbie Gemmell <robbie.gemmell@gmail > > .com> > > wrote: > > > > > > On 27 September 2016 at 22:24, Alan Conway <[email protected]> > > > wrote: > > > > > > > > On Tue, 2016-09-27 at 15:37 -0400, Alan Conway wrote: > > > > > > > > > > I want to clarify and document the meaning of these terms for > > > > > our > > > > > APIs, > > > > > presently I can't find anywhere where they are documented > > > > > clearly. > > > > > > > > > > The AMQP spec says: "Each peer has its own (independent) idle > > > > > timeout. > > > > > At connection open each peer communicates the maximum > > > > > period between activity (frames) on the connection that it > > > > > desires > > > > > from > > > > > its partner.The open frame carries the idletime-out > > > > > field for this purpose. To avoid spurious timeouts, the value > > > > > in > > > > > idle- > > > > > time-out SHOULD be half the peer’s > > > > > actual timeout threshold." > > > > > > > > > > In other words: if I send you an "open" frame with idle-time- > > > > > out=N > > > > > that > > > > > means *you* should not wait for longer than N milliseconds to > > > > > send a > > > > > frame to me. It does not mean *I* will close the connection > > > > > after N > > > > > milliseconds, I SHOULD be more patient and wait for N*2 ms to > > > > > avoid > > > > > closing prematurely due to minor timing wobbles. > > > > > > > > > > I think the choice of name is slightly ambiguous but the spec > > > > > is > > > > > clear > > > > > on the semantics, so it's important to document it to remove > > > > > the > > > > > ambiguity. > > > > > > > > > > Anybody disagree? > > > > > > > > > > > > > Sigh. Sadly proton-C interprets "idle-timeout" differently > > > > depending on > > > > which end of the connection you are on: > > > > > > > > // as per the recommendation in the spec, advertise half > > > > our > > > > // actual timeout to the remote > > > > const pn_millis_t idle_timeout = transport- > > > > >local_idle_timeout > > > > ? (transport->local_idle_timeout/2) > > > > : 0; > > > > > > > > So in proton, pn_set_idle_timeout does NOT mean set the AMQP > > > > idle- > > > > timeout value, it means set the local "receive timeout" value > > > > and send > > > > half that as the AMQP "send timeout" for the peer. > > > > > > > > I'm tempted to use a new term in the Go API: "heartbeat". To me > > > > that > > > > clearly means the "send timeout" (hearts beat, they don't > > > > listen for > > > > beats) so it coincides with the meaning of the AMQP "idle- > > > > timeout", but > > > > without the ambiguity that is exacerbated by proton > > > > interpreting it > > > > both ways. > > > > > > > > > > > > > > Proton may seem to behave differently on each end, but I don't > > > think > > > its necessarily a bad thing that it does, and it is also I think > > > largely just reflecting an annoying bit in the spec around this > > > where > > > different behaviours are allowed for, whereas it would be easier > > > if it > > > had less wiggle room. > > > > > > The transport setter/getter for the local timeout takes the > > > 'actual > > > timeout' and then sends half of it as the advertised value in the > > > Open > > > sent. This makes a certain amount of sense since it ensures that > > > appropriate behaviour is actually satisfied, rather than > > > expecting the > > > user to ensure they only give half the value they really want for > > > their actual timeout. The getter for the remote timeout value on > > > the > > > other hand returns the advertised value from the Open that is > > > received. I expect it does that since it cant actually ever > > > return the > > > remotes 'actual timeout' without making an assumption, i.e that > > > they > > > did in fact advertise half (or less) of their actual timeout, > > > which > > > the spec only says that they SHOULD do. > > > > > > Yes the local setter taking the advertised value may have been > > > better > > > for method consistency with the remote getter. On the other hand, > > > sending of necessary heartbeats is handled directly by the > > > transport > > > during the tick process, so users may not necessarily even use > > > the > > > getter themselves, and proton uses that remote value internally > > > by > > > pessimistically halfing it to account for the case that folks on > > > the > > > other end did not advertise half their actual timeout (since the > > > spec > > > doesnt require that they do). Side note: proton could arguably be > > > less > > > pessimistic here and go for say a percentage much nearer the full > > > advertised value, but then you'd probably need to start guaging > > > how > > > close is too close. > > > > > > I think ensuring the doccumentation on the methods is clear what > > > they > > > do is sufficient enough here. I actually prefer idle-timeout as > > > an > > > name rather than heartbeat due to the way this all works. Since > > > you > > > only tell the other side [half] your timeout, you dont actually > > > have > > > direct control over when they send any needed empty frames to > > > satisfy > > > it (as the above shows, we might send them more often than they > > > require) and 'heartbeat' might seem to imply that you do, and > > > possibly > > > even that they need be sent at that period all the time even > > > despite > > > regular traffic, which is not the case. > > > > > > Robbie > > > > > > --------------------------------------------------------------- > > > ------ > > > To unsubscribe, e-mail: [email protected] > > > For additional commands, e-mail: [email protected] > > > > > > > ----------------------------------------------------------------- > > ---- > > To unsubscribe, e-mail: [email protected] > > For additional commands, e-mail: [email protected] > > > > > --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
