zentol commented on pull request #16357:
URL: https://github.com/apache/flink/pull/16357#issuecomment-874532709


   > By default the heartbeat timeout is 50s while interval is 10s, so there 
may not be a significant improvement only to reduce the interval.
   
   Well, one needs to ask why the timeout is several 
times the interval.
   
   If the timeout is that large because a target should truly only be 
considered unreachable if nothing got through during this entire period, then 
both mechanisms will work the same way in any case (because users configured 
it that way).
   
   Beyond that, it is mostly for reliability, isn't it? We wouldn't want to 
treat a TM as unreachable on the off-chance that exactly these 2 messages got 
lost in the network. But if other messages also function as heartbeats, then a 
user has more leeway to reduce the timeout, increasing detection speed without 
compromising reliability.
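
   To put a rough number on that with the defaults quoted above (10s interval, 
50s timeout), here is a small sketch. It assumes requests go out at a fixed 
interval starting at the moment the timer starts, and that only requests sent 
strictly before the deadline count. `HeartbeatRounds` and 
`roundsWithinTimeout` are made-up names for illustration, not Flink APIs.

```java
/** Illustrative arithmetic only; hypothetical helper, not part of Flink. */
class HeartbeatRounds {

    /**
     * Number of heartbeat requests sent strictly before the timeout deadline,
     * assuming one request at t=0 and one every intervalMs after that.
     * All of these rounds would have to fail for the target to time out.
     */
    static long roundsWithinTimeout(long intervalMs, long timeoutMs) {
        if (intervalMs <= 0 || timeoutMs <= 0) {
            throw new IllegalArgumentException("interval and timeout must be positive");
        }
        // Ceiling division: requests at 0, interval, 2*interval, ... < timeout.
        return (timeoutMs + intervalMs - 1) / intervalMs;
    }
}
```

   Under these assumptions the defaults give 5 rounds, whereas a timeout equal 
to the interval gives 1, i.e. a single lost request/response pair would be 
enough to mark the TM as unreachable.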
   
   > Considering that the heartbeat is also sent via akka, I'm not sure whether 
the heartbeat.timeout is working.
   
   The heartbeat timeout isn't applied per message. Essentially, when you send 
a heartbeat request you start a timer for the heartbeat timeout. If the 
response comes back in time, you stop the timer. If another heartbeat request 
is triggered while a timer is already running, you don't reset it.
   IOW, the heartbeat timeout describes how much time must pass during which 
_none_ of the heartbeats are acknowledged for the target to be considered 
unreachable.
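
   The timer behaviour described above can be modelled roughly as follows. 
This is a minimal sketch of the semantics only, with explicit timestamps 
instead of real timers; `HeartbeatTimeoutSketch` and its methods are 
hypothetical names and do not mirror Flink's actual heartbeat classes.

```java
import java.util.OptionalLong;

/** Hypothetical model of the timeout semantics; not Flink's implementation. */
class HeartbeatTimeoutSketch {
    private final long timeoutMs;
    // Deadline of the running timer; empty when no timer is running.
    private OptionalLong deadline = OptionalLong.empty();

    HeartbeatTimeoutSketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** A heartbeat request is sent at time nowMs. */
    void onRequestSent(long nowMs) {
        // Start a timer only if none is running; later requests never reset it.
        if (!deadline.isPresent()) {
            deadline = OptionalLong.of(nowMs + timeoutMs);
        }
    }

    /** Any heartbeat response arrived: stop the running timer. */
    void onResponseReceived() {
        deadline = OptionalLong.empty();
    }

    /** True once a timer has run for the full timeout without any response. */
    boolean isUnreachable(long nowMs) {
        return deadline.isPresent() && nowMs >= deadline.getAsLong();
    }
}
```

   With a 50s timeout, requests sent at 0s and 10s without any response make 
the target unreachable at 50s, not 60s, because the second request does not 
restart the timer. A single response in between stops the timer, and the next 
request starts a fresh one.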



