zentol commented on pull request #16357: URL: https://github.com/apache/flink/pull/16357#issuecomment-874532709
> By default the heartbeat timeout is 50s while interval is 10s, so there may not be a significant improvement only to reduce the interval.

Well, one needs to ask themselves why the timeout is several times the interval. If the timeout is that large because a target should truly only be considered unreachable if nothing got through during this entire period, then both mechanisms will work the same way in any case (since that is how users configured it). Beyond that, it is mostly for reliability, isn't it? You wouldn't want to treat a TM as unreachable on the off-chance that these exact 2 messages got lost in the network. But if other messages also function as heartbeats, then a user has more leeway to reduce the timeout, increasing detection speed without compromising reliability.

> Considering that the heartbeat is also sent via akka, I'm not sure whether the heartbeat.timeout is working.

The heartbeat timeout isn't applied per-message. Essentially, when you send a heartbeat request you start a timer for the heartbeat timeout. If the response comes back in time, you stop the timer. If another heartbeat request is triggered while a timer is already running, you don't reset it. IOW, the heartbeat timeout describes how much time must pass during which _none_ of the heartbeats are acknowledged for the target to be considered unreachable.
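To make the timer semantics described above concrete, here is a minimal sketch. It is not Flink's actual heartbeat code: the class `HeartbeatTimeoutSketch` and its methods are hypothetical names, and it uses explicit millisecond timestamps instead of real timers so the behavior is easy to follow.

```java
import java.util.OptionalLong;

/** Hypothetical illustration of the timeout semantics; not Flink's implementation. */
final class HeartbeatTimeoutSketch {
    private final long timeoutMs;
    // Deadline of the currently running timer, or empty if no timer is running.
    private OptionalLong deadlineMs = OptionalLong.empty();

    HeartbeatTimeoutSketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Called each time a heartbeat request is sent (once per interval). */
    void onRequestSent(long nowMs) {
        // Start a timer only if none is running; a running timer is NOT reset.
        if (!deadlineMs.isPresent()) {
            deadlineMs = OptionalLong.of(nowMs + timeoutMs);
        }
    }

    /** Called when a heartbeat response arrives: stop the timer. */
    void onResponseReceived() {
        deadlineMs = OptionalLong.empty();
    }

    /** True once a full timeout has elapsed without any response. */
    boolean isUnreachable(long nowMs) {
        return deadlineMs.isPresent() && nowMs >= deadlineMs.getAsLong();
    }
}
```

With the quoted defaults (50s timeout, 10s interval), requests sent at 0s, 10s, 20s, and so on all share the timer started at 0s, so the target is only considered unreachable at 50s if none of them was acknowledged. A single response in between clears the timer, and the next request starts a fresh one.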
