tillrohrmann commented on pull request #16357: URL: https://github.com/apache/flink/pull/16357#issuecomment-879855212
I think in the first version the threshold would solely be defined for the heartbeat rpcs (e.g. if the `HeartbeatManager` fails to send `x` heartbeat rpcs, then mark the target as dead). This entails that the time until marking someone as dead is `threshold * heartbeat interval`. I hope that this definition is easy enough to understand for our users.

> That only applies if there are actual data connections between task executors; it does not apply to jobs without shuffles, jobs that are currently being deployed (potentially crashing the TM in the process), nor cases where the JM/all TMs go down.

Agreed that in some cases this might make the cluster a tad less stable, but given that Flink is able to run jobs with shuffles, it seems that Flink runs often enough in a stable-enough environment these days. Hence, I would suggest going with a default of `1` heartbeat rpc loss until marking a target as dead. If users start to complain, we can still disable it. I would leave more complex models for future improvements.

> By the way, similar to k8s' liveness probe, maybe it's better to respond heartbeat requests asap rather than queue it with other RPC requests, which I prefer to use as readiness probe. In this way, responding of heartbeat requests in a certain time can be more predictable, therefore the result of timeout is more reliable, and timeout can possibly be decreased. Though the connection loss is still not a reliable signal. But since I'm not experienced in akka, I can't tell if it's possible or easy to implement.

The problem is that the heartbeat is used to transport status information about the components. Since we require a certain order of rpcs, we cannot easily send the heartbeat signals from outside the main thread because it could lead to race conditions and outdated status information. One could try to use logical clocks to synchronize the messages again, but this hasn't been tried yet.

> This is quoted from your reply in the mail list. I'd like to make sure if this is a long term plan or is going to take actions right now. If it's in recent plan, the heartbeat mechanism may also need to be more reliable.

There are no immediate plans for adding this feature, but it will eventually be required.

So long story short, I would be in favour of adding a threshold for lost heartbeat rpcs until a target is marked as dead. The default should be `1`, and it is configurable by the user if it is too aggressive for their environment. If it turns out that too many users are running Flink on unstable environments, then we might have to increase this value in the future.

What do you think?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
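
The threshold rule described in the comment above (mark the target as dead after `x` failed heartbeat rpcs, giving a detection time of roughly `threshold * heartbeat interval`) could be sketched as follows. This is only an illustration, not Flink's actual `HeartbeatManager` code: the class `HeartbeatLossTracker` and all of its method names are hypothetical, and resetting the counter after a successful rpc is an assumption that the comment does not state.

```java
import java.time.Duration;

/** Hypothetical helper for illustration; not part of Flink's heartbeat API. */
final class HeartbeatLossTracker {
    private final int failedRpcThreshold;
    private final Duration heartbeatInterval;
    private int consecutiveFailures;

    HeartbeatLossTracker(int failedRpcThreshold, Duration heartbeatInterval) {
        if (failedRpcThreshold < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        this.failedRpcThreshold = failedRpcThreshold;
        this.heartbeatInterval = heartbeatInterval;
    }

    /** Records one failed heartbeat rpc; returns true once the target should be marked as dead. */
    boolean reportFailedRpc() {
        consecutiveFailures++;
        return consecutiveFailures >= failedRpcThreshold;
    }

    /** Assumption: a successful heartbeat rpc resets the failure count. */
    void reportSuccessfulRpc() {
        consecutiveFailures = 0;
    }

    /** Approximate detection time: threshold * heartbeat interval. */
    Duration approximateTimeUntilDead() {
        return heartbeatInterval.multipliedBy(failedRpcThreshold);
    }
}
```

With a threshold of `1`, the first failed rpc already marks the target as dead, which is the aggressive default proposed in the comment. A larger threshold trades slower detection for more tolerance of transient rpc failures.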
