zentol commented on pull request #16357: URL: https://github.com/apache/flink/pull/16357#issuecomment-879109408
> Therefore, my suggestion would be to introduce a threshold for lost heartbeat messages before triggering a timeout/mark the TaskExecutor as dead.

I'm curious as to how this threshold would be implemented, considering that there is no reliable pattern for how many RPC messages are sent or how often. It seems like any "X failed messages in Y time" approach would behave undesirably in the extreme cases, such as 1 RPC per Y, or hundreds of RPCs within Y/1000. I guess it would have to be something like "If all (but at least LOWER_THRESHOLD) messages in Y time could not be delivered, then the target is considered unreachable."

> Of course, we can also think about more complex models where all different signals are integrated to produce a confidence value and then for some value we say that the target is now "dead".

I think that would become too complex and too difficult for the user to understand (but maybe I'm imagining something overly complicated). However, I do like the idea of combining the information from heartbeats and other mechanisms; something along the lines of "if 1 heartbeat was not answered within akka.ask.timeout and no RPC (with at least LOWER_THRESHOLD attempts) went through within the timeframe (heartbeat_start -> heartbeat_timeout), then we consider the target unreachable." Essentially, we shortcut the heartbeat timeout instead of having an entirely separate mechanism.

> Hence, one could argue that heartbeating could mark their targets as dead after a single lost heartbeat message because it won't produce more false-positives as Flink's data-layer will already produce.

That only applies if there are actual data connections between task executors; it does not apply to jobs without shuffles, to jobs that are currently being deployed (potentially crashing the TM in the process), or to cases where the JM or all TMs go down.
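To make the two rules discussed in the comment concrete, here is a minimal sketch in Java. It is not Flink code: the class `ReachabilityWindow`, its methods, and the parameter names are hypothetical and exist only for illustration. It assumes the caller reports every RPC attempt with a timestamp and a delivered/failed flag, that the window length stands in for Y (or for the heartbeat_start -> heartbeat_timeout timeframe), and that `lowerThreshold` stands in for LOWER_THRESHOLD.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical illustration only; none of these names exist in Flink. */
final class ReachabilityWindow {

    /** One RPC attempt: when it happened and whether it was delivered. */
    private static final class Attempt {
        final long timestampMillis;
        final boolean delivered;

        Attempt(long timestampMillis, boolean delivered) {
            this.timestampMillis = timestampMillis;
            this.delivered = delivered;
        }
    }

    private final long windowMillis;   // "Y" in the comment (assumed unit: ms)
    private final int lowerThreshold;  // "LOWER_THRESHOLD" in the comment
    private final Deque<Attempt> attempts = new ArrayDeque<>();

    ReachabilityWindow(long windowMillis, int lowerThreshold) {
        this.windowMillis = windowMillis;
        this.lowerThreshold = lowerThreshold;
    }

    /** Records the outcome of one RPC attempt made at nowMillis. */
    void record(long nowMillis, boolean delivered) {
        attempts.addLast(new Attempt(nowMillis, delivered));
        evictOld(nowMillis);
    }

    /**
     * Rule 1: unreachable if ALL attempts inside the window failed
     * and there were at least lowerThreshold of them.
     */
    boolean isUnreachable(long nowMillis) {
        evictOld(nowMillis);
        if (attempts.size() < lowerThreshold) {
            return false; // too few samples, e.g. the "1 RPC per Y" extreme
        }
        for (Attempt attempt : attempts) {
            if (attempt.delivered) {
                return false; // at least one message went through
            }
        }
        return true;
    }

    /**
     * Rule 2: shortcut the heartbeat timeout only if a heartbeat went
     * unanswered AND no RPC got through in the same timeframe.
     */
    static boolean shouldShortcutHeartbeatTimeout(
            boolean heartbeatUnanswered, ReachabilityWindow rpcs, long nowMillis) {
        return heartbeatUnanswered && rpcs.isUnreachable(nowMillis);
    }

    /** Drops attempts that are older than the window. */
    private void evictOld(long nowMillis) {
        long cutoff = nowMillis - windowMillis;
        while (!attempts.isEmpty() && attempts.peekFirst().timestampMillis <= cutoff) {
            attempts.removeFirst();
        }
    }
}
```

Requiring every attempt in the window to fail covers the "hundreds of RPCs within Y/1000" extreme, since a single success clears the suspicion. The lower threshold covers the "1 RPC per Y" extreme, since one failed message alone is never enough. Both values would still have to be chosen relative to the heartbeat timeout; nothing in this sketch says what good defaults are.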
