[GitHub] [flink] zentol commented on pull request #16357: [FLINK-23209] Introduce HeartbeatListener.notifyTargetUnreachable

GitBox Tue, 13 Jul 2021 06:54:56 -0700


zentol commented on pull request #16357:
URL: https://github.com/apache/flink/pull/16357#issuecomment-879109408



   > Therefore, my suggestion would be to introduce a threshold for lost 
heartbeat messages before triggering a timeout/mark the TaskExecutor as dead.
   
   I'm curious as to how this threshold would be implemented considering that 
there is no reliable pattern for how many/often RPC messages are sent.
   It seems like any approach based on X failed messages in Y time would behave 
undesirably in the extreme cases, such as 1 RPC per Y, or hundreds of RPCs 
within Y/1000. I guess it would have to be something like "If all (but at least 
LOWER_THRESHOLD) messages in Y time could not be delivered then the target is 
considered unreachable."
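
A minimal sketch of that rule, to make the two conditions concrete. `UnreachableWindow`, `windowMillis` (the "Y" above) and `lowerThreshold` are hypothetical names used for illustration only; none of this is existing Flink code, and timestamps are passed in explicitly to keep the example deterministic.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper (not part of Flink): tracks RPC delivery outcomes in a sliding window. */
final class UnreachableWindow {
    private final long windowMillis;   // "Y" in the comment above (assumed to be in milliseconds)
    private final int lowerThreshold;  // "LOWER_THRESHOLD" in the comment above
    // Each entry is {timestampMillis, 1 if delivered else 0}, oldest first.
    private final Deque<long[]> outcomes = new ArrayDeque<>();

    UnreachableWindow(long windowMillis, int lowerThreshold) {
        this.windowMillis = windowMillis;
        this.lowerThreshold = lowerThreshold;
    }

    /** Records whether an RPC sent at nowMillis could be delivered. */
    void record(long nowMillis, boolean delivered) {
        outcomes.addLast(new long[] {nowMillis, delivered ? 1L : 0L});
    }

    /** True if at least lowerThreshold messages fall in the window and none of them was delivered. */
    boolean isUnreachable(long nowMillis) {
        // Drop outcomes that are older than the window.
        while (!outcomes.isEmpty() && outcomes.peekFirst()[0] <= nowMillis - windowMillis) {
            outcomes.removeFirst();
        }
        if (outcomes.size() < lowerThreshold) {
            return false; // too few samples to decide, e.g. 1 RPC per Y
        }
        for (long[] outcome : outcomes) {
            if (outcome[1] == 1L) {
                return false; // at least one message went through
            }
        }
        return true;
    }
}
```

With too few samples in the window the check never fires, so the regular heartbeat timeout would remain the fallback in that case.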
   
   > Of course, we can also think about more complex models where all different 
signals are integrated to produce a confidence value and then for some value we 
say that the target is now "dead".
   
   I think that would become too complex and too difficult for the user to 
understand. (but maybe I'm imagining something overly complicated)
   However, I do like the idea of combining the information of both heartbeats 
and other mechanisms; something along the lines of "if 1 heartbeat was not 
answered within akka.ask.timeout and no RPC (with at least LOWER_THRESHOLD 
attempts) went through within the timeframe (heartbeat_start -> 
heartbeat_timeout), then we consider the target unreachable."; essentially we 
shortcut the heartbeat timeout, instead of having an entirely separate 
mechanism.
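
Roughly, the combined check could look like the sketch below. `HeartbeatShortcut` and its parameters are hypothetical names for illustration, not an existing Flink API; the caller is assumed to count RPC attempts and successes only within the timeframe (heartbeat_start -> heartbeat_timeout).

```java
/** Hypothetical sketch (not part of Flink) of shortcutting the heartbeat timeout. */
final class HeartbeatShortcut {
    private HeartbeatShortcut() {}

    /**
     * @param heartbeatAnswered whether the heartbeat was answered within akka.ask.timeout
     * @param rpcAttempts RPCs sent to the target within (heartbeat_start -> heartbeat_timeout)
     * @param rpcSuccesses how many of those RPCs went through
     * @param lowerThreshold minimum number of attempts needed before the RPC signal counts
     */
    static boolean isUnreachable(
            boolean heartbeatAnswered, int rpcAttempts, int rpcSuccesses, int lowerThreshold) {
        if (heartbeatAnswered) {
            return false; // heartbeat went through, nothing to shortcut
        }
        if (rpcAttempts < lowerThreshold) {
            return false; // not enough RPC evidence, wait for the regular heartbeat timeout
        }
        return rpcSuccesses == 0; // unreachable only if no RPC went through
    }
}
```

When this returns false, nothing changes compared to today: the regular heartbeat timeout still applies.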
   
   > Hence, one could argue that heartbeating could mark their targets as dead 
after a single lost heartbeat message because it won't produce more 
false-positives as Flink's data-layer will already produce.
   
   That only applies if there are actual data connections between task 
executors; it does not apply to jobs without shuffles, jobs that are currently 
being deployed (potentially crashing the TM in the process), nor cases where 
the JM/all TMs go down.


