tillrohrmann commented on pull request #16357:
URL: https://github.com/apache/flink/pull/16357#issuecomment-879855212


   I think in the first version the threshold would be defined solely for the 
heartbeat rpcs (e.g. if the `HeartbeatManager` fails to send `x` heartbeat 
rpcs, then mark the target as dead). This means that the time until a target 
is marked as dead is `threshold * heartbeat interval`. I hope that this 
definition is easy enough for our users to understand.
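   
   The rule above could be sketched roughly as follows. This is only an 
illustration, not the actual `HeartbeatManager` code: the class 
`HeartbeatRpcFailureTracker` and its methods are hypothetical names, and 
resetting the count on a successful rpc (i.e. counting consecutive failures) 
is an assumption.

```java
import java.time.Duration;

/** Hypothetical helper for illustration only; not part of Flink. */
final class HeartbeatRpcFailureTracker {
    private final int failureThreshold;
    private int consecutiveFailures;

    HeartbeatRpcFailureTracker(int failureThreshold) {
        if (failureThreshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.failureThreshold = failureThreshold;
    }

    /** Records a failed heartbeat rpc; returns true once the target should be marked as dead. */
    boolean onRpcFailure() {
        consecutiveFailures++;
        return consecutiveFailures >= failureThreshold;
    }

    /** Assumption: a successful heartbeat rpc resets the failure count. */
    void onRpcSuccess() {
        consecutiveFailures = 0;
    }

    /** Time until a target is marked as dead: threshold * heartbeat interval. */
    Duration detectionTime(Duration heartbeatInterval) {
        return heartbeatInterval.multipliedBy(failureThreshold);
    }
}
```

   With a threshold of `1`, the first failed rpc already marks the target as 
dead; with a threshold of `2` and an interval of 10 seconds, `detectionTime` 
returns 20 seconds.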
   
   >That only applies if there are actual data connections between task 
executors; it does not apply to jobs without shuffles, jobs that are currently 
being deployed (potentially crashing the TM in the process), nor cases where 
the JM/all TMs go down.
   
   Agreed that in some cases this might make the cluster a bit less stable, 
but given that Flink is able to run jobs with shuffles, it seems that Flink 
often enough runs in a sufficiently stable environment these days. Hence, I 
would suggest going with a default of `1` lost heartbeat rpc before marking a 
target as dead. If users start to complain, we can still disable it.
   
   I would leave more complex models for future improvements.
   
   >By the way, similar to k8s' liveness probe, maybe it's better to respond 
heartbeat requests asap rather than queue it with other RPC requests, which I 
prefer to use as readiness probe. In this way, responding of heartbeat requests 
in a certain time can be more predictable, therefore the result of timeout is 
more reliable, and timeout can possibly be decreased. Though the connection 
loss is still not a reliable signal. But since I'm not experienced in akka, I 
can't tell if it's possible or easy to implement.
   
   The problem is that the heartbeat is used to transport status information 
about the components. Since we require a certain order of rpcs, we cannot send 
the heartbeat signals easily from outside the main thread because it could lead 
to race conditions and outdated status information. One could try to use some 
logical clocks to synchronize the messages again, but this hasn't been tried 
yet.
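   
   To illustrate the logical clock idea, here is a minimal sketch under 
stated assumptions. It has not been tried in Flink, and all names 
(`VersionedStatus`, `StatusClock`, `LatestStatusHolder`) are hypothetical. 
It assumes a single sender that stamps every status report with a 
monotonically increasing counter, so that the receiver can drop reports that 
arrive out of order.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical status report stamped with a logical clock value. */
final class VersionedStatus<T> {
    final long version;
    final T payload;

    VersionedStatus(long version, T payload) {
        this.version = version;
        this.payload = payload;
    }
}

/** Sender side: hands out strictly increasing versions. */
final class StatusClock {
    private final AtomicLong clock = new AtomicLong();

    <T> VersionedStatus<T> stamp(T payload) {
        return new VersionedStatus<>(clock.incrementAndGet(), payload);
    }
}

/** Receiver side: keeps only the newest report and rejects outdated ones. */
final class LatestStatusHolder<T> {
    private VersionedStatus<T> latest;

    /** Returns true if the report was accepted, false if it was outdated. */
    synchronized boolean offer(VersionedStatus<T> report) {
        if (latest != null && report.version <= latest.version) {
            return false;
        }
        latest = report;
        return true;
    }

    synchronized T current() {
        return latest == null ? null : latest.payload;
    }
}
```

   This only addresses outdated status information from one sender; it does 
not by itself restore the ordering between heartbeats and other rpcs.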
   
   > This is quoted from your reply in the mail list. I'd like to make sure if 
this is a long term plan or is going to take actions right now. If it's in 
recent plan, the heartbeat mechanism may also need to be more reliable.
   
   There are no immediate plans for adding this feature but it will eventually 
be required.
   
   So long story short, I would be in favour of adding a threshold for lost 
heartbeat rpcs before a target is marked as dead. The default should be `1`, 
and users can configure it if it is too aggressive for their environment. If 
it turns out that too many users are running Flink in unstable environments, 
then we might have to increase this value in the future. What do you think?
   
   

