zentol commented on pull request #16357:
URL: https://github.com/apache/flink/pull/16357#issuecomment-879891771


   > This entails that the time until marking someone as dead is threshold * 
heartbeat interval. I hope that this definition is easy enough to understand 
for our users.
   
   Yes, I think that is easy to understand. Restricting it to heartbeat RPCs 
also ties the whole story together: you have heartbeats as the _one_ mechanism 
to detect dead machines, with the timeouts handling unresponsive TMs (i.e., 
those blocked by some long-running operation or running exceedingly slowly), 
while the threshold handles TMs that are already dead.
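
   To make the threshold semantics concrete, here is a minimal sketch. It is 
not the code in this PR: the class `HeartbeatLossTracker` and its methods are 
hypothetical names, and resetting the counter on a successful heartbeat RPC is 
my assumption about how "loss" would be counted.

```java
import java.time.Duration;

/** Hypothetical helper (not part of Flink) that counts failed heartbeat RPCs. */
final class HeartbeatLossTracker {
    private final int threshold;
    private int consecutiveFailures;

    HeartbeatLossTracker(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    /** Records a failed heartbeat RPC; returns true once the target counts as dead. */
    boolean reportRpcFailure() {
        consecutiveFailures++;
        return consecutiveFailures >= threshold;
    }

    /** Assumption: a successful heartbeat RPC resets the failure counter. */
    void reportRpcSuccess() {
        consecutiveFailures = 0;
    }

    /** Time until a dead target is marked dead: threshold * heartbeat interval. */
    static Duration timeUntilDead(Duration heartbeatInterval, int threshold) {
        return heartbeatInterval.multipliedBy(threshold);
    }
}
```

   With a threshold of 3 and a heartbeat interval of 10 seconds, 
`timeUntilDead` gives 30 seconds. With a threshold of 1, a single lost 
heartbeat RPC is enough to mark the target as dead.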
   
   > Hence, I would suggest to go with a default of 1 heartbeat rpc loss until 
marking a target as dead.
   
   We can try that out, but I am apprehensive about the default of 1 (purely 
because expecting every message to go through is like _the_ thing everyone 
drills you not to do).  
   
   > The problem is that the heartbeat is used to transport status information 
about the components. Since we require a certain order of rpcs, we cannot send 
the heartbeat signals easily from outside the main thread because it could lead 
to race conditions and outdated status information. One could try to use some 
logical clocks to synchronize the messages again, but this hasn't been tried 
yet.
   
   We could also think of decoupling heartbeats and periodic status updates 
again. It would result in more RPCs, but the total amount of data may decrease 
(depending on how short the previous heartbeat interval was).
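
   A rough back-of-the-envelope model of that trade-off, as a sketch only. 
The class `HeartbeatCostModel`, the message sizes, and the intervals are all 
made-up illustrative values, not measurements from Flink.

```java
import java.time.Duration;

/** Hypothetical cost model (not Flink code) comparing coupled and decoupled heartbeats. */
final class HeartbeatCostModel {

    /** Number of RPCs sent during the window at the given interval. */
    static long rpcCount(Duration window, Duration interval) {
        return window.toMillis() / interval.toMillis();
    }

    /** Coupled: every heartbeat RPC also carries the status payload. */
    static long coupledBytes(
            Duration window, Duration heartbeatInterval, long heartbeatBytes, long statusBytes) {
        return rpcCount(window, heartbeatInterval) * (heartbeatBytes + statusBytes);
    }

    /** Decoupled: small heartbeats on one interval, status updates on their own interval. */
    static long decoupledBytes(
            Duration window,
            Duration heartbeatInterval,
            Duration statusInterval,
            long heartbeatBytes,
            long statusBytes) {
        return rpcCount(window, heartbeatInterval) * heartbeatBytes
                + rpcCount(window, statusInterval) * statusBytes;
    }
}
```

   Assuming a 1 s heartbeat interval, a 10 s status interval, 100-byte 
heartbeats and 1000-byte status payloads, one minute costs 60 RPCs and 66000 
bytes when coupled, versus 66 RPCs and 12000 bytes when decoupled. If the 
previous heartbeat interval was already 10 s, the byte totals are equal (6600) 
and decoupling only doubles the RPC count.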


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

