pltbkd commented on pull request #16357:
URL: https://github.com/apache/flink/pull/16357#issuecomment-880364616


   >This entails that the time until marking someone as dead is threshold * 
heartbeat interval. I hope that this definition is easy enough to understand 
for our users.
   
   I agree overall that this is a good enough plan for the first version. Maybe we 
should advise users not to set the threshold higher than timeout/interval, 
which by default is 50s/10s=5, a relatively small value. We could add this 
advice to the configuration description, if you think it's necessary as 
well.
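
   To make the arithmetic concrete, here is a minimal sketch of the two 
quantities discussed above: the time until a target is marked as dead 
(threshold * heartbeat interval) and the upper bound timeout/interval. The 
class and method names are hypothetical and are not part of Flink; only the 
50s/10s defaults come from this discussion.

```java
import java.time.Duration;

// Hypothetical helper for illustration only; not a Flink API.
public final class HeartbeatThresholdSketch {

    // Time until a target is marked as dead: threshold * heartbeat interval.
    public static Duration timeUntilMarkedDead(int threshold, Duration interval) {
        return interval.multipliedBy(threshold);
    }

    // Largest threshold that still reacts no later than the heartbeat timeout
    // (integer division, so any remainder is dropped).
    public static long maxSuggestedThreshold(Duration timeout, Duration interval) {
        return timeout.toMillis() / interval.toMillis();
    }

    public static void main(String[] args) {
        Duration interval = Duration.ofSeconds(10); // default mentioned above
        Duration timeout = Duration.ofSeconds(50);  // default mentioned above

        System.out.println(maxSuggestedThreshold(timeout, interval));      // 5
        System.out.println(timeUntilMarkedDead(2, interval).getSeconds()); // 20
    }
}
```

   With the defaults, any threshold above 5 would react more slowly than the 
existing heartbeat timeout, which is why the bound seems worth documenting.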
   
   >We can try that out, but I am apprehensive about the default of 1 (purely 
because expecting every message to go through is like the thing everyone drills 
you not to do).
   
   I agree with zentol, and I'd like to suggest a default value of 2, 
standing for discovering and confirming.
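
   As a rough illustration of "discovering and confirming", the sketch below 
counts consecutive failed heartbeat RPCs and reports the target as dead once 
the count reaches the threshold. It is an assumption about how such a counter 
could behave, not the code in this PR; in particular, the class name and the 
reset on success are hypothetical.

```java
// Hypothetical counter for illustration only; not the implementation in this PR.
public final class FailureCounterSketch {

    private final int threshold;
    private int consecutiveFailures;

    public FailureCounterSketch(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        this.threshold = threshold;
    }

    // Assumption: a successful heartbeat clears any earlier failures.
    public void onHeartbeatSuccess() {
        consecutiveFailures = 0;
    }

    // Returns true once the target should be marked as dead.
    public boolean onHeartbeatFailure() {
        consecutiveFailures++;
        return consecutiveFailures >= threshold;
    }
}
```

   With a threshold of 2, the first failure only discovers the problem and the 
second one confirms it; with a threshold of 1, a single lost message is enough.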
   
   >The problem is that the heartbeat is used to transport status information 
about the components. Since we require a certain order of rpcs, we cannot send 
the heartbeat signals easily from outside the main thread because it could lead 
to race conditions and outdated status information. One could try to use some 
logical clocks to synchronize the messages again, but this hasn't been tried 
yet.
   
   I now understand why the heartbeats also need to be in order. IMO, there are 
some advantages to having a more reliable and more frequent heartbeat mechanism, 
with which I think both the interval and the timeout could be decreased. But 
since I don't know many of the details here, I will leave this issue to the 
experts.

