pltbkd commented on pull request #16357: URL: https://github.com/apache/flink/pull/16357#issuecomment-880364616
> This entails that the time until marking someone as dead is threshold * heartbeat interval. I hope that this definition is easy enough to understand for our users.

Overall, I agree that this is a good enough plan for the first version. Maybe we should advise users not to set the threshold to more than timeout/interval, which by default is 50s/10s = 5, a relatively small value. We could add that suggestion to the configuration description, if you also think it's necessary.

> We can try that out, but I am apprehensive about the default of 1 (purely because expecting every message to go through is like the thing everyone drills you not to do).

I agree with zentol, and I'd like to suggest a default value of 2: one failure for discovering the problem and one for confirming it.

> The problem is that the heartbeat is used to transport status information about the components. Since we require a certain order of rpcs, we cannot send the heartbeat signals easily from outside the main thread because it could lead to race conditions and outdated status information. One could try to use some logical clocks to synchronize the messages again, but this hasn't been tried yet.

I now understand why the heartbeats also need to be in order. IMO, there are some advantages to having a more reliable and more frequent heartbeat mechanism, with which I think both the interval and the timeout could be decreased. But since I don't know many of the details here, I will leave this issue to the experts.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
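The arithmetic discussed in the comment can be sketched as follows. This is only an illustration: `HeartbeatSettings` and its field and method names are hypothetical and are not Flink configuration keys or APIs. The 10s interval and 50s timeout are the defaults quoted in the comment, and the threshold of 2 is the value proposed there, not a confirmed default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatSettings:
    """Hypothetical container for the three values discussed in the comment."""

    interval_s: float = 10.0        # heartbeat interval quoted in the comment
    timeout_s: float = 50.0         # heartbeat timeout quoted in the comment
    failed_rpc_threshold: int = 2   # proposed default (assumption, not confirmed)

    def detection_time_s(self) -> float:
        # Time until a target is marked dead: threshold * heartbeat interval.
        return self.failed_rpc_threshold * self.interval_s

    def max_suggested_threshold(self) -> int:
        # Upper bound suggested in the comment: timeout / interval.
        return int(self.timeout_s // self.interval_s)

    def threshold_within_suggestion(self) -> bool:
        # True if the threshold does not exceed timeout / interval.
        return self.failed_rpc_threshold <= self.max_suggested_threshold()


if __name__ == "__main__":
    settings = HeartbeatSettings()
    print(settings.detection_time_s())             # threshold * interval
    print(settings.max_suggested_threshold())      # timeout / interval
    print(settings.threshold_within_suggestion())
```

One reading of the suggested bound, which the comment does not spell out, is that a threshold above timeout/interval would take longer to trigger than the regular heartbeat timeout.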
