pltbkd commented on pull request #16357: URL: https://github.com/apache/flink/pull/16357#issuecomment-874719844
Till and I discussed this topic briefly on the mailing list, and I'd like to share a point I raised there.

> I was just considering that, since the environment is shared by JM and TMs, and the connections among TMs (using netty) are flaky in unstable environments, which will also cause the job failure, is it necessary to build a strongly guaranteed connection between JM and TMs, or it could be as flaky as the connections among TMs? As far as I know, connections among TMs will just fail on their first connection loss, so behaving like this in JM just means "as flaky as connections among TMs".

And here is Till's reply on the mailing list.

> One simple approach could be to make the number of failed heartbeat RPCs until a target is marked as unreachable configurable because what represents a good enough criterion in one user's environment might produce too many false-positives in somebody else's environment. Or even simpler, one could say that one can disable reacting to a failed heartbeat RPC as it is currently the case.

I think it's OK to introduce a new option that determines when the JM can conclude that a connection loss really means a TM loss. The option is in fact part of the heartbeat configuration rather than the akka configuration, and the heartbeat options are all "expert" options, as indicated by the documentation annotation, so I suppose it would not confuse experts. Of course, we need to provide a proper default value for normal users.

If we use the number of failed heartbeat RPCs as the new option, we may need another interval option, because the current heartbeat interval is too long for this purpose. Alternatively, we could use a new timeout option: we keep retrying to connect to the TM (with a short, non-configurable delay) until the timeout expires.
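
To make the first alternative concrete, here is a minimal sketch of counting consecutive failed heartbeat RPCs against a configurable threshold. It is only an illustration under my own assumptions: the class `HeartbeatFailureTracker`, its method names, and the convention that a threshold of 0 or less disables the reaction are all hypothetical and are not taken from the PR or from existing Flink code.

```java
/**
 * Hypothetical helper (not existing Flink code): tracks consecutive failed
 * heartbeat RPCs to a single target, e.g. one TM as seen from the JM.
 */
final class HeartbeatFailureTracker {

    // Assumed convention: a threshold <= 0 disables reacting to failed RPCs.
    private final int failureThreshold;

    private int consecutiveFailures;

    HeartbeatFailureTracker(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    /**
     * Records one failed heartbeat RPC.
     *
     * @return true once the target should be marked as unreachable
     */
    boolean onHeartbeatRpcFailure() {
        if (failureThreshold <= 0) {
            // Reaction disabled: only the regular heartbeat timeout applies.
            return false;
        }
        consecutiveFailures++;
        return consecutiveFailures >= failureThreshold;
    }

    /** A successful heartbeat RPC clears the evidence collected so far. */
    void onHeartbeatRpcSuccess() {
        consecutiveFailures = 0;
    }
}
```

With a threshold of 1 this behaves like the connections among TMs, which fail on the first connection loss. A larger threshold tolerates short network glitches, and a threshold of 0 keeps the current behaviour of not reacting to a failed heartbeat RPC.
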
The main difference between this new timeout and the current heartbeat timeout is that the current heartbeat timeout is a liveness probe, used while we assume the TM is alive, whereas the new timeout is a death probe, used when we assume the TM is already dead and already have some evidence of this. It can therefore be shorter than the heartbeat timeout, which means faster failure.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
