Folks, I've found an interesting thing about node-failure detection delays depending on the discovery type.
For example, on a 33-node cluster (33 VMs in the cloud in my case), we see the following failure detection delays when we drop connections via iptables (emulating network issues, GC pauses, hangs, etc.):
- ~10 seconds with TcpDiscovery
- ~8.5 seconds with ZookeeperDiscovery

But when we stop a node via SIGKILL, we get the following unexpected results:
- ~0.5 seconds with TcpDiscovery
- ~7.5 (!) seconds with ZookeeperDiscovery

TcpDiscovery handles the SIGKILL case faster because it periodically (~2 times per second) checks that the socket is alive, and once it sees the socket closed there is no reason to wait for anything else. ZookeeperDiscovery waits up to the 10 seconds specified by the default failureDetectionTimeout and has no socket-alive checks.

Should we change the default failureDetectionTimeout/zookeeperSessionDuration when ZookeeperDiscovery is used, so that we get at least the same failure detection performance as with TcpDiscovery? Are there any other ways to tune the ZookeeperDiscovery defaults? Is there any chance of adding socket-alive checks when ZookeeperDiscovery is used?
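To make the socket-alive idea concrete, here is a simplified, standalone sketch of the kind of probe TcpDiscovery relies on (this is an illustration using plain java.net, not Ignite's actual code): when the remote process is killed, the OS closes its sockets, so a short blocking read on our side returns -1 almost immediately instead of waiting for a long timeout.

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SocketAliveCheck {
    // Returns true if the peer closed the connection, false if it still looks alive
    // after one probe window.
    static boolean peerClosed(Socket s) throws Exception {
        InputStream in = s.getInputStream();
        s.setSoTimeout(500); // probe ~2 times per second, like the check described above
        try {
            return in.read() == -1; // -1 => the peer's socket was closed
        } catch (SocketTimeoutException e) {
            return false; // no close within the probe window: peer looks alive
        }
    }

    public static void main(String[] args) throws Exception {
        try (ServerSocket srv = new ServerSocket(0)) {
            Socket client = new Socket("127.0.0.1", srv.getLocalPort());
            Socket serverSide = srv.accept();

            System.out.println("alive=" + !peerClosed(client));

            serverSide.close(); // emulate the remote process dying; its sockets get closed
            System.out.println("closed=" + peerClosed(client)); // detected within one probe
        }
    }
}
```

The key point is that this detection cost is one probe interval (here 0.5 s), independent of any failure-detection timeout, which matches the ~0.5 s TcpDiscovery figure above.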
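For reference, this is roughly how those timeouts can be tuned per deployment today, pending any change of defaults (a configuration sketch; the ZooKeeper connection string and the 3-second values are placeholders, not recommendations):

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi;

public class ZkDiscoveryTuning {
    public static IgniteConfiguration configure() {
        ZookeeperDiscoverySpi zkSpi = new ZookeeperDiscoverySpi();
        zkSpi.setZkConnectionString("zk1:2181,zk2:2181,zk3:2181"); // placeholder hosts

        // Bounds how long ZooKeeper keeps a dead node's session (and thus its
        // ephemeral znodes) alive after the node stops heartbeating.
        zkSpi.setSessionTimeout(3_000);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(zkSpi);

        // Lowering this trades detection latency against false positives
        // under GC pauses and network jitter.
        cfg.setFailureDetectionTimeout(3_000);
        return cfg;
    }
}
```

The trade-off is the usual one: shorter timeouts detect a SIGKILLed node sooner but make long GC pauses more likely to be misclassified as failures.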