Folks, I've found an interesting thing about node-failure detection delays depending on the discovery type.
For example, on a 33-node cluster (33 VMs in the cloud in my case), we see the following failure detection delays when we drop connections via iptables (emulating network issues, GC pauses, hangs, etc.):
- ~10 seconds with TcpDiscovery
- ~8.5 seconds with ZookeeperDiscovery

But when we stop a node via SIGKILL, we get the following unexpected results:
- ~0.5 seconds with TcpDiscovery
- ~7.5 (!) seconds with ZookeeperDiscovery

TcpDiscovery handles the SIGKILL case faster because it periodically (~2 times per second) checks that the socket is alive, and once it sees the socket closed there is no reason to wait for anything else. ZookeeperDiscovery waits up to the 10 seconds specified by the default failureDetectionTimeout and has no socket-alive checks.

Should we change the default failureDetectionTimeout/zookeeperSessionDuration when ZookeeperDiscovery is used, so that we get at least the same failure detection performance as with TcpDiscovery? Are there any other ways to tune the ZookeeperDiscovery defaults? Is there any chance of adding socket-alive checks when ZookeeperDiscovery is used?
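To make the socket-alive idea concrete, here is a simplified, standalone sketch of the kind of probe TcpDiscovery relies on (this is an illustration using plain java.net, not Ignite's actual code): when the remote process is killed, the OS closes its sockets, so a short blocking read on our side returns -1 almost immediately instead of waiting for a long timeout.

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SocketAliveCheck {
    // Returns true if the peer closed the connection, false if it still looks alive
    // after one probe window.
    static boolean peerClosed(Socket s) throws Exception {
        InputStream in = s.getInputStream();
        s.setSoTimeout(500); // probe ~2 times per second, like the check described above
        try {
            return in.read() == -1; // -1 => the peer's socket was closed
        } catch (SocketTimeoutException e) {
            return false; // no close within the probe window: peer looks alive
        }
    }

    public static void main(String[] args) throws Exception {
        try (ServerSocket srv = new ServerSocket(0)) {
            Socket client = new Socket("127.0.0.1", srv.getLocalPort());
            Socket serverSide = srv.accept();

            System.out.println("alive=" + !peerClosed(client));

            serverSide.close(); // emulate the remote process dying; its sockets get closed
            System.out.println("closed=" + peerClosed(client)); // detected within one probe
        }
    }
}
```

The key point is that this detection cost is one probe interval (here 0.5 s), independent of any failure-detection timeout, which matches the ~0.5 s TcpDiscovery figure above.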
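For reference, this is roughly how those timeouts can be tuned per deployment today, pending any change of defaults (a configuration sketch; the ZooKeeper connection string and the 3-second values are placeholders, not recommendations):

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi;

public class ZkDiscoveryTuning {
    public static IgniteConfiguration configure() {
        ZookeeperDiscoverySpi zkSpi = new ZookeeperDiscoverySpi();
        zkSpi.setZkConnectionString("zk1:2181,zk2:2181,zk3:2181"); // placeholder hosts

        // Bounds how long ZooKeeper keeps a dead node's session (and thus its
        // ephemeral znodes) alive after the node stops heartbeating.
        zkSpi.setSessionTimeout(3_000);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(zkSpi);

        // Lowering this trades detection latency against false positives
        // under GC pauses and network jitter.
        cfg.setFailureDetectionTimeout(3_000);
        return cfg;
    }
}
```

The trade-off is the usual one: shorter timeouts detect a SIGKILLed node sooner but make long GC pauses more likely to be misclassified as failures.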