TLDR:
While testing the RC for 3.6.0, we found that a ZooKeeper cluster with a
large number of ensemble members (e.g. 23) cannot start properly. The issue
seems to happen only on Mac, and a workaround is to disable ICMP
throttling. The question is whether this workaround is enough for the RC, or
whether we should change the ZooKeeper code to limit the number of ICMP
requests.


The problem:

On Linux, I haven't been able to reproduce the problem. I tried with 5, 9,
15 and 23 ensemble members and the quorum always seems to start properly
within a few seconds. (I used OpenJDK 1.8.232 on Ubuntu 18.04)

On Mac, the problem happens consistently for large ensembles. The
server is very slow to start and we see a lot of warnings in the log like
these:

2020-01-15 20:02:13,431 [myid:13] - WARN
 [ListenerHandler-phunt-MBP13.local/192.168.1.91:4193:QuorumCnxManager@691]
- None of the addresses (/192.168.1.91:4190) are reachable for sid 10
java.net.NoRouteToHostException: No valid address among [/192.168.1.91:4190]

2020-01-17 11:02:26,177 [myid:4] - WARN
 [Thread-2531:QuorumCnxManager$SendWorker@1269] - destination address /
127.0.0.1 not reachable anymore, shutting down the SendWorker for sid 6

The exception is thrown when the new MultiAddress feature filters
unreachable hosts out of the address list while deciding which election
address to connect to. This involves calling the InetAddress.isReachable
method with a default timeout of 500ms, which goes down to a native call in
Java and basically tries to ping the host (an ICMP echo request). Naturally,
localhost should always be reachable, yet this call times out on Mac when
we have many ensemble members. I tested with 9 members and the cluster
started properly. With 11, 13 and 15 members it took more and more time to
get the cluster to start, and the "NoRouteToHostException" started to
appear in the logs. The 15-member cluster started after around 1 minute,
but obviously this is way too long.
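
To make the reachability check concrete, here is a minimal sketch of the
kind of probe described above. It is not the ZooKeeper code: the class and
method names are made up for illustration, and only the
InetAddress.isReachable call and the 500ms timeout come from the description
above. According to the Javadoc, a typical implementation uses ICMP echo
requests if the privilege can be obtained and otherwise tries a TCP
connection to port 7 of the destination, so the behaviour may differ
between platforms and users.

```java
import java.io.IOException;
import java.net.InetAddress;

public class ReachabilityProbe {
    // Hypothetical helper: true if the address answers within timeoutMs,
    // false on timeout or on an I/O error.
    static boolean probe(InetAddress address, int timeoutMs) {
        try {
            return address.isReachable(timeoutMs);
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        long start = System.nanoTime();
        boolean reachable = probe(loopback, 500); // 500ms, as in the description above
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("reachable=" + reachable + " elapsedMs=" + elapsedMs);
    }
}
```

Running this in a loop while a large local ensemble starts up might be one
way to observe the timeouts outside of ZooKeeper.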

On Mac, the ICMP rate limit is set to 250 by default. You can turn
it off using the following command:

sudo sysctl -w net.inet.icmp.icmplim=0

(see https://krypted.com/mac-os-x/disable-icmp-rate-limiting-os-x/)

Running the above command before starting the 23-member cluster locally
seems to solve the problem for me. (Can someone verify?) The question is
whether this workaround is enough.

As far as I can tell, the current code will generate 2*A*(M-1) ICMP calls
in each ZooKeeper server during startup, where 'M' is the number of
ensemble members and 'A' is the number of election addresses provided for
each member. This is not that high if each ZooKeeper server is started on a
different machine, but if we start a lot of ZooKeeper servers on a single
machine, it can quickly go beyond the predefined limit of 250 on Mac.
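
As a rough illustration of the arithmetic, the sketch below takes the
2*A*(M-1) estimate at face value (M ensemble members, A election addresses
per member) and multiplies it by M for the case where all servers run on
one machine. The class and method names are hypothetical. Comparing the
totals with 250 is only indicative, since the limit is a rate and the calls
are spread over the startup period.

```java
public class IcmpCallEstimate {
    // Per-server estimate quoted above: 2 * A * (M - 1).
    static long perServer(int members, int addressesPerMember) {
        return 2L * addressesPerMember * (members - 1);
    }

    // Total when all M servers are started on the same machine.
    static long totalOnOneHost(int members, int addressesPerMember) {
        return members * perServer(members, addressesPerMember);
    }

    public static void main(String[] args) {
        int[] sizes = {5, 9, 15, 23}; // ensemble sizes tried on Linux
        for (int m : sizes) {
            System.out.println("M=" + m
                    + " perServer=" + perServer(m, 1)
                    + " totalOnOneHost=" + totalOnOneHost(m, 1));
        }
    }
}
```

With a single election address per member, this gives 16 calls per server
(144 in total) for 9 members and 44 per server (1012 in total) for 23
members.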

OPTION 1: we keep the code as it is. We might change the documentation for
zkconf to mention this Mac-specific issue and how to disable the ICMP rate
limit.

OPTION 2: we change the code not to filter the list of election addresses
if the list has only a single element. This seems to be a logical way to
decrease the number of ICMP requests. However, if we ran a large number of
ZooKeeper servers on a single machine using multiple election addresses for
each server, we would hit the same problem (most probably even quicker).
OPTION 3: make the address filtering configurable and change zkconf to
disable it by default. (But disabling it will make the quorum potentially
unable to recover during network failures, so it is not recommended in
production.)

OPTION 4: refactor the MultiAddress feature and remove the ICMP calls from
the ZooKeeper code. However, they clearly help with quick recovery
during network failures... at the moment I can't think of any good way to
avoid them.

Kind regards,
Mate
