[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ZOOKEEPER-3698:
--------------------------------------
    Labels: pull-request-available  (was: )

> NoRouteToHostException when starting large ZooKeeper cluster on localhost
> -------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3698
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3698
>             Project: ZooKeeper
>          Issue Type: Bug
>            Reporter: Mate Szalay-Beko
>            Assignee: Mate Szalay-Beko
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.6.0
>
>
> While testing the RC for 3.6.0, we found that a ZooKeeper cluster with a 
> large number of ensemble members (e.g. 23) cannot start properly. We see a 
> lot of warnings in the log:
> {code:java}
> 2020-01-15 20:02:13,431 [myid:13] - WARN
>  [ListenerHandler-phunt-MBP13.local/192.168.1.91:4193:QuorumCnxManager@691]
> - None of the addresses (/192.168.1.91:4190) are reachable for sid 10
> java.net.NoRouteToHostException: No valid address among [/192.168.1.91:4190]
> {code}
>  and also:
> {code:java}
> 2020-01-17 11:02:26,177 [myid:4] - WARN  
> [Thread-2531:QuorumCnxManager$SendWorker@1269] - destination address 
> /127.0.0.1 not reachable anymore, shutting down the SendWorker for sid 6
> {code}
> The exceptions happen when the new MultiAddress feature tries to filter 
> unreachable hosts out of the address list. This involves calling the 
> InetAddress.isReachable method with a default timeout of 500 ms, which goes 
> down to a native call in Java and basically tries to ping the host (an ICMP 
> echo request). Naturally, localhost should always be reachable. For some 
> reason, this call fails (it times out or is simply refused) on macOS if we 
> have many ensemble members. I tested with 9 members and the cluster started 
> properly. With 11, 13 and 15 members it took more and more time to get the 
> cluster to start, and the "NoRouteToHostException" started to appear in the 
> logs. After around 1 minute the 15-member cluster started, but obviously 
> this is not acceptable. (I also tried with JDK 11 and found the same 
> behaviour.)
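
The reachability check described above can be tried in isolation with a small probe. This is only a sketch and not the ZooKeeper code path: the class name, the method name and the attempt count are made up for illustration, and only the 500 ms timeout comes from the description. According to its Javadoc, InetAddress.isReachable typically uses ICMP echo requests when the privilege can be obtained and otherwise tries a TCP connection to port 7. The probe below is sequential, so it may not trigger the rate limiting that many concurrently starting servers would.

```java
import java.io.IOException;
import java.net.InetAddress;

// Sketch only: not ZooKeeper code. Class and method names are hypothetical.
final class LoopbackReachabilityProbe {

    // Calls InetAddress.isReachable repeatedly and counts the calls
    // that return false or throw an IOException.
    static int countFailures(InetAddress address, int attempts, int timeoutMillis) {
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                if (!address.isReachable(timeoutMillis)) {
                    failures++;
                }
            } catch (IOException e) {
                // Assumption: a network error is counted as "unreachable" here.
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        int attempts = 100;      // arbitrary number of probes, chosen for illustration
        int timeoutMillis = 500; // timeout mentioned in the issue description
        int failures = countFailures(loopback, attempts, timeoutMillis);
        System.out.printf("%d of %d isReachable calls to %s failed%n",
                failures, attempts, loopback);
    }
}
```

A non-zero failure count against the loopback address would point at the operating system rejecting or dropping the probes rather than at a real routing problem.
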
>  
> On Linux, I haven't been able to reproduce the problem. I tried with 5, 9, 15 
> and 23 ensemble members and the quorum always seems to start properly within 
> a few seconds. (I used OpenJDK 1.8.232 on Ubuntu 18.04.)
> *Update*:
> On macOS, the ICMP rate limit is set to 250 by default. You can turn 
> it off using the following command: sudo sysctl -w net.inet.icmp.icmplim=0
>  (see [https://krypted.com/mac-os-x/disable-icmp-rate-limiting-os-x/])
> Running the above command before starting the 23-member cluster 
> locally seems to solve the problem for me. (Can someone verify?) The question 
> is whether this workaround is enough.
> As far as I can tell, the current code will generate {{2*A*(M-1)}} ICMP calls 
> in each ZooKeeper server during startup, where {{'M'}} is the number of 
> ensemble members and {{'A'}} is the number of election addresses provided for 
> each member. This is not that high if each ZooKeeper server is started on a 
> different machine, but if we start a lot of ZooKeeper servers on a single 
> machine, it can quickly go beyond the predefined limit of 250 on macOS.
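
To make the estimate above concrete, the following sketch evaluates 2*A*(M-1) for the ensemble sizes mentioned in the description. The class and method names are hypothetical. The per-host total assumes that all M servers run on the same machine, which multiplies the per-server count by M; that multiplication is an assumption of this sketch and is not stated in the report. Comparing the total with 250 is only a rough upper bound, because the limit is a rate and the probes are not necessarily all issued at the same moment.

```java
// Sketch only: evaluates the 2*A*(M-1) estimate from the issue description.
// The class and method names are hypothetical and not part of ZooKeeper.
final class IcmpProbeEstimate {

    // Default macOS ICMP limit quoted in the issue description.
    static final int MAC_DEFAULT_ICMP_LIMIT = 250;

    // Probes issued by one server during startup: 2 * A * (M - 1).
    static long probesPerServer(int members, int electionAddresses) {
        return 2L * electionAddresses * (members - 1);
    }

    // Assumption: all M servers run on one host, so the host handles
    // the per-server count M times.
    static long probesOnSingleHost(int members, int electionAddresses) {
        return (long) members * probesPerServer(members, electionAddresses);
    }

    public static void main(String[] args) {
        int electionAddresses = 1; // A: one election address per member
        for (int members : new int[] {9, 11, 13, 15, 23}) {
            long perServer = probesPerServer(members, electionAddresses);
            long total = probesOnSingleHost(members, electionAddresses);
            System.out.printf("M=%d: %d per server, %d on one host, above %d: %b%n",
                    members, perServer, total,
                    MAC_DEFAULT_ICMP_LIMIT, total > MAC_DEFAULT_ICMP_LIMIT);
        }
    }
}
```

With A = 1, a 23-member ensemble gives 44 probes per server and 1012 on a single host, while a 9-member ensemble gives 16 per server and 144 in total. Under these assumptions the total first exceeds 250 at 13 members (312), which is roughly in line with the sizes at which the startup problems were reported.
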



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
