[ https://issues.apache.org/jira/browse/ZOOKEEPER-3698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17018688#comment-17018688 ]

Enrico Olivelli commented on ZOOKEEPER-3698:
--------------------------------------------

Thinking about it a bit more: parallelStream() is not the problem in this case. We
have only one address per peer, so parallelStream() does nothing special here;
it uses only one thread.
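This can be checked with a small standalone sketch, assuming a single-element address list as described above. The class `SingleAddressParallelStream`, the helper `filterParallel` and the sample address string are hypothetical stand-ins for the real per-peer address list, and the reachability check is replaced by a stub predicate so no network access is involved. The helper records the name of every thread that evaluates the filter.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class SingleAddressParallelStream {

    // Hypothetical helper: filters an address list with parallelStream() and
    // records which threads evaluated the predicate.
    static <T> List<T> filterParallel(List<T> addresses, Predicate<T> check, Set<String> threadsUsed) {
        return addresses.parallelStream()
                .filter(a -> {
                    threadsUsed.add(Thread.currentThread().getName());
                    return check.test(a);
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Thread-safe set, since a parallel stream could in general use several threads.
        Set<String> threads = ConcurrentHashMap.newKeySet();
        // One address per peer; the reachability check is stubbed to always succeed.
        List<String> kept = filterParallel(
                Collections.singletonList("192.168.1.91:4190"), addr -> true, threads);
        System.out.println("kept " + kept + ", evaluated on " + threads
                + ", caller is " + Thread.currentThread().getName());
    }
}
```

With a single element the stream's spliterator cannot be split, so the predicate should run only on the calling thread. With several addresses per peer, the same pipeline may fan out to the common ForkJoinPool.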

> NoRouteToHostException when starting large ZooKeeper cluster on localhost
> -------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3698
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3698
>             Project: ZooKeeper
>          Issue Type: Bug
>            Reporter: Mate Szalay-Beko
>            Assignee: Mate Szalay-Beko
>            Priority: Major
>             Fix For: 3.6.0
>
>
> While testing the RC for 3.6.0, we found that a ZooKeeper cluster with a large
> number of ensemble members (e.g. 23) cannot start properly. We see a lot of
> warnings in the log:
> {code:java}
> 2020-01-15 20:02:13,431 [myid:13] - WARN
>  [ListenerHandler-phunt-MBP13.local/192.168.1.91:4193:QuorumCnxManager@691]
> - None of the addresses (/192.168.1.91:4190) are reachable for sid 10
> java.net.NoRouteToHostException: No valid address among [/192.168.1.91:4190]
> {code}
>  
> The exception happens when the new MultiAddress feature tries to filter
> the unreachable hosts out of the address list. This involves calling the
> InetAddress.isReachable method with a default timeout of 500ms, which goes
> down to a native call in Java and basically tries to do a ping (an ICMP echo
> request) to the host. Naturally, localhost should always be reachable.
> For some reason, this call times out on Mac if we have many ensemble
> members. I tested with 9 members and the cluster started properly. With
> 11, 13 and 15 members it took more and more time to get the cluster to start, and
> the "NoRouteToHostException" started to appear in the logs. After around 1
> minute the 15-member cluster started, but obviously this is not
> acceptable. (I also tried with JDK 11 but found the same behaviour.)
>  
> On Linux, I haven't been able to reproduce the problem. I tried with 5, 9, 15
> and 23 ensemble members and the quorum always started properly within a
> few seconds. (I used OpenJDK 1.8.232 on Ubuntu 18.04.)
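The filtering step described in the issue can be sketched roughly as follows. This is not ZooKeeper's MultiAddress code. `ReachabilityFilterSketch`, `filterReachable` and the `ReachabilityCheck` interface are hypothetical names, and the check is injectable so it can be stubbed. The real check calls `InetAddress.isReachable(int)` with the 500ms timeout mentioned above, prints how long each check took, and throws `NoRouteToHostException` when no address passes, mirroring the log message quoted in the issue.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NoRouteToHostException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ReachabilityFilterSketch {

    // Default timeout mentioned in the issue description.
    static final int DEFAULT_TIMEOUT_MS = 500;

    // Hypothetical seam so the reachability check can be stubbed out.
    interface ReachabilityCheck {
        boolean isReachable(InetAddress address, int timeoutMs) throws IOException;
    }

    // Real check: delegates to the JDK, which per its Javadoc uses ICMP echo if
    // privileged, otherwise a TCP connection to port 7 (Echo).
    static final ReachabilityCheck JDK_CHECK = (address, timeoutMs) -> address.isReachable(timeoutMs);

    // Keeps only the addresses that pass the check within the timeout and
    // throws if none do.
    static List<InetSocketAddress> filterReachable(List<InetSocketAddress> addresses,
                                                   ReachabilityCheck check,
                                                   int timeoutMs) throws NoRouteToHostException {
        List<InetSocketAddress> reachable = new ArrayList<>();
        for (InetSocketAddress addr : addresses) {
            if (addr.isUnresolved()) {
                continue; // no IP address to probe, treat as unreachable
            }
            long start = System.nanoTime();
            try {
                boolean ok = check.isReachable(addr.getAddress(), timeoutMs);
                long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
                System.out.printf("%s reachable=%b after %d ms%n", addr, ok, elapsedMs);
                if (ok) {
                    reachable.add(addr);
                }
            } catch (IOException e) {
                // An I/O error during the check counts as "not reachable".
                System.out.printf("%s check failed: %s%n", addr, e);
            }
        }
        if (reachable.isEmpty()) {
            throw new NoRouteToHostException("No valid address among " + addresses);
        }
        return reachable;
    }

    public static void main(String[] args) {
        // Loopback should always be reachable, but the check itself may still time out.
        InetSocketAddress local = new InetSocketAddress(InetAddress.getLoopbackAddress(), 4190);
        try {
            System.out.println("kept: "
                    + filterReachable(Collections.singletonList(local), JDK_CHECK, DEFAULT_TIMEOUT_MS));
        } catch (NoRouteToHostException e) {
            System.out.println("filter failed: " + e.getMessage());
        }
    }
}
```

Each `isReachable` call can block for up to the full timeout. Printing the elapsed time per check on an affected Mac should therefore show whether the calls themselves are what time out.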



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
