[ https://issues.apache.org/jira/browse/ZOOKEEPER-3698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019297#comment-17019297 ]
Mate Szalay-Beko commented on ZOOKEEPER-3698:
---------------------------------------------

I can agree with 3) and 4). Regarding 3), I will add to the documentation that disabling the reachability check will prevent the cluster from reconfiguring itself properly during network problems, so disabling it is useful only for testing.

Also, regarding 1), I don't think the parallel stream would be an issue here.

Regarding 2), I think it won't hurt, but it won't really help either. I tested it by raising the hardcoded timeout value to 5 sec, and it didn't solve the problem. (My hypothesis is that the ICMP calls might fail quickly on Mac, without waiting or retrying for the rest of the timeout period.) Still, being able to fine-tune this parameter might be a good idea in production environments.

I also agree with the slightly higher default value of 1 sec. Precisely because of the parallel stream, we should get back the address from the list with the quickest ping time.

> NoRouteToHostException when starting large ZooKeeper cluster on localhost
> -------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3698
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3698
>             Project: ZooKeeper
>          Issue Type: Bug
>            Reporter: Mate Szalay-Beko
>            Assignee: Mate Szalay-Beko
>            Priority: Major
>             Fix For: 3.6.0
>
>
> During testing of the RC for 3.6.0, we found that a ZooKeeper cluster with a
> large number of ensemble members (e.g. 23) cannot start properly. We see a
> lot of warnings in the log:
> {code:java}
> 2020-01-15 20:02:13,431 [myid:13] - WARN [ListenerHandler-phunt-MBP13.local/192.168.1.91:4193:QuorumCnxManager@691] - None of the addresses (/192.168.1.91:4190) are reachable for sid 10
> java.net.NoRouteToHostException: No valid address among [/192.168.1.91:4190]
> {code}
>
> The exception happens when the new MultiAddress feature tries to filter
> the unreachable hosts from the address list.
> This involves calling the InetAddress.isReachable method with a default
> timeout of 500 ms, which goes down to a native call in Java and basically
> tries to ping the host (send an ICMP echo request). Naturally, localhost
> should always be reachable. For some reason, this call times out on Mac if
> we have many ensemble members. I tested with 9 members and the cluster
> started properly. With 11, 13 and 15 members it took more and more time to
> get the cluster to start, and the "NoRouteToHostException" started to appear
> in the logs. After around 1 minute the 15-member cluster started, but
> obviously this is not acceptable. (I also tried with JDK 11 and found the
> same behaviour.)
>
> On Linux, I haven't been able to reproduce the problem. I tried with 5, 9, 15
> and 23 ensemble members and the quorum always seems to start properly in a
> few seconds. (I used OpenJDK 1.8.232 on Ubuntu 18.04)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
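
For readers following the thread, the reachability filtering discussed above can be sketched roughly as follows. This is a minimal illustration, not the actual ZooKeeper code: the class name `ReachabilitySketch`, the helper names, and the injectable `check` predicate are assumptions made for the example. Only `InetAddress.isReachable(int)`, `parallelStream()`, and `findAny()` are real JDK calls. Note that `findAny()` on a parallel stream returns whichever matching element becomes available, which will usually be an address whose check finished early but is not strictly guaranteed to be the fastest one.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NoRouteToHostException;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of "filter unreachable addresses, return any reachable one".
class ReachabilitySketch {

    // Ping a single address; any I/O error or unresolved host counts as unreachable.
    static boolean isReachable(InetSocketAddress address, int timeoutMs) {
        InetAddress inet = address.getAddress(); // null when the host name is unresolved
        if (inet == null) {
            return false;
        }
        try {
            return inet.isReachable(timeoutMs);
        } catch (IOException e) {
            return false;
        }
    }

    // Run the checks in parallel and return any address that passes.
    // The check is a parameter so it can be tuned, disabled, or stubbed in tests.
    static InetSocketAddress pickReachable(List<InetSocketAddress> addresses,
                                           Predicate<InetSocketAddress> check)
            throws NoRouteToHostException {
        return addresses.parallelStream()
                .filter(check)
                .findAny()
                .orElseThrow(() ->
                        new NoRouteToHostException("No valid address among " + addresses));
    }

    public static void main(String[] args) {
        // Port numbers taken from the log excerpt; the timeout is the proposed 1 sec default.
        List<InetSocketAddress> addresses = Arrays.asList(
                new InetSocketAddress(InetAddress.getLoopbackAddress(), 4190),
                new InetSocketAddress(InetAddress.getLoopbackAddress(), 4193));
        int timeoutMs = 1000;
        try {
            System.out.println("Picked: " + pickReachable(addresses, a -> isReachable(a, timeoutMs)));
        } catch (NoRouteToHostException e) {
            System.out.println("Unreachable: " + e.getMessage());
        }
    }
}
```

Passing `a -> true` as the check corresponds to disabling the reachability check entirely, which, as noted in the comment, would only make sense in test setups.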