[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061525#comment-17061525
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3758:
---------------------------------------------

I was able to test the rolling upgrade with 3 servers using these scripts: 
https://github.com/symat/zk-rolling-upgrade-test
I haven’t tried it with 5 servers yet, but I don’t think the number of servers 
would be an issue here.

But I have a hypothesis. I think that in your environment the ICMP (ping) traffic is 
blocked by some firewall / OS setting, and that this makes you hit a bug in 
the code. Since ZooKeeper 3.6.0 you can specify multiple addresses for each 
ZooKeeper server instance (this can increase availability when multiple 
physical network interfaces can be used in parallel in the cluster). ZooKeeper 
will perform ICMP ECHO requests or try to establish a TCP connection on port 7 
(Echo) of the destination host in order to find the reachable addresses. This 
should happen only if you provide multiple addresses in the configuration, so in 
your case ZooKeeper shouldn’t do any ICMP requests. But looking at the code, I 
found that it might do so anyway, and if ZooKeeper cannot reach the current 
leader using ICMP, that would explain the exception you see.
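
To illustrate the hypothesis, here is a minimal Java sketch. It is not the 
actual ZooKeeper code: the class and method names are hypothetical, and it 
assumes that the connection thread pool is sized by the number of reachable 
leader addresses. It only shows that {{InetAddress.isReachable()}} is the kind 
of check described above, and that {{Executors.newFixedThreadPool(0)}} throws 
the same IllegalArgumentException that appears in the reported stack trace.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration only; none of these names come from ZooKeeper.
public class ReachabilitySketch {

    // Keep only the addresses for which InetAddress.isReachable() succeeds.
    // A typical JVM uses ICMP ECHO if it has the privilege, otherwise it
    // tries a TCP connection to port 7 (Echo) of the destination host.
    static List<InetSocketAddress> filterReachable(List<InetSocketAddress> all, int timeoutMs) {
        List<InetSocketAddress> reachable = new ArrayList<>();
        for (InetSocketAddress addr : all) {
            try {
                InetAddress ip = addr.getAddress();
                if (ip != null && ip.isReachable(timeoutMs)) {
                    reachable.add(addr);
                }
            } catch (IOException e) {
                // A network error is treated the same as "not reachable".
            }
        }
        return reachable;
    }

    // Assumption: one connection thread per reachable address.
    // With an empty list this calls newFixedThreadPool(0), which throws
    // IllegalArgumentException.
    static ExecutorService connectPool(List<InetSocketAddress> reachable) {
        return Executors.newFixedThreadPool(reachable.size());
    }

    public static void main(String[] args) {
        try {
            // Simulates a host where every reachability check failed.
            connectPool(new ArrayList<>()).shutdown();
        } catch (IllegalArgumentException e) {
            System.out.println("No reachable address, pool creation failed: " + e);
        }
    }
}
```

If ICMP and port 7 are both blocked, {{filterReachable}} returns an empty list 
even for a healthy leader, which is the situation the sketch simulates.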

Fortunately there is a workaround: set 
{{multiAddress.reachabilityCheckEnabled=false}} in zoo.cfg, or use the 
{{-Dzookeeper.multiAddress.reachabilityCheckEnabled=false}} system property. 
This should turn off the ICMP check regardless of whether you provide a single 
address or multiple addresses. Can you please try this parameter? If it helps, 
then it confirms my theory and I can provide a quick fix.
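
As a sketch of the first option, the setting is a single extra line in 
zoo.cfg. The comment lines are only annotations, and the rest of your existing 
configuration stays unchanged:

```properties
# zoo.cfg: keep your existing settings and add this line
multiAddress.reachabilityCheckEnabled=false

# Alternative: leave zoo.cfg untouched and pass this JVM option when
# starting the server
# -Dzookeeper.multiAddress.reachabilityCheckEnabled=false
```

Either way, the setting would need to be applied on every server being 
upgraded and takes effect after that server is restarted.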

You can verify the current value of this config by looking for the INFO log 
message in the ZooKeeper logs that starts with 
“multiAddress.reachabilityCheckEnabled set to”, followed by the value.

> Update from 3.5.7 to 3.6.0 does not work
> ----------------------------------------
>
>                 Key: ZOOKEEPER-3758
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3758
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>            Reporter: Agostino Sarubbo
>            Assignee: Mate Szalay-Beko
>            Priority: Major
>
> Hello,
>  we have a cluster with 5 zookeeper servers. We tried the update from 3.5.7 
> to 3.6.0 but it does not work.
> We got the following:
> {code:java}
> 2020-03-16 10:40:45,514 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@863] - Peer state changed: looking
> 2020-03-16 10:40:45,514 [myid:1] - WARN  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@1501] - PeerState set to LOOKING
> 2020-03-16 10:40:45,514 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@1371] - LOOKING
> 2020-03-16 10:40:45,514 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):FastLeaderElection@931] - New election. My id = 1, proposed zxid=0x0
> 2020-03-16 10:40:45,515 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:1, n.state:LOOKING , n.leader:1, n.round:0x1b, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,517 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:2, n.state:FOLLOWING, n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b00000004, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,517 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:3, n.state:FOLLOWING, n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b00000004, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,517 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:5, n.state:FOLLOWING, n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b00000004, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,518 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:4, n.state:LEADING , n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b00000004, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,518 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@857] - Peer state changed: following
> 2020-03-16 10:40:45,518 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@1453] - FOLLOWING
> 2020-03-16 10:40:45,518 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ZooKeeperServer@1246] - minSessionTimeout set to 4000
> 2020-03-16 10:40:45,518 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ZooKeeperServer@1255] - maxSessionTimeout set to 40000
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ResponseCache@45] - Response cache size is initialized with value 400.
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ResponseCache@45] - Response cache size is initialized with value 400.
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@111] - zookeeper.pathStats.slotCapacity = 60
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@112] - zookeeper.pathStats.slotDuration = 15
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@113] - zookeeper.pathStats.maxDepth = 6
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@114] - zookeeper.pathStats.initialDelay = 5
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@115] - zookeeper.pathStats.delay = 5
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@116] - zookeeper.pathStats.enabled = false
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ZooKeeperServer@1470] - The max bytes for all large requests are set to 104857600
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ZooKeeperServer@1484] - The large request threshold is set to -1
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ZooKeeperServer@329] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 clientPortListenBacklog -1 datadir /opt/loway/zookeeper/logs/version-2 snapdir /opt/loway/zookeeper/data/version-2
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):Follower@75] - FOLLOWING - LEADER ELECTION TOOK - 4 MS
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@863] - Peer state changed: following - discovery
> 2020-03-16 10:40:46,521 [myid:1] - WARN  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):Follower@129] - Exception when following the leader
> java.lang.IllegalArgumentException
>         at java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1314)
>         at java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1202)
>         at java.util.concurrent.Executors.newFixedThreadPool(Executors.java:89)
>         at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:275)
>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:87)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1455)
> 2020-03-16 10:40:46,521 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):Follower@292] - shutdown Follower
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
