[
https://issues.apache.org/jira/browse/ZOOKEEPER-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061525#comment-17061525
]
Mate Szalay-Beko commented on ZOOKEEPER-3758:
---------------------------------------------
I was able to test the rolling upgrade with 3 servers using these scripts:
https://github.com/symat/zk-rolling-upgrade-test
I haven’t tried it with 5 servers yet, but I don’t think the number of servers
would be an issue here.
But I have a hypothesis. I think ICMP (ping) traffic is blocked in your
environment by a firewall or an OS setting, and this makes you hit a bug in
the code. Since ZooKeeper 3.6.0 you can specify multiple addresses for each
ZooKeeper server instance (this can increase availability when multiple
physical network interfaces can be used in parallel in the cluster). To find
the reachable addresses, ZooKeeper performs ICMP ECHO requests or tries to
establish a TCP connection on port 7 (Echo) of the destination host. This
should happen only if you provide multiple addresses in the configuration, so
in your case ZooKeeper shouldn’t make any ICMP requests. But looking at the
code, I found that it might do so anyway, and if ZooKeeper cannot reach the
current leader using ICMP, that would explain the exception you see.
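A minimal Java sketch of why an empty set of reachable addresses would lead to
this kind of stack trace. It assumes (this is not confirmed against the 3.6.0
source here) that {{Learner.connectToLeader}} sizes a fixed thread pool by the
number of leader addresses it considers reachable. The class
{{EmptyPoolSketch}}, the helper {{tryCreatePool}}, and the host name and port
are hypothetical and only for illustration.

```java
import java.net.InetSocketAddress;
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class EmptyPoolSketch {

    // Hypothetical helper: one connector thread per reachable leader address.
    // Assumption: the pool size equals the number of reachable addresses.
    static String tryCreatePool(Set<InetSocketAddress> reachableAddresses) {
        try {
            ExecutorService pool =
                    Executors.newFixedThreadPool(reachableAddresses.size());
            pool.shutdown();
            return "pool created";
        } catch (IllegalArgumentException e) {
            // Thrown by the ThreadPoolExecutor constructor when the size is 0.
            return "IllegalArgumentException";
        }
    }

    public static void main(String[] args) {
        // No address passed the reachability check (e.g. ICMP is blocked).
        Set<InetSocketAddress> none = Collections.emptySet();
        // Hypothetical leader address that passed the check.
        Set<InetSocketAddress> one = Collections.singleton(
                InetSocketAddress.createUnresolved("leader.example", 2888));

        System.out.println(tryCreatePool(none)); // prints "IllegalArgumentException"
        System.out.println(tryCreatePool(one));  // prints "pool created"
    }
}
```

{{Executors.newFixedThreadPool(0)}} throws an {{IllegalArgumentException}}
without a message, which is consistent with the trace in the report.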
Fortunately there is a workaround: set
{{multiAddress.reachabilityCheckEnabled=false}} in zoo.cfg, or use the
{{-Dzookeeper.multiAddress.reachabilityCheckEnabled=false}} system property.
This should turn off the ICMP check regardless of whether you provide a single
address or multiple addresses. Can you please try this parameter? If it helps,
it confirms my theory and I can provide a quick fix.
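The intended effect of the flag can be sketched in Java as follows. This is
not ZooKeeper's actual code: the class {{ReachabilityFlagSketch}} and its
methods are hypothetical, and the default of "enabled" when the property is
unset is an assumption. The probe uses {{InetAddress.isReachable}}, which
tries an ICMP ECHO request or a TCP connection to port 7 of the target host.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class ReachabilityFlagSketch {

    static final String PROP = "zookeeper.multiAddress.reachabilityCheckEnabled";

    // Assumption: the check defaults to enabled when the property is not set.
    static boolean checkEnabled() {
        return Boolean.parseBoolean(System.getProperty(PROP, "true"));
    }

    // Real-network probe: ICMP ECHO or TCP port 7, with a 1 second timeout.
    static boolean icmpOrEchoProbe(InetAddress address) {
        try {
            return address.isReachable(1000);
        } catch (IOException e) {
            return false;
        }
    }

    // With the check disabled, every configured address is kept as-is;
    // otherwise only the addresses accepted by the probe are kept.
    static List<InetAddress> usableAddresses(List<InetAddress> configured,
                                             Predicate<InetAddress> probe) {
        if (!checkEnabled()) {
            return new ArrayList<>(configured);
        }
        List<InetAddress> reachable = new ArrayList<>();
        for (InetAddress address : configured) {
            if (probe.test(address)) {
                reachable.add(address);
            }
        }
        return reachable;
    }

    public static void main(String[] args) {
        List<InetAddress> configured = new ArrayList<>();
        configured.add(InetAddress.getLoopbackAddress());

        // Simulates an environment where the probe always fails.
        Predicate<InetAddress> blocked = address -> false;

        System.setProperty(PROP, "true");
        System.out.println(usableAddresses(configured, blocked).size()); // prints 0

        System.setProperty(PROP, "false");
        System.out.println(usableAddresses(configured, blocked).size()); // prints 1
    }
}
```

To probe the real network instead of the simulated one, pass
{{ReachabilityFlagSketch::icmpOrEchoProbe}} as the second argument.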
You can verify the current value of this config by looking for the following
INFO log message in the ZooKeeper logs: “multiAddress.reachabilityCheckEnabled
set to “.
> Update from 3.5.7 to 3.6.0 does not work
> ----------------------------------------
>
> Key: ZOOKEEPER-3758
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3758
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Reporter: Agostino Sarubbo
> Assignee: Mate Szalay-Beko
> Priority: Major
>
> Hello,
> we have a cluster with 5 zookeeper servers. We tried the update from 3.5.7
> to 3.6.0 but it does not work.
> We got the following:
> {code:java}
> 2020-03-16 10:40:45,514 [myid:1] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@863] - Peer state changed: looking
> 2020-03-16 10:40:45,514 [myid:1] - WARN [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@1501] - PeerState set to LOOKING
> 2020-03-16 10:40:45,514 [myid:1] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@1371] - LOOKING
> 2020-03-16 10:40:45,514 [myid:1] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):FastLeaderElection@931] - New election. My id = 1, proposed zxid=0x0
> 2020-03-16 10:40:45,515 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:1, n.round:0x1b, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,517 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:2, n.state:FOLLOWING, n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b00000004, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,517 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:3, n.state:FOLLOWING, n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b00000004, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,517 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:5, n.state:FOLLOWING, n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b00000004, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,518 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:4, n.state:LEADING, n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b00000004, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,518 [myid:1] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@857] - Peer state changed: following
> 2020-03-16 10:40:45,518 [myid:1] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@1453] - FOLLOWING
> 2020-03-16 10:40:45,518 [myid:1] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ZooKeeperServer@1246] - minSessionTimeout set to 4000
> 2020-03-16 10:40:45,518 [myid:1] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ZooKeeperServer@1255] - maxSessionTimeout set to 40000
> 2020-03-16 10:40:45,519 [myid:1] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ResponseCache@45] - Response cache size is initialized with value 400.
> 2020-03-16 10:40:45,519 [myid:1] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ResponseCache@45] - Response cache size is initialized with value 400.
> 2020-03-16 10:40:45,519 [myid:1] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@111] - zookeeper.pathStats.slotCapacity = 60
> 2020-03-16 10:40:45,519 [myid:1] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@112] - zookeeper.pathStats.slotDuration = 15
> 2020-03-16 10:40:45,519 [myid:1] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@113] - zookeeper.pathStats.maxDepth = 6
> 2020-03-16 10:40:45,519 [myid:1] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@114] - zookeeper.pathStats.initialDelay = 5
> 2020-03-16 10:40:45,519 [myid:1] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@115] - zookeeper.pathStats.delay = 5
> 2020-03-16 10:40:45,519 [myid:1] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@116] - zookeeper.pathStats.enabled = false
> 2020-03-16 10:40:45,519 [myid:1] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ZooKeeperServer@1470] - The max bytes for all large requests are set to 104857600
> 2020-03-16 10:40:45,519 [myid:1] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ZooKeeperServer@1484] - The large request threshold is set to -1
> 2020-03-16 10:40:45,519 [myid:1] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ZooKeeperServer@329] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 clientPortListenBacklog -1 datadir /opt/loway/zookeeper/logs/version-2 snapdir /opt/loway/zookeeper/data/version-2
> 2020-03-16 10:40:45,519 [myid:1] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):Follower@75] - FOLLOWING - LEADER ELECTION TOOK - 4 MS
> 2020-03-16 10:40:45,519 [myid:1] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@863] - Peer state changed: following - discovery
> 2020-03-16 10:40:46,521 [myid:1] - WARN [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):Follower@129] - Exception when following the leader
> java.lang.IllegalArgumentException
>     at java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1314)
>     at java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1202)
>     at java.util.concurrent.Executors.newFixedThreadPool(Executors.java:89)
>     at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:275)
>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:87)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1455)
> 2020-03-16 10:40:46,521 [myid:1] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):Follower@292] - shutdown Follower
> {code}