[
https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110998#comment-17110998
]
Mate Szalay-Beko edited comment on ZOOKEEPER-3814 at 5/19/20, 12:09 PM:
------------------------------------------------------------------------
[~rajsura] I created a PR (https://github.com/apache/zookeeper/pull/1356) for
ZOOKEEPER-3829, and using this patch I was able to perform the following steps:
* have server.1, server.2, server.3, server.4, server.5 up and running
* stop server.5
* stop server.1
* start server.1 with the new config (removing server.5, adding server.6 with
the new hostname)
* stop server.2
* start server.2 with the new config
* stop server.3
* start server.3 with the new config
* stop server.4
* start server.4 with the new config
* start server.6 with the new config (but re-using the data folder of server.5)
During these steps the cluster stayed up and running and always had at least 3
members (the "new config" used above is sketched below). At the end I checked
the logfiles of server.6 and did not see any attempt to connect to server.5.
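For reference, the new config on each node looked roughly like this (the
hostnames here are placeholders, the rest of the settings stayed unchanged):
{quote}server.1=host1:2888:3888;2181
server.2=host2:2888:3888;2181
server.3=host3:2888:3888;2181
server.4=host4:2888:3888;2181
server.6=host6-new:2888:3888;2181
{quote}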
I also tried a different sequence (although I think it makes less sense):
* have server.1, server.2, server.3, server.4, server.5 up and running
* stop server.5
* start server.6 with the new config (removing server.5, adding server.6 with
the new hostname), re-using the data folder of server.5
* stop server.1
* start server.1 with the new config
* stop server.2
* start server.2 with the new config
* stop server.3
* start server.3 with the new config
* stop server.4
* start server.4 with the new config
* stop server.6
* start server.6 with the new config
In this case I saw that server.6 was still trying to connect to server.5 after
the first restart, but never after the second restart. I don't consider this a
big deal, as I don't think this is a good sequence anyway: it is more logical
to restart all the other nodes (so that their config gets updated) before
starting the new server.6.
Could you please share the sequence of steps you were executing when you saw
the original issue?
> ZooKeeper caching of config
> ---------------------------
>
> Key: ZOOKEEPER-3814
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection, quorum, server
> Affects Versions: 3.5.6
> Reporter: Rajkiran Sura
> Assignee: Mate Szalay-Beko
> Priority: Major
>
> Hello,
> We recently upgraded our 5-node ZooKeeper ensemble from v3.4.8 to v3.5.6.
> Encountered no issues as such.
> This is what the ZooKeeper config looks like:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.enableSasl=true
> quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post upgrade, we had to migrate server.22 on the same node, but with the
> *FOO*.bar.com domain name due to Kerberos referral issues. And we used a
> different server identifier, i.e., *23*, when we migrated. So, here is what
> the new config looked like:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
> {quote}
> We restarted all the nodes in the ensemble with the above updated config.
> The migrated node joined the quorum successfully and was serving all clients
> directly connected to it, without any issues.
> Recently, when a leader election happened,
> server.*23*=node5.foo.bar.com (the migrated node) was chosen as Leader (as it
> has the highest ID). But then, ZooKeeper was unable to serve any clients and
> *all* the servers were _somehow still_ trying to establish a channel to 22
> (old DNS name: node5.bar.com) and were throwing the below error in a loop:
> {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN
> [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve
> address: node4.bar.com}}
> {{java.net.UnknownHostException: node5.bar.com: Name or service not known}}
> {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}}
> {{ at
> java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
> {{ at
> java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
> {{ at
> java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
> {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}}
> {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
> {{ at java.base/java.lang.Thread.run(Thread.java:834)}}
> {{2020-05-02 01:43:03,026 [myid:23] - WARN
> [WorkerSender[myid=23]:QuorumCnxManager@679] - Cannot open channel to 22 at
> election address node5.bar.com:3888}}
> {{java.net.UnknownHostException: node5.bar.com}}
> {{ at
> java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220)}}
> {{ at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:403)}}
> {{ at java.base/java.net.Socket.connect(Socket.java:591)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:650)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:714)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
> {{ at java.base/java.lang.Thread.run(Thread.java:834)}}
> {quote}
> Fetching the config from the live ZooKeeper znode also doesn't show "*22*" as
> a member of the ensemble. It's not clear how "22" is still coming into the
> picture.
> {quote}In [4]: zk.get('/zookeeper/config')
> Out[4]:
> ('server.17=node1.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.19=node2.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.20=node3.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.21=node4.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.23=node5.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> version=0',
> ZnodeStat(czxid=0, mzxid=0, ctime=0, mtime=1588399290245, version=-1,
> cversion=0, aversion=-1, ephemeralOwner=0, dataLength=360, numChildren=0,
> pzxid=0))
> {quote}
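> (The same check can be repeated against each ensemble member individually; a
> minimal kazoo sketch of that, with placeholder hostnames:)
> {code:python}
> from kazoo.client import KazooClient
>
> # Placeholder hostnames; substitute the actual ensemble members.
> members = ["node1.foo.bar.com", "node2.foo.bar.com", "node3.foo.bar.com",
>            "node4.foo.bar.com", "node5.foo.bar.com"]
>
> for host in members:
>     zk = KazooClient(hosts=host + ":2181")
>     zk.start()
>     try:
>         data, stat = zk.get("/zookeeper/config")
>         config = data.decode("utf-8")
>         # Report whether this member's view of the config still mentions 22
>         print(host, "mentions server.22:", "server.22" in config)
>     finally:
>         zk.stop()
>         zk.close()
> {code}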
> We suspected some weird caching issue and restarted ZooKeeper across all the
> nodes, but that didn't help. So, whenever node5 becomes the Leader, ID:22
> pops up. We even rebooted node5, and that hasn't helped either.
> We also looked at '/zookeeper/config' content from snapshot files and did not
> find any reference to ID:22.
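> (One crude way to do that check: the config text is stored as the data of the
> /zookeeper/config znode, so a stale entry should show up as a literal byte
> string in the snapshot files. A rough sketch, assuming the default dataDir
> layout used above:)
> {code:python}
> import glob
>
> # Placeholder path; point this at the dataDir of each node.
> for snap in sorted(glob.glob("/zookeeper-data/version-2/snapshot.*")):
>     with open(snap, "rb") as f:
>         raw = f.read()
>     # Znode data is serialized as raw bytes, so a stale config entry
>     # would appear as one of the literal strings below.
>     print(snap, b"server.22=" in raw, b"node5.bar.com" in raw)
> {code}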
> Any help would be greatly appreciated.
> NOTE: dynamic config is disabled.
> Thanks,
> Rajkiran
--
This message was sent by Atlassian Jira
(v8.3.4#803005)