[
https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105395#comment-17105395
]
Mate Szalay-Beko edited comment on ZOOKEEPER-3814 at 5/12/20, 1:27 PM:
-----------------------------------------------------------------------
Update: I was wrong, the order of the rolling restart does not seem to matter. I got the same error simply by:
- have server.1, server.2, server.3 up and running
- stop server.3
- start server.4 with the new config (but re-using the data and config folder of server.3)
I think the problem is that {{server.3}} was somehow committed locally as part of the last valid view of the quorum. And when {{server.4}} comes up, it gets {{server.3}} from somewhere. Interestingly, it doesn't get it from {{zoo.cfg.dynamic.next}}.
When I did the following test, I still got the same problem:
- have server.1, server.2, server.3 up and running
- stop server.3
- delete {{zoo.cfg.dynamic.next}} from the config folder of server 3/4
- start server.4 with the new config (but re-using the data and config folder of server.3)
- at this point I still see the same errors in the log, and I also notice that the freshly generated {{zoo.cfg.dynamic.next}} is still wrong (a quick way to hunt for the stale entry on disk is sketched right after this list)
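To find where the stale {{server.3}} entry survives on disk, a brute-force byte search over the snapshots, transaction logs, and dynamic config files is enough: even inside the jute-serialized files, the server strings are stored as plain UTF-8 bytes. This is only a diagnostic sketch; the paths are assumptions for a default layout and must be adjusted to the actual dataDir / config dir:
{code:python}
import glob

# Assumed locations -- adjust to your dataDir / config dir layout.
candidates = (
    glob.glob('/zookeeper-data/version-2/snapshot.*')  # snapshots
    + glob.glob('/zookeeper-data/version-2/log.*')     # transaction logs
    + glob.glob('/etc/zookeeper/zoo.cfg.dynamic*')     # generated dynamic configs
)

needle = b'server.3='  # the entry that should be gone after the restart
for path in candidates:
    with open(path, 'rb') as f:
        if needle in f.read():
            print('stale entry still present in', path)
{code}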
I tried to reproduce the same steps with 3.4.14 and didn't get any errors like these. So this really seems to be a bug (or at least something that shouldn't happen / should have been documented... we should be backward compatible, especially when dynamic config is disabled). I need to dig into the code now to find the problem.
> ZooKeeper caching of config
> ---------------------------
>
> Key: ZOOKEEPER-3814
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection, quorum, server
> Affects Versions: 3.5.6
> Reporter: Rajkiran Sura
> Assignee: Mate Szalay-Beko
> Priority: Major
>
> Hello,
> We recently upgraded our 5-node ZooKeeper ensemble from v3.4.8 to v3.5.6 and
> encountered no issues as such.
> This is what the ZooKeeper config looks like:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.enableSasl=true
> quorum.auth.kerberos.servicePrincipal=zookeeper/_HOST
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post-upgrade, we had to migrate server.22 on the same node, but with the
> *FOO*.bar.com domain name, due to Kerberos referral issues. And we used a
> different server identifier, i.e., *23*, when we migrated. So here is how the
> new config looked:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
> {quote}
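> Both the server id (*22* -> *23*) and the domain name change in this
> migration. For clarity, each entry in the static config has the shape
> {{server.<id>=<host>:<quorumPort>:<electionPort>;<clientPort>}}; a minimal
> illustrative parser (just a sketch to make the fields explicit, not
> ZooKeeper's own code):
> {code:python}
> # Illustrative only: split a zoo.cfg server line into its fields.
> def parse_server_line(line):
>     key, _, value = line.partition('=')
>     sid = int(key.split('.')[1])                     # server id, e.g. 22
>     endpoint, _, client_port = value.partition(';')  # ';' separates client port
>     host, quorum_port, election_port = endpoint.split(':')
>     return sid, host, int(quorum_port), int(election_port), int(client_port)
>
> print(parse_server_line('server.22=node5.bar.com:2888:3888;2181'))
> # -> (22, 'node5.bar.com', 2888, 3888, 2181)
> {code}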
> We restarted all the nodes in the ensemble with the above updated config. The
> migrated node joined the quorum successfully and was serving all clients
> directly connected to it, without any issues.
> Recently, when a leader election happened,
> server.*23*=node5.foo.bar.com (the migrated node) was chosen as Leader (as it
> has the highest ID). But then ZooKeeper was unable to serve any clients, and
> *all* the servers were _somehow still_ trying to establish a channel to 22
> (old DNS name: node5.bar.com), throwing the below error in a loop:
> {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve address: node4.bar.com}}
> {{java.net.UnknownHostException: node5.bar.com: Name or service not known}}
> {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}}
> {{ at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
> {{ at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
> {{ at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
> {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}}
> {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
> {{ at java.base/java.lang.Thread.run(Thread.java:834)}}
> {{2020-05-02 01:43:03,026 [myid:23] - WARN [WorkerSender[myid=23]:QuorumCnxManager@679] - Cannot open channel to 22 at election address node5.bar.com:3888}}
> {{java.net.UnknownHostException: node5.bar.com}}
> {{ at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220)}}
> {{ at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:403)}}
> {{ at java.base/java.net.Socket.connect(Socket.java:591)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:650)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:714)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
> {{ at java.base/java.lang.Thread.run(Thread.java:834)}}
> {quote}
> Fetching the config from the live ZooKeeper znode also doesn't show "*22*"
> being a member of the ensemble. It's not clear how "22" is still coming into
> the picture.
> {quote}In [4]: zk.get('/zookeeper/config')
> Out[4]:
> ('server.17=node1.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.19=node2.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.20=node3.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.21=node4.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.23=node5.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> version=0',
> ZnodeStat(czxid=0, mzxid=0, ctime=0, mtime=1588399290245, version=-1,
> cversion=0, aversion=-1, ephemeralOwner=0, dataLength=360, numChildren=0,
> pzxid=0))
> {quote}
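> To rule out a per-server discrepancy, each member can be asked individually
> for its committed config. A minimal sketch using the kazoo Python client
> (hostnames are the ones from the config above; ports and the required SASL
> client auth would need to be adjusted to your setup):
> {code:python}
> from kazoo.client import KazooClient
>
> # Ensemble members from the config above.
> hosts = [
>     'node1.foo.bar.com', 'node2.foo.bar.com', 'node3.foo.bar.com',
>     'node4.foo.bar.com', 'node5.foo.bar.com',
> ]
>
> for host in hosts:
>     zk = KazooClient(hosts=f'{host}:2181')  # connect to one member only
>     zk.start()
>     try:
>         data, _stat = zk.get('/zookeeper/config')
>         print(f'--- {host} ---')
>         print(data.decode('utf-8'))
>     finally:
>         zk.stop()
> {code}
> If every member returns the five-entry view shown above, the stale *22* must
> come from state that is not reflected in the config znode itself.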
> We suspected some weird caching issue and restarted ZooKeeper across all the
> nodes, but that didn't help: whenever node5 becomes the Leader, ID 22 pops up.
> We even rebooted node5, and that didn't help either.
> We also looked at '/zookeeper/config' content from snapshot files and did not
> find any reference to ID:22.
> Any help would be greatly appreciated.
> NOTE: dynamic config is disabled.
> Thanks,
> Rajkiran