[
https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105395#comment-17105395
]
Mate Szalay-Beko edited comment on ZOOKEEPER-3814 at 5/12/20, 1:27 PM:
-----------------------------------------------------------------------
Update: I was wrong, the order of the rolling restart does not seem to matter. I got the same error simply by:
- have server.1, server.2, server.3 up and running
- stop server.3
- start server.4 with the new config (but re-using the data and config folder of server.3)
I think the problem is that {{server.3}} was somehow committed locally as part of the last valid view of the quorum. And when {{server.4}} comes up, it gets {{server.3}} from somewhere. Interestingly, it doesn't get it from {{zoo.cfg.dynamic.next}}.
When I did the following test, I still got the same problem:
- have server.1, server.2, server.3 up and running
- stop server.3
- delete {{zoo.cfg.dynamic.next}} from the config folder of server 3/4
- start server.4 with the new config (but re-using the data and config folder of server.3)
- at this point I still see the same errors in the log, and I also notice that the freshly generated {{zoo.cfg.dynamic.next}} is still wrong (a quick way to hunt for the stale entry on disk is sketched right after this list)
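To find where the stale {{server.3}} entry survives on disk, a brute-force byte search over the snapshots, transaction logs, and dynamic config files is enough: even inside the jute-serialized files, the server strings are stored as plain UTF-8 bytes. This is only a diagnostic sketch; the paths are assumptions for a default layout and must be adjusted to the actual dataDir / config dir:
{code:python}
import glob

# Assumed locations -- adjust to your dataDir / config dir layout.
candidates = (
    glob.glob('/zookeeper-data/version-2/snapshot.*')  # snapshots
    + glob.glob('/zookeeper-data/version-2/log.*')     # transaction logs
    + glob.glob('/etc/zookeeper/zoo.cfg.dynamic*')     # generated dynamic configs
)

needle = b'server.3='  # the entry that should be gone after the restart
for path in candidates:
    with open(path, 'rb') as f:
        if needle in f.read():
            print('stale entry still present in', path)
{code}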
I tried to reproduce the same steps with 3.4.14 and didn't get any errors like these. So this really seems to be a bug (or at least something that shouldn't happen / should have been documented... we should be backward compatible, especially when dynamic config is disabled). I need to dig into the code now to find the problem.
> ZooKeeper caching of config
> ---------------------------
>
> Key: ZOOKEEPER-3814
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection, quorum, server
> Affects Versions: 3.5.6
> Reporter: Rajkiran Sura
> Assignee: Mate Szalay-Beko
> Priority: Major
>
> Hello,
> We recently upgraded our 5-node ZooKeeper ensemble from v3.4.8 to v3.5.6 and
> encountered no issues as such.
> This is what the ZooKeeper config looks like:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.enableSasl=true
> quorum.auth.kerberos.servicePrincipal=zookeeper/_HOST
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post-upgrade, we had to migrate server.22 on the same node, but with the
> *FOO*.bar.com domain name, due to Kerberos referral issues. And we used a
> different server identifier, i.e., *23*, when we migrated. So here is how the
> new config looked:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
> {quote}
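> Both the server id (*22* -> *23*) and the domain name change in this
> migration. For clarity, each entry in the static config has the shape
> {{server.<id>=<host>:<quorumPort>:<electionPort>;<clientPort>}}; a minimal
> illustrative parser (just a sketch to make the fields explicit, not
> ZooKeeper's own code):
> {code:python}
> # Illustrative only: split a zoo.cfg server line into its fields.
> def parse_server_line(line):
>     key, _, value = line.partition('=')
>     sid = int(key.split('.')[1])                     # server id, e.g. 22
>     endpoint, _, client_port = value.partition(';')  # ';' separates client port
>     host, quorum_port, election_port = endpoint.split(':')
>     return sid, host, int(quorum_port), int(election_port), int(client_port)
>
> print(parse_server_line('server.22=node5.bar.com:2888:3888;2181'))
> # -> (22, 'node5.bar.com', 2888, 3888, 2181)
> {code}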
> We restarted all the nodes in the ensemble with the above updated config. The
> migrated node joined the quorum successfully and was serving all clients
> directly connected to it, without any issues.
> Recently, when a leader election happened,
> server.*23*=node5.foo.bar.com (the migrated node) was chosen as Leader (as it
> has the highest ID). But then ZooKeeper was unable to serve any clients, and
> *all* the servers were _somehow still_ trying to establish a channel to 22
> (old DNS name: node5.bar.com), throwing the below error in a loop:
> {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve address: node4.bar.com}}
> {{java.net.UnknownHostException: node5.bar.com: Name or service not known}}
> {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}}
> {{ at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
> {{ at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
> {{ at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
> {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}}
> {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
> {{ at java.base/java.lang.Thread.run(Thread.java:834)}}
> {{2020-05-02 01:43:03,026 [myid:23] - WARN [WorkerSender[myid=23]:QuorumCnxManager@679] - Cannot open channel to 22 at election address node5.bar.com:3888}}
> {{java.net.UnknownHostException: node5.bar.com}}
> {{ at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220)}}
> {{ at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:403)}}
> {{ at java.base/java.net.Socket.connect(Socket.java:591)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:650)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:714)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
> {{ at java.base/java.lang.Thread.run(Thread.java:834)}}
> {quote}
> Fetching the config from the live ZooKeeper znode also doesn't show "*22*"
> being a member of the ensemble. It's not clear how "22" is still coming into
> the picture.
> {quote}In [4]: zk.get('/zookeeper/config')
> Out[4]:
> ('server.17=node1.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.19=node2.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.20=node3.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.21=node4.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.23=node5.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> version=0',
> ZnodeStat(czxid=0, mzxid=0, ctime=0, mtime=1588399290245, version=-1,
> cversion=0, aversion=-1, ephemeralOwner=0, dataLength=360, numChildren=0,
> pzxid=0))
> {quote}
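> To rule out a per-server discrepancy, each member can be asked individually
> for its committed config. A minimal sketch using the kazoo Python client
> (hostnames are the ones from the config above; ports and the required SASL
> client auth would need to be adjusted to your setup):
> {code:python}
> from kazoo.client import KazooClient
>
> # Ensemble members from the config above.
> hosts = [
>     'node1.foo.bar.com', 'node2.foo.bar.com', 'node3.foo.bar.com',
>     'node4.foo.bar.com', 'node5.foo.bar.com',
> ]
>
> for host in hosts:
>     zk = KazooClient(hosts=f'{host}:2181')  # connect to one member only
>     zk.start()
>     try:
>         data, _stat = zk.get('/zookeeper/config')
>         print(f'--- {host} ---')
>         print(data.decode('utf-8'))
>     finally:
>         zk.stop()
> {code}
> If every member returns the five-entry view shown above, the stale *22* must
> come from state that is not reflected in the config znode itself.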
> We suspected some weird caching issue and restarted ZooKeeper across all the
> nodes, but that didn't help: whenever node5 becomes the Leader, ID 22 pops up.
> We even rebooted node5, and that didn't help either.
> We also looked at '/zookeeper/config' content from snapshot files and did not
> find any reference to ID:22.
> Any help would be greatly appreciated.
> NOTE: dynamic config is disabled.
> Thanks,
> Rajkiran