[
https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110998#comment-17110998
]
Mate Szalay-Beko edited comment on ZOOKEEPER-3814 at 5/19/20, 12:09 PM:
------------------------------------------------------------------------
[~rajsura] I created a PR (https://github.com/apache/zookeeper/pull/1356) for
ZOOKEEPER-3829, and using this patch I was able to perform the following steps:
* have server.1, server.2, server.3, server.4, server.5 up and running
* stop server.5
* stop server.1
* start server.1 with the new config (removing server.5, adding server.6 with
the new hostname)
* stop server.2
* start server.2 with the new config
* stop server.3
* start server.3 with the new config
* stop server.4
* start server.4 with the new config
* start server.6 with the new config (but re-using the data folder of server.5)
During these steps the cluster stayed up and running and always had at least 3
members (the "new config" used above is sketched below). At the end I checked
the logfiles of server.6 and did not see any attempt to connect to server.5.
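For reference, the new config on each node looked roughly like this (the
hostnames here are placeholders, the rest of the settings stayed unchanged):
{quote}server.1=host1:2888:3888;2181
server.2=host2:2888:3888;2181
server.3=host3:2888:3888;2181
server.4=host4:2888:3888;2181
server.6=host6-new:2888:3888;2181
{quote}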
I also tried a different sequence (although I think it makes less sense):
* have server.1, server.2, server.3, server.4, server.5 up and running
* stop server.5
* start server.6 with the new config (removing server.5, adding server.6 with
the new hostname), re-using the data folder of server.5
* stop server.1
* start server.1 with the new config
* stop server.2
* start server.2 with the new config
* stop server.3
* start server.3 with the new config
* stop server.4
* start server.4 with the new config
* stop server.6
* start server.6 with the new config
In this case I saw that server.6 was still trying to connect to server.5 after
the first restart, but never after the second restart. I don't consider this a
big deal, as I don't think this is a good sequence anyway: it is more logical
to restart all the other nodes (so that their config gets updated) before
starting the new server.6.
Could you please share the sequence of steps you were executing when you saw
the original issue?
> ZooKeeper caching of config
> ---------------------------
>
> Key: ZOOKEEPER-3814
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection, quorum, server
> Affects Versions: 3.5.6
> Reporter: Rajkiran Sura
> Assignee: Mate Szalay-Beko
> Priority: Major
>
> Hello,
> We recently upgraded our 5-node ZooKeeper ensemble from v3.4.8 to v3.5.6.
> Encountered no issues as such.
> This is what the ZooKeeper config looks like:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.enableSasl=true
> quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post upgrade, we had to migrate server.22 on the same node, but with the
> *FOO*.bar.com domain name due to Kerberos referral issues. And we used a
> different server identifier, i.e., *23*, when we migrated. So, here is what
> the new config looked like:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
> {quote}
> We restarted all the nodes in the ensemble with the above updated config.
> The migrated node joined the quorum successfully and was serving all clients
> directly connected to it, without any issues.
> Recently, when a leader election happened,
> server.*23*=node5.foo.bar.com (the migrated node) was chosen as Leader (as it
> has the highest ID). But then, ZooKeeper was unable to serve any clients and
> *all* the servers were _somehow still_ trying to establish a channel to 22
> (old DNS name: node5.bar.com) and were throwing the below error in a loop:
> {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN
> [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve
> address: node4.bar.com}}
> {{java.net.UnknownHostException: node5.bar.com: Name or service not known}}
> {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}}
> {{ at
> java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
> {{ at
> java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
> {{ at
> java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
> {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}}
> {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
> {{ at java.base/java.lang.Thread.run(Thread.java:834)}}
> {{2020-05-02 01:43:03,026 [myid:23] - WARN
> [WorkerSender[myid=23]:QuorumCnxManager@679] - Cannot open channel to 22 at
> election address node5.bar.com:3888}}
> {{java.net.UnknownHostException: node5.bar.com}}
> {{ at
> java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220)}}
> {{ at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:403)}}
> {{ at java.base/java.net.Socket.connect(Socket.java:591)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:650)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:714)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
> {{ at java.base/java.lang.Thread.run(Thread.java:834)}}
> {quote}
> Fetching the config from the live ZooKeeper znode also doesn't show "*22*" as
> a member of the ensemble. It's not clear how "22" is still coming into the
> picture.
> {quote}In [4]: zk.get('/zookeeper/config')
> Out[4]:
> ('server.17=node1.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.19=node2.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.20=node3.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.21=node4.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.23=node5.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> version=0',
> ZnodeStat(czxid=0, mzxid=0, ctime=0, mtime=1588399290245, version=-1,
> cversion=0, aversion=-1, ephemeralOwner=0, dataLength=360, numChildren=0,
> pzxid=0))
> {quote}
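> (The same check can be repeated against each ensemble member individually; a
> minimal kazoo sketch of that, with placeholder hostnames:)
> {code:python}
> from kazoo.client import KazooClient
>
> # Placeholder hostnames; substitute the actual ensemble members.
> members = ["node1.foo.bar.com", "node2.foo.bar.com", "node3.foo.bar.com",
>            "node4.foo.bar.com", "node5.foo.bar.com"]
>
> for host in members:
>     zk = KazooClient(hosts=host + ":2181")
>     zk.start()
>     try:
>         data, stat = zk.get("/zookeeper/config")
>         config = data.decode("utf-8")
>         # Report whether this member's view of the config still mentions 22
>         print(host, "mentions server.22:", "server.22" in config)
>     finally:
>         zk.stop()
>         zk.close()
> {code}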
> We suspected some weird caching issue and restarted ZooKeeper across all the
> nodes, but that didn't help. So, whenever node5 becomes the Leader, ID:22
> pops up. We even rebooted node5, and that hasn't helped either.
> We also looked at '/zookeeper/config' content from snapshot files and did not
> find any reference to ID:22.
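> (One crude way to do that check: the config text is stored as the data of the
> /zookeeper/config znode, so a stale entry should show up as a literal byte
> string in the snapshot files. A rough sketch, assuming the default dataDir
> layout used above:)
> {code:python}
> import glob
>
> # Placeholder path; point this at the dataDir of each node.
> for snap in sorted(glob.glob("/zookeeper-data/version-2/snapshot.*")):
>     with open(snap, "rb") as f:
>         raw = f.read()
>     # Znode data is serialized as raw bytes, so a stale config entry
>     # would appear as one of the literal strings below.
>     print(snap, b"server.22=" in raw, b"node5.bar.com" in raw)
> {code}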
> Any help would be greatly appreciated.
> NOTE: dynamic config is disabled.
> Thanks,
> Rajkiran
--
This message was sent by Atlassian Jira
(v8.3.4#803005)