Rajkiran Sura created ZOOKEEPER-3814:
----------------------------------------

             Summary: ZooKeeper caching of config
                 Key: ZOOKEEPER-3814
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
             Project: ZooKeeper
          Issue Type: Bug
          Components: leaderElection, quorum, server
    Affects Versions: 3.5.6
            Reporter: Rajkiran Sura


Hello,

We recently upgraded our 5-node ZooKeeper ensemble from v3.4.8 to v3.5.6.
The upgrade itself went through without issues.

This is what the ZooKeeper config looks like:
{quote}tickTime=2000
dataDir=/zookeeper-data/
initLimit=5
syncLimit=2
maxClientCnxns=2048
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
4lw.commands.whitelist=stat, ruok, conf, isro, mntr
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
requireClientAuthScheme=sasl
quorum.cnxn.threads.size=20
quorum.auth.enableSasl=true
quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
quorum.auth.learnerRequireSasl=true
quorum.auth.learner.saslLoginContext=QuorumLearner
quorum.auth.serverRequireSasl=true
quorum.auth.server.saslLoginContext=QuorumServer
server.17=node1.foo.bar.com:2888:3888;2181
server.19=node2.foo.bar.com:2888:3888;2181
server.20=node3.foo.bar.com:2888:3888;2181
server.21=node4.foo.bar.com:2888:3888;2181
server.22=node5.bar.com:2888:3888;2181
{quote}
After the upgrade, we had to migrate server.22 on the same node to the 
*FOO*.bar.com domain name because of Kerberos referral issues. We also 
assigned it a different server identifier, *23*, when we migrated. Here is what 
the new config looked like:
{quote}server.17=node1.foo.bar.com:2888:3888;2181
server.19=node2.foo.bar.com:2888:3888;2181
server.20=node3.foo.bar.com:2888:3888;2181
server.21=node4.foo.bar.com:2888:3888;2181
*server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
{quote}
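To make the membership change explicit, here is a minimal sketch that diffs the `server.N` entries of the two configs shown above. The `parse_servers` helper and the regular expression are illustrative assumptions, not part of ZooKeeper; the input is the plain-text config with the Jira emphasis markup removed.

```python
import re

# Matches static-config lines such as "server.17=node1.foo.bar.com:2888:3888;2181".
SERVER_LINE = re.compile(r"^server\.(\d+)=([^:]+):(\d+):(\d+)")


def parse_servers(cfg_text):
    """Return {server_id: hostname} for every server.N entry in cfg_text.

    Hypothetical helper for illustration only.
    """
    servers = {}
    for line in cfg_text.splitlines():
        match = SERVER_LINE.match(line.strip())
        if match:
            servers[int(match.group(1))] = match.group(2)
    return servers


OLD_CFG = """\
server.17=node1.foo.bar.com:2888:3888;2181
server.19=node2.foo.bar.com:2888:3888;2181
server.20=node3.foo.bar.com:2888:3888;2181
server.21=node4.foo.bar.com:2888:3888;2181
server.22=node5.bar.com:2888:3888;2181
"""

NEW_CFG = """\
server.17=node1.foo.bar.com:2888:3888;2181
server.19=node2.foo.bar.com:2888:3888;2181
server.20=node3.foo.bar.com:2888:3888;2181
server.21=node4.foo.bar.com:2888:3888;2181
server.23=node5.foo.bar.com:2888:3888;2181
"""

old = parse_servers(OLD_CFG)
new = parse_servers(NEW_CFG)

# IDs present only in the old config, and IDs present only in the new one.
removed = sorted(set(old) - set(new))
added = sorted(set(new) - set(old))

print("removed:", {sid: old[sid] for sid in removed})
print("added:", {sid: new[sid] for sid in added})
```

With these two configs, the only difference is that ID 22 (node5.bar.com) is gone and ID 23 (node5.foo.bar.com) is new.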
We restarted all the nodes in the ensemble with the updated config above. The 
migrated node joined the quorum successfully and served all clients directly 
connected to it without any issues.

Recently, when a leader election happened, 
server.*23*=node5.foo.bar.com (the migrated node) was chosen as Leader (as it 
has the highest ID). After that, ZooKeeper was unable to serve any clients, and 
*all* the servers were _somehow still_ trying to establish a channel to 22 (old 
DNS name: node5.bar.com), throwing the errors below in a loop:
{quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN 
[WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve 
address: node5.bar.com}}
{{java.net.UnknownHostException: node5.bar.com: Name or service not known}}
{{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}}
{{ at 
java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
{{ at 
java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
{{ at 
java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
{{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}}
{{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
{{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}}
{{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}}
{{ at 
org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}}
{{ at 
org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}}
{{ at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}}
{{ at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
{{ at 
org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
{{ at 
org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
{{ at java.base/java.lang.Thread.run(Thread.java:834)}}
{{2020-05-02 01:43:03,026 [myid:23] - WARN 
[WorkerSender[myid=23]:QuorumCnxManager@679] - Cannot open channel to 22 at 
election address node5.bar.com:3888}}
{{java.net.UnknownHostException: node5.bar.com}}
{{ at 
java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220)}}
{{ at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:403)}}
{{ at java.base/java.net.Socket.connect(Socket.java:591)}}
{{ at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:650)}}
{{ at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:714)}}
{{ at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
{{ at 
org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
{{ at 
org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
{{ at java.base/java.lang.Thread.run(Thread.java:834)}}
{quote}
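As a way to confirm which peer IDs the servers are still trying to reach, the following minimal sketch scans log text for the "Cannot open channel to ..." warning shown above and reports any ID that is not in the current config. The `stale_peers` helper and the regular expression are assumptions based only on the message format in this log excerpt, and the sample line is the warning above joined onto a single line.

```python
import re

# Based on the WARN message format seen in the log excerpt in this report.
CHANNEL_ERROR = re.compile(
    r"Cannot open channel to (\d+) at election address (\S+):(\d+)"
)


def stale_peers(log_text, configured_ids):
    """Return {server_id: election_host} for peers seen in the log that are
    not listed in configured_ids. Hypothetical helper for illustration only.
    """
    stale = {}
    for match in CHANNEL_ERROR.finditer(log_text):
        sid = int(match.group(1))
        if sid not in configured_ids:
            stale[sid] = match.group(2)
    return stale


SAMPLE_LOG = (
    "2020-05-02 01:43:03,026 [myid:23] - WARN "
    "[WorkerSender[myid=23]:QuorumCnxManager@679] - Cannot open channel to 22 at "
    "election address node5.bar.com:3888\n"
)

# Server IDs from the updated config.
CONFIGURED_IDS = {17, 19, 20, 21, 23}

result = stale_peers(SAMPLE_LOG, CONFIGURED_IDS)
print(result)
```

Running the same function over the full server logs of each node would show whether every server, or only the new Leader, still refers to ID 22.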
Fetching the config from the live ZooKeeper znode also doesn't show "*22*" as a 
member of the ensemble. It's not clear how "22" is still coming into the picture.
{quote}In [4]: zk.get('/zookeeper/config')
Out[4]:
('server.17=node1.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
server.19=node2.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
server.20=node3.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
server.21=node4.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
server.23=node5.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
version=0',
 ZnodeStat(czxid=0, mzxid=0, ctime=0, mtime=1588399290245, version=-1, 
cversion=0, aversion=-1, ephemeralOwner=0, dataLength=360, numChildren=0, 
pzxid=0))
{quote}
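The check on the znode content can also be done programmatically. This is a minimal sketch that parses a `/zookeeper/config` payload like the one above into a membership map and tests for ID 22. The `ensemble_members` helper and its regular expression are assumptions derived from the payload format shown here; the payload is hard-coded rather than fetched from a live server.

```python
import re

# One entry of the form
# "server.ID=host:quorumPort:electionPort:role;clientAddr:clientPort".
MEMBER = re.compile(r"^server\.(\d+)=([^:]+):(\d+):(\d+):(\w+);(.+)$")


def ensemble_members(payload):
    """Map server ID -> (host, role) from a /zookeeper/config payload.

    Accepts str or bytes. Hypothetical helper for illustration only.
    """
    if isinstance(payload, bytes):
        payload = payload.decode("utf-8")
    members = {}
    for line in payload.splitlines():
        match = MEMBER.match(line.strip())
        if match:
            members[int(match.group(1))] = (match.group(2), match.group(5))
    return members


# Payload copied from the zk.get('/zookeeper/config') output in this report.
PAYLOAD = (
    "server.17=node1.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n"
    "server.19=node2.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n"
    "server.20=node3.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n"
    "server.21=node4.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n"
    "server.23=node5.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n"
    "version=0"
)

members = ensemble_members(PAYLOAD)
print(sorted(members))
print(22 in members)
```

The `version=0` line does not match the pattern and is skipped, so only the five `server.N` entries end up in the map.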
We suspected a caching issue and restarted ZooKeeper across all the nodes, but 
that didn't help. Whenever node5 becomes the Leader, ID:22 pops up again. We 
even rebooted node5, and that didn't help either.

We also looked at '/zookeeper/config' content from snapshot files and did not 
find any reference to ID:22.

Any help would be greatly appreciated.

NOTE: dynamic config is disabled.

Thanks,
Rajkiran



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
