Hello Rajkiran,

Did you also change the content of the 'myid' file from 22 to 23 when you
migrated the node?
Please note that the newer ZooKeeper version you use sends its ID during the
leader election protocol initiation. So if a server still thinks that its
ID is 22, it will send this ID to the others, which will believe the ID
and try to find a host address for this server (either from the config or
from the last committed view).
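
If it helps, here is a minimal Python sketch for cross-checking the ID stored
in 'myid' against the server.N entries in the config. It is not part of
ZooKeeper; the function names are hypothetical, and the parsing only assumes
the plain server.N=host:port:port;clientPort line format shown in your report.

```python
import re


def parse_server_ids(cfg_text):
    """Return {server_id: host} for every server.N line in zoo.cfg-style text."""
    servers = {}
    for line in cfg_text.splitlines():
        # Matches e.g. "server.23=node5.foo.bar.com:2888:3888;2181"
        match = re.match(r"\s*server\.(\d+)\s*=\s*([^:\s]+)", line)
        if match:
            servers[int(match.group(1))] = match.group(2)
    return servers


def check_myid(myid_text, cfg_text):
    """Return (myid, True/False) where the flag says whether myid is configured."""
    myid = int(myid_text.strip())
    return myid, myid in parse_server_ids(cfg_text)
```

On node5 you would pass in the content of the 'myid' file from your dataDir
(/zookeeper-data/myid, going by the config in your report) and the content of
your config file. A result of (22, False) would mean the node still announces
the old ID.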

Kind regards,
Mate

On Sat, May 2, 2020 at 8:55 AM Rajkiran Sura (Jira) <j...@apache.org> wrote:

> Rajkiran Sura created ZOOKEEPER-3814:
> ----------------------------------------
>
>              Summary: ZooKeeper caching of config
>                  Key: ZOOKEEPER-3814
>                  URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
>              Project: ZooKeeper
>           Issue Type: Bug
>           Components: leaderElection, quorum, server
>     Affects Versions: 3.5.6
>             Reporter: Rajkiran Sura
>
>
> Hello,
>
> We recently upgraded our 5 node ZooKeeper ensemble from v3.4.8 to v3.5.6.
> Encountered no issues as such.
>
> This is what the ZooKeeper config looks like:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.enableSasl=true
> quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post upgrade, we had to migrate server.22 on the same node, but with the
> *FOO*.bar.com domain name due to Kerberos referral issues. We also used a
> different server identifier, i.e., *23*, when we migrated. So, here is what
> the new config looked like:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
> {quote}
> We restarted all the nodes in the ensemble with the above updated config.
> And the migrated node joined the quorum successfully and was serving all
> clients directly connected to it, without any issues.
>
> Recently, when a leader election happened, server.*23*=node5.foo.bar.com
> (the migrated node) was chosen as Leader (as it has the highest ID). But
> then, ZooKeeper was unable to serve any clients and *all* the servers were
> _somehow still_ trying to establish a channel to 22 (old DNS name:
> node5.bar.com) and were throwing the error below in a loop:
> {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN
> [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve
> address: node4.bar.com}}
> {{java.net.UnknownHostException: node5.bar.com: Name or service not
> known}}
> {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native
> Method)}}
> {{ at java.base/java.net
> .InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
> {{ at java.base/java.net
> .InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
> {{ at java.base/java.net
> .InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
> {{ at java.base/java.net
> .InetAddress.getAllByName0(InetAddress.java:1505)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}}
> {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
> {{ at java.base/java.lang.Thread.run(Thread.java:834)}}
> {{2020-05-02 01:43:03,026 [myid:23] - WARN
> [WorkerSender[myid=23]:QuorumCnxManager@679] - Cannot open channel to 22
> at election address node5.bar.com:3888}}
> {{java.net.UnknownHostException: node5.bar.com}}
> {{ at java.base/java.net
> .AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220)}}
> {{ at java.base/java.net
> .SocksSocketImpl.connect(SocksSocketImpl.java:403)}}
> {{ at java.base/java.net.Socket.connect(Socket.java:591)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:650)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:714)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
> {{ at java.base/java.lang.Thread.run(Thread.java:834)}}
> {quote}
> Fetching the config from the live ZooKeeper znode also doesn't show "*22*"
> being a member of the ensemble. It's not clear how "22" is still coming into
> the picture.
> {quote}In [4]: zk.get('/zookeeper/config')
> Out[4]:
> ('server.17=node1.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.19=node2.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.20=node3.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.21=node4.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.23=node5.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> version=0',
>  ZnodeStat(czxid=0, mzxid=0, ctime=0, mtime=1588399290245, version=-1,
> cversion=0, aversion=-1, ephemeralOwner=0, dataLength=360, numChildren=0,
> pzxid=0))
> {quote}
> We suspected some weird caching issue and restarted ZooKeeper across all
> the nodes, but that didn't help. So, whenever node5 becomes the Leader,
> ID:22 pops up. We even rebooted node5 and that didn't help either.
>
> We also looked at '/zookeeper/config' content from snapshot files and did
> not find any reference to ID:22.
>
> Any help would be greatly appreciated.
>
> NOTE: dynamic config is disabled.
>
> Thanks,
> Rajkiran
>
>
>
> --
> This message was sent by Atlassian Jira
> (v8.3.4#803005)
>
