[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102567#comment-17102567
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3814:
---------------------------------------------

I have a docker environment ([https://github.com/symat/zookeeper-docker-test]) 
where I created a cluster and simulated the config change you made. I could 
not reproduce your issue. For me, the new node became the leader without any 
problem, and the old id/hostname never appeared in the logs of the other 
nodes. Also, my {{zoo.cfg.dynamic.next}} file always contained the correct 
server configs.

Regarding some of your previous comments:
{quote}Latest observation, we noticed that ZooKeeper was complaining about 
dynamic.next file, event though we HAVE NOT ENABLED dynamic-reconfiguration.
{quote}
This is how newer ZooKeeper versions work; it is not a bug. I think you can 
find more info here: 
[https://zookeeper.apache.org/doc/r3.5.7/zookeeperReconfig.html]
{quote}And zookeeper user did not have perms to that config directory, so we 
fixed that restarted zookeeper. And then it dumped below dynamic.next, which 
contains the OLD migrated node as a member
{quote}
This is really strange. It suggests that the nodes somehow "cached" the old 
config and that their configuration was not updated with the server list from 
the new static config.
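
One way to check for this kind of mismatch is to compare the {{server.N}} entries of the static config with those of the {{zoo.cfg.dynamic.next}} file. Below is a minimal sketch, assuming both files have already been read into strings; the helper names ({{parse_servers}}, {{diff_servers}}) and the stale file content are hypothetical, and the parser does not handle IPv6 literals.

```python
import re

# Matches lines such as "server.23=node5.foo.bar.com:2888:3888;2181" or
# "server.23=node5.foo.bar.com:2888:3888:participant;0.0.0.0:2181".
# Assumption: the host part contains no ':' (no IPv6 literals).
SERVER_LINE = re.compile(r"^\s*server\.(\d+)\s*=\s*([^:\s]+):")


def parse_servers(config_text):
    """Return {server_id: hostname} for every server.N line in the text."""
    servers = {}
    for line in config_text.splitlines():
        match = SERVER_LINE.match(line)
        if match:
            servers[int(match.group(1))] = match.group(2)
    return servers


def diff_servers(static_text, dynamic_text):
    """Compare two configs.

    Returns a tuple of three sorted lists: ids only in the static config,
    ids only in the dynamic config, and ids whose hostname differs.
    """
    static = parse_servers(static_text)
    dynamic = parse_servers(dynamic_text)
    only_static = sorted(set(static) - set(dynamic))
    only_dynamic = sorted(set(dynamic) - set(static))
    changed = sorted(
        sid for sid in set(static) & set(dynamic) if static[sid] != dynamic[sid]
    )
    return only_static, only_dynamic, changed


# Hypothetical example: the static config lists server.23, while a stale
# dynamic file still lists the old server.22.
static_cfg = (
    "tickTime=2000\n"
    "server.21=node4.foo.bar.com:2888:3888;2181\n"
    "server.23=node5.foo.bar.com:2888:3888;2181\n"
)
stale_dynamic = (
    "server.21=node4.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n"
    "server.22=node5.bar.com:2888:3888:participant;0.0.0.0:2181\n"
)
print(diff_servers(static_cfg, stale_dynamic))
```

Any id reported only on the dynamic side (here the old id 22) would be a candidate for the stale entry.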

I checked the code but could not find what the root cause might have been. 
I tried many scenarios manually but never managed to reproduce what you see.

I wonder if executing the "{{sync /zookeeper/config}}" command would help in 
this case.

If by any chance you still have (and have permission to share) the full 
server logs from all the servers for the period when you changed the hostname 
of the last node, I would be happy to take a look. I might see something that 
helps us find the root cause of this issue.
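
When going through the logs, a quick first pass could be to list every peer id that appears in a "Cannot open channel to ..." warning but is not in the current server list. The following is only a sketch based on the warning format quoted in the issue description; the function name {{unknown_peers}} is hypothetical, and it assumes each warning sits on a single line in the real log file.

```python
import re
from collections import Counter

# Based on the warning quoted in the issue, e.g.
# "... Cannot open channel to 22 at election address node5.bar.com:3888"
CHANNEL_WARNING = re.compile(
    r"Cannot open channel to (\d+) at election address (\S+)"
)


def unknown_peers(log_lines, configured_ids):
    """Count warnings per (peer id, election address) for ids not in the config."""
    hits = Counter()
    for line in log_lines:
        match = CHANNEL_WARNING.search(line)
        if match:
            sid = int(match.group(1))
            if sid not in configured_ids:
                hits[(sid, match.group(2))] += 1
    return hits


sample_log = [
    "2020-05-02 01:43:03,026 [myid:23] - WARN "
    "[WorkerSender[myid=23]:QuorumCnxManager@679] - Cannot open channel to 22 "
    "at election address node5.bar.com:3888",
    "java.net.UnknownHostException: node5.bar.com",
]
# Server ids taken from the updated config in the issue description.
print(unknown_peers(sample_log, {17, 19, 20, 21, 23}))
```

Running this over the logs of each server would show which nodes still try to reach the old id and when those attempts started.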

> ZooKeeper caching of config
> ---------------------------
>
>                 Key: ZOOKEEPER-3814
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection, quorum, server
>    Affects Versions: 3.5.6
>            Reporter: Rajkiran Sura
>            Assignee: Mate Szalay-Beko
>            Priority: Major
>
> Hello,
> We recently upgraded our 5 node ZooKeeper ensemble from v3.4.8 to v3.5.6. 
> Encountered no issues as such.
> This is how the ZooKeeper config looks like:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.enableSasl=true
> quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post upgrade, we had to migrate server.22 on the same node, but with 
> *FOO*.bar.com domain name due to kerberos referral issues. And, we used 
> different server-identifier, i.e., *23* when we migrated. So, here is how the 
> new config looked like:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
> {quote}
> We restarted all the nodes in the ensemble with the above updated config. And 
> the migrated node joined the quorum successfully and was serving all clients 
> directly connected to it, without any issues.
> Recently, when a leader election happened, 
> server.*23*=node5.foo.bar.com(migrated node) was chosen as Leader (as it has 
> highest ID). But then, ZooKeeper was unable to serve any clients and *all* 
> the servers were _somehow still_ trying to establish a channel to 22 (old DNS 
> name: node5.bar.com) and were throwing below error in a loop:
> {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN 
> [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve 
> address: node4.bar.com}}
> {{java.net.UnknownHostException: node5.bar.com: Name or service not known}}
> {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}}
> {{ at 
> java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
> {{ at 
> java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
> {{ at 
> java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
> {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}}
> {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
> {{ at java.base/java.lang.Thread.run(Thread.java:834)}}
> {{2020-05-02 01:43:03,026 [myid:23] - WARN 
> [WorkerSender[myid=23]:QuorumCnxManager@679] - Cannot open channel to 22 at 
> election address node5.bar.com:3888}}
> {{java.net.UnknownHostException: node5.bar.com}}
> {{ at 
> java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220)}}
> {{ at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:403)}}
> {{ at java.base/java.net.Socket.connect(Socket.java:591)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:650)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:714)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
> {{ at java.base/java.lang.Thread.run(Thread.java:834)}}
> {quote}
> Fetching config from live ZooKeeper znode also doesn't show "*22*" being a 
> member of the ensemble. Its not clear how "22" is still coming into the 
> picture.
> {quote}In [4]: zk.get('/zookeeper/config')
> Out[4]:
> ('server.17=node1.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.19=node2.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.20=node3.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.21=node4.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.23=node5.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> version=0',
>  ZnodeStat(czxid=0, mzxid=0, ctime=0, mtime=1588399290245, version=-1, 
> cversion=0, aversion=-1, ephemeralOwner=0, dataLength=360, numChildren=0, 
> pzxid=0))
> {quote}
> We suspected some weird caching issue and restarted ZooKeeper across all the 
> nodes but that didn't help. So, whenever node5 becomes the Leader, ID:22 is 
> popping up. We even rebooted node5 and that hasn't helped too.
> We also looked at '/zookeeper/config' content from snapshot files and did not 
> find any reference to ID:22.
> Any help would be greatly appreciated.
> NOTE: dynamic config is disabled.
> Thanks,
> Rajkiran



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
