[
https://issues.apache.org/jira/browse/ZOOKEEPER-3991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mate Szalay-Beko resolved ZOOKEEPER-3991.
-----------------------------------------
Fix Version/s: 3.6.3, 3.7.0
Resolution: Fixed
> QuorumCnxManager Listener port bind retry does not retry DNS lookup
> -------------------------------------------------------------------
>
> Key: ZOOKEEPER-3991
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3991
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum
> Affects Versions: 3.6.2
> Reporter: Lander Visterin
> Priority: Minor
> Labels: pull-request-available
> Fix For: 3.7.0, 3.6.3
>
> Attachments: repro.tar.gz
>
> Time Spent: 2h 20m
> Remaining Estimate: 0h
>
> We run ZooKeeper in a container environment where DNS is not stable. As
> recommended by the documentation, we set _electionPortBindRetry_ to 0, which
> makes it retry forever.
> On some instances, the following exception repeats in an infinite loop, even
> though the address has since become resolvable:
>
> {noformat}
> zk-2_1 | 2020-11-03 10:57:08,407 [myid:3] - ERROR [ListenerHandler-zk-2.test:3888:QuorumCnxManager$Listener$ListenerHandler@1093] - Exception while listening
> zk-2_1 | java.net.SocketException: Unresolved address
> zk-2_1 | at java.base/java.net.ServerSocket.bind(Unknown Source)
> zk-2_1 | at java.base/java.net.ServerSocket.bind(Unknown Source)
> zk-2_1 | at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.createNewServerSocket(QuorumCnxManager.java:1140)
> zk-2_1 | at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.acceptConnections(QuorumCnxManager.java:1064)
> zk-2_1 | at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.run(QuorumCnxManager.java:1033)
> zk-2_1 | at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> zk-2_1 | at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
> zk-2_1 | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> zk-2_1 | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> zk-2_1 | at java.base/java.lang.Thread.run(Unknown Source)
> {noformat}
> ZooKeeper does not actually retry the DNS resolution; it just keeps using the
> old failed result.
>
> This happens because the InetSocketAddress is created once and the DNS lookup
> happens when it is created.
> This issue has come up previously in
> https://issues.apache.org/jira/browse/ZOOKEEPER-1506 but it appears to still
> happen here.
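
To illustrate the cause described above, here is a minimal Java sketch. It is not the ZooKeeper code; the class name `UnresolvedAddressDemo` and the helper `reResolve` are hypothetical. `InetSocketAddress.createUnresolved` stands in for a lookup that failed at construction time, and an IP literal is used so the example runs without any DNS server.

```java
import java.net.InetSocketAddress;

// Hypothetical demo class, not part of ZooKeeper.
final class UnresolvedAddressDemo {

    // Hypothetical helper: build a fresh InetSocketAddress from the stored
    // host string and port. The constructor attempts the lookup again.
    static InetSocketAddress reResolve(InetSocketAddress addr) {
        return new InetSocketAddress(addr.getHostString(), addr.getPort());
    }

    public static void main(String[] args) {
        // Simulates an address whose lookup failed when it was created.
        InetSocketAddress stale = InetSocketAddress.createUnresolved("127.0.0.1", 3888);

        // The object never resolves itself later; it stays unresolved
        // no matter how long we wait or how often we reuse it.
        System.out.println("stale unresolved: " + stale.isUnresolved());

        // Only constructing a new object triggers another resolution attempt.
        InetSocketAddress fresh = reResolve(stale);
        System.out.println("fresh unresolved: " + fresh.isUnresolved());
    }
}
```

Passing the stale object to `ServerSocket.bind` on every retry therefore fails the same way each time, which matches the repeated "Unresolved address" exception in the log.
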
> I have attached a repro.tar.gz to help reproduce this issue. Steps:
> * Untar repro.tar.gz
> * Run docker-compose up
> * Observe that the exception keeps occurring for zk-2 but not for the others
> * Open db.test, uncomment the zk-2 line, increment the serial, and save
> * Wait a few seconds for the DNS to refresh
> * Verify that you can now resolve zk-2.test (dig @172.16.60.2 zk-2.test) even
> though the error keeps appearing
> I have also attached a patch that resolves this. Every time it tries to create
> the server socket, the patch retries DNS resolution if the address is still
> unresolved.
>
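
The approach the reporter describes (retry the lookup whenever the address is still unresolved at bind time) might look roughly like the sketch below. This is an assumption-laden illustration, not the attached patch or the merged fix: the class `BindRetrySketch`, the method `bindWithRetry`, and the `sleepMillis` parameter are hypothetical. Following the description above, `maxRetries == 0` is treated as "retry forever".

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical sketch, not the actual QuorumCnxManager code.
final class BindRetrySketch {

    // Tries to bind a server socket, repeating the DNS lookup before each
    // attempt if the address is still unresolved.
    // maxRetries == 0 means retry forever (assumed from the issue text).
    static ServerSocket bindWithRetry(InetSocketAddress address, int maxRetries, long sleepMillis)
            throws IOException, InterruptedException {
        InetSocketAddress current = address;
        IOException last = null;
        int attempts = 0;

        while (maxRetries == 0 || attempts < maxRetries) {
            if (current.isUnresolved()) {
                // A new InetSocketAddress performs a new lookup; reusing
                // the old object would keep the old failed result.
                current = new InetSocketAddress(current.getHostString(), current.getPort());
            }

            ServerSocket socket = new ServerSocket();
            try {
                socket.setReuseAddress(true);
                socket.bind(current); // throws SocketException if still unresolved
                return socket;
            } catch (IOException e) {
                socket.close();
                last = e;
                attempts++;
                Thread.sleep(sleepMillis);
            }
        }
        throw (last != null) ? last : new IOException("no bind attempts were made");
    }
}
```

A caller could then use something like `bindWithRetry(InetSocketAddress.createUnresolved("zk-2.test", 3888), 0, 1000L)`, which would keep retrying until the name resolves and the port binds.
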
--
This message was sent by Atlassian Jira
(v8.3.4#803005)