[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mate Szalay-Beko resolved ZOOKEEPER-3991.
-----------------------------------------
    Fix Version/s: 3.6.3
                   3.7.0
       Resolution: Fixed

> QuorumCnxManager Listener port bind retry does not retry DNS lookup
> -------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3991
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3991
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.6.2
>            Reporter: Lander Visterin
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 3.7.0, 3.6.3
>
>         Attachments: repro.tar.gz
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> We run ZooKeeper in a container environment where DNS is not stable. As 
> recommended by the documentation, we set _electionPortBindRetry_ to 0 (keeps 
> retrying forever).
> On some instances, we get the following exception in an infinite loop, even 
> though the address has since become resolvable:
>  
> {noformat}
> zk-2_1  | 2020-11-03 10:57:08,407 [myid:3] - ERROR [ListenerHandler-zk-2.test:3888:QuorumCnxManager$Listener$ListenerHandler@1093] - Exception while listening
> zk-2_1  | java.net.SocketException: Unresolved address
> zk-2_1  |     at java.base/java.net.ServerSocket.bind(Unknown Source)
> zk-2_1  |     at java.base/java.net.ServerSocket.bind(Unknown Source)
> zk-2_1  |     at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.createNewServerSocket(QuorumCnxManager.java:1140)
> zk-2_1  |     at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.acceptConnections(QuorumCnxManager.java:1064)
> zk-2_1  |     at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.run(QuorumCnxManager.java:1033)
> zk-2_1  |     at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> zk-2_1  |     at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
> zk-2_1  |     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> zk-2_1  |     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> zk-2_1  |     at java.base/java.lang.Thread.run(Unknown Source)
> {noformat}
> ZooKeeper does not actually retry the DNS resolution; it just keeps using the 
> old failed result.
>  
> This happens because the InetSocketAddress is created once and the DNS lookup 
> happens when it is created.
> This issue has come up previously in 
> https://issues.apache.org/jira/browse/ZOOKEEPER-1506 but it appears to still 
> happen here.
> I have attached a repro.tar.gz to help reproduce this issue. Steps:
>  * Untar repro.tar.gz
>  * docker-compose up
>  * Observe that the exception keeps occurring for zk-2 but not for the others
>  * Open db.test and uncomment the zk-2 line, increment the serial and save
>  * Wait a few seconds for the DNS to refresh
>  * Verify that you can resolve zk-2.test now (dig @172.16.60.2 zk-2.test) but 
> the error keeps appearing
> I have also attached a patch that resolves this. Each time it tries to create 
> the server socket, the patch retries DNS resolution if the address is still 
> unresolved.
>  
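For illustration, the behaviour described in the report can be sketched in a few lines of Java. This is a minimal sketch of the general idea only, not the attached patch or ZooKeeper's actual code: the class name ReResolveSketch and the helper reResolveIfNeeded are hypothetical. InetSocketAddress.createUnresolved stands in for a lookup that failed at startup, and an IP literal is used so the example runs without a DNS server.

```java
import java.net.InetSocketAddress;

public class ReResolveSketch {

    // Hypothetical helper (not ZooKeeper code): an InetSocketAddress never
    // re-runs its lookup, so a still-unresolved address is replaced by a new
    // instance, whose constructor attempts the resolution again.
    public static InetSocketAddress reResolveIfNeeded(InetSocketAddress addr) {
        if (!addr.isUnresolved()) {
            return addr; // already resolved, nothing to do
        }
        return new InetSocketAddress(addr.getHostString(), addr.getPort());
    }

    public static void main(String[] args) {
        // Stand-in for an address whose lookup failed when it was first created.
        InetSocketAddress stale = InetSocketAddress.createUnresolved("127.0.0.1", 3888);

        // Calling this before every bind attempt lets a later retry succeed.
        InetSocketAddress fresh = reResolveIfNeeded(stale);

        System.out.println("stale unresolved: " + stale.isUnresolved());
        System.out.println("fresh unresolved: " + fresh.isUnresolved());
    }
}
```

The original object stays unresolved for its whole lifetime, which is why a retry loop that reuses it keeps failing with "Unresolved address" even after DNS recovers.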



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
