[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-05-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468194#comment-16468194
 ] 

Hudson commented on ZOOKEEPER-2982:
---

FAILURE: Integrated in Jenkins build ZooKeeper-trunk #15 (See 
[https://builds.apache.org/job/ZooKeeper-trunk/15/])
ZOOKEEPER-2982: Re-try DNS hostname -> IP resolution if node connection (fpj: 
rev b1f6279a8c4708d1df7dd1128dc4fdf41fc7e24a)
* (edit) src/java/main/org/apache/zookeeper/server/quorum/Learner.java


> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Assignee: Flavio Junqueira
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
> Attachments: 3.5.3-beta.zip, fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-05-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468019#comment-16468019
 ] 

Hadoop QA commented on ZOOKEEPER-2982:
--

+1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+0 tests included.  The patch appears to be a documentation patch that 
doesn't require tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1665//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1665//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1665//console

This message is automatically generated.

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Assignee: Flavio Junqueira
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
> Attachments: 3.5.3-beta.zip, fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-05-08 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467049#comment-16467049
 ] 

Flavio Junqueira commented on ZOOKEEPER-2982:
-

Based on this comment, I'm +1 from this change:

https://issues.apache.org/jira/browse/ZOOKEEPER-2982?focusedCommentId=16371886=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16371886

It would have been good to have a test case, but we haven't been able to come 
up with anything, so I suggest we leave it for future work. We also have a +1 
from [~andorm] in the pull request.

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Assignee: Flavio Junqueira
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
> Attachments: 3.5.3-beta.zip, fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-05-02 Thread Abhin Balur (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460772#comment-16460772
 ] 

Abhin Balur commented on ZOOKEEPER-2982:


Can someone please comment on whether this is getting merged in 3.5.4?? When is 
3.5.4 expected?

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
> Attachments: 3.5.3-beta.zip, fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-04-23 Thread Eron Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448996#comment-16448996
 ] 

Eron Wright  commented on ZOOKEEPER-2982:
-

[~fpj] would you please assign this issue to someone and clarify what your 
expectations are? 

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
> Attachments: 3.5.3-beta.zip, fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-04-16 Thread Eron Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439848#comment-16439848
 ] 

Eron Wright  commented on ZOOKEEPER-2982:
-

Are we ready to accept this patch yet?   

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
> Attachments: 3.5.3-beta.zip, fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-03-14 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398500#comment-16398500
 ] 

Flavio Junqueira commented on ZOOKEEPER-2982:
-

hey guys, here are some comments based on the latest points:

[~abrahamfine] the improvement you are proposing requires further discussion. 
for one thing, the user has told us to use names and now we are trying to 
second-guess how to connect to the server. I'm not saying this is necessarily a 
bad idea, but I feel it needs to be addressed separately. I think we should go 
for now with the fix that [~eronwright] is proposing as it fixes an issue with 
the port.

[~andorm] do you think you will be able to come up with a test case? To recap, 
I think we need to test that we are able to resolve names correctly despite 
changes in the mapping of name to address. I'm not sure what a good way of 
testing it would be.

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
> Attachments: 3.5.3-beta.zip, fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-03-13 Thread Eron Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397477#comment-16397477
 ] 

Eron Wright  commented on ZOOKEEPER-2982:
-

[~abrahamfine] are you OK with accepting this patch as-is for now?  This would 
fix the regression between 3.4 and 3.5.  Thanks!

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
> Attachments: 3.5.3-beta.zip, fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-03-09 Thread Eron Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16393254#comment-16393254
 ] 

Eron Wright  commented on ZOOKEEPER-2982:
-

[~andorm] are you driving this issue now?   Would you please assign the bug 
appropriate?

I'm keen to see the patch make it into 3.5.4.

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
> Attachments: 3.5.3-beta.zip, fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-26 Thread Eron Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16377219#comment-16377219
 ] 

Eron Wright  commented on ZOOKEEPER-2982:
-

My humble opinion is that there's no risk in taking this patch as-is, since it 
fixes a regression.   Improvements could be made from there, on their own 
strength.

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
> Attachments: 3.5.3-beta.zip, fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-23 Thread Abraham Fine (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375213#comment-16375213
 ] 

Abraham Fine commented on ZOOKEEPER-2982:
-

[~fpj] I believe your diagnosis to be correct and I agree that [~eronwright]'s 
fix would solve the problem in the case that DNS eventually is fixed. My 
concern with the current solution is that it could cause us to jump back and 
forth between leader election and the quorum when the DNS stays in a bad state. 
For example, imagine a 3 node cluster {z1, z2, z3}. z3 is always offline and z2 
has no entry in dns. z2 will connect to z1 and win the leader election. When it 
comes time to form the quorum z1 will be unable to follow z2 as it wont be able 
to resolve its address.

Just spitballing here, but what if we had z1 connect to the 
{{remoteSocketAddress}} of the socket created from the connection it received 
in {{QuorumCnxManager}}? I understand there are some security concerns here and 
I'm not sure how much we care about that since they would be stifled by 
Kerberos. We could also do a reverse dns lookup and reject the connection if 
the reverse lookup does not align with our expected hostname. 

What do you guys think?


> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
> Attachments: 3.5.3-beta.zip, fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-23 Thread Eron Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375152#comment-16375152
 ] 

Eron Wright  commented on ZOOKEEPER-2982:
-

[~fpj] ready to merge this fix?

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
> Attachments: 3.5.3-beta.zip, fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-22 Thread Andor Molnar (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373047#comment-16373047
 ] 

Andor Molnar commented on ZOOKEEPER-2982:
-

[~fpj] Sounds reasonable to me.

I'll think about how to test this and will let you know when I can come up with 
something. 

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
> Attachments: 3.5.3-beta.zip, fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-22 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372542#comment-16372542
 ] 

Flavio Junqueira commented on ZOOKEEPER-2982:
-

I have tried your recipe for reproducing as well [~andorm] by changing 
{{/etc/hosts}} and got the same issue. The problem is that the leader fails to 
bind to the port, which actually makes me wonder whether we need to do anything 
about the leader with respect to this issue:

```
java.net.SocketException: Unresolved address
at java.net.ServerSocket.bind(ServerSocket.java:368)
at java.net.ServerSocket.bind(ServerSocket.java:329)
at org.apache.zookeeper.server.quorum.Leader.(Leader.java:240)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.makeLeader(QuorumPeer.java:1023)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1226)
```

Your suggestion of the alternative change is sensible, but I'd say that for 
consistency, it is better that we simply do the same that we have in 3.4, which 
is to make the change in {{findLeader}}.

One thing that I believe we haven't been able to do is to have a test case to 
report it. It would be good to have it, but I'm not sure what would be a good 
way.

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
> Attachments: 3.5.3-beta.zip, fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-21 Thread Andor Molnar (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371980#comment-16371980
 ] 

Andor Molnar commented on ZOOKEEPER-2982:
-

Looks like that sock.connect() in connectToLeader() requires the address to be 
resolved already.

If QuorumCnxnManager() fails to do that, connectToLeader() might be able to 
detect & fix it by doing resolution explicitly when addr.isUnresolved() == true.

Not sure if it's any better than doing recreateSocketAddress() in findLeader(), 
but it may be another option to consider.

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
> Attachments: 3.5.3-beta.zip, fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-21 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371886#comment-16371886
 ] 

Flavio Junqueira commented on ZOOKEEPER-2982:
-

I think I know what's going on. It is correct that {{connectOne}} invokes 
{{recreateSocketAddresses}}, but that invocation won't happen in the case that 
the server receives a connection request rather than starting the connection. 
In fact, servers with larger ids are always supposed to start the connections.

I think the patch proposed here of invoking {{recreateSocketAddresses}} in 
{{findLeader}} in the learner class makes sense to compensate for 
{{recreateSocketAddresses}} not being invoked during leader election. Any other 
insight or anything I'm missing?



> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
> Attachments: 3.5.3-beta.zip, fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-21 Thread Andor Molnar (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371943#comment-16371943
 ] 

Andor Molnar commented on ZOOKEEPER-2982:
-

[~fpj] I think you're right.

Looking at the logfile that [~eronwright]uploaded recently I see connections 
are being received from other members of the ensembe on the election port. 
myid=1 is just unable to initiate due to DNS error and doesn't either give up 
connection, because not realizing that myid=1 < id of the other connection.

I believe the ensembe looks like this:

myid=1 zookeeper-0.zookeeper-headless.default.svc.cluster.local 10.8.2.14
myid=2 zookeeper-1.zookeeper-headless.default.svc.cluster.local 10.8.0.6
myid=3 zookeeper-2.zookeeper-headless.default.svc.cluster.local 10.8.1.7

 

 

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
> Attachments: 3.5.3-beta.zip, fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-21 Thread Eron Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371727#comment-16371727
 ] 

Eron Wright  commented on ZOOKEEPER-2982:
-

Thanks everyone for your patience.  I've attached a log showing the unpatched 
behavior (3.5.3-beta).   You'll observe two ongoing exceptions, one from 
{{QuorumCnxManager}}, the other from {{Learner}}. Around {{2018-02-21 
17:23:36,746}}, the DNS record finally comes up and {{QuorumCnxManager}} 
recovers whereas {{Learner}} doesn't.

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
> Attachments: 3.5.3-beta.zip, fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-21 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371452#comment-16371452
 ] 

Flavio Junqueira commented on ZOOKEEPER-2982:
-

>From the logs, we can see the same exception being raised when the server is 
>trying to connect to elect a leader:

{noformat}
2018-02-20 20:41:25,669 [myid:1] - WARN  
[QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer$QuorumServer@173]
 - Failed to resolve address: 
pravega-zookeeper-2.pravega-zookeeper-headless.default.svc.cluster.local
java.net.UnknownHostException: 
pravega-zookeeper-2.pravega-zookeeper-headless.default.svc.cluster.local
at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at java.net.InetAddress.getByName(InetAddress.java:1076)
at 
org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:171)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:727)
at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:682)
at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:716)
at 
org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:919)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1190)
{noformat}

Once the address resolves and it can connect, the exception goes away and the 
notification messages flow regularly. The question is why the update performed 
during leader election to the quorum view in {{QuorumCnxManager.connectOne}} is 
not taking any effect in the view that {{Learner.findLeader}} uses to get the 
`QuorumServer` instance to connect to the leader. Two possibilities I can think 
of:

1- The server hasn't connected to the elected server during leader election, in 
which case the address wasn't updated.
2- The quorum view that the learner is using to get the quorum server instance 
is not the one that was updated in {{QuorumCnxManager}}.



> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
> Attachments: fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-21 Thread Andor Molnar (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371425#comment-16371425
 ] 

Andor Molnar commented on ZOOKEEPER-2982:
-

[~eronwright]

I've tried this on localhost by adding fake dns names to /etc/hosts like this:
{noformat}
127.0.0.1 one.andor.org
127.0.0.1 two.andor.org
#127.0.0.1 three.andor.org{noformat}
First, all of the 3 entries were commented out and I started ZooKeeper nodes 
with the following server config:
{noformat}
server.1=one.andor.org::2223
server.2=two.andor.org::3334
server.3=three.andor.org::4445
{noformat}
Nodes were unable to connect because of the following resolution error:
{noformat}
2018-02-21 14:33:25,509 [myid:1] - WARN 
[QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer$QuorumServer@172]
 - Failed to resolve address: two.andor.org
java.net.UnknownHostException: two.andor.org
at java.net.InetAddress.getAllByName0(InetAddress.java:1273)
at java.net.InetAddress.getAllByName(InetAddress.java:1185)
at java.net.InetAddress.getAllByName(InetAddress.java:1119)
at java.net.InetAddress.getByName(InetAddress.java:1069)
at 
org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:170)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:726)
at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:686)
at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:720)
at 
org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:919)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1171){noformat}
Similar entries are keep repeated in both server logs. As I can see ZK is 
trying to call recreateSocketAddresses() and tries to re-resolve the address 
every time it's trying to connect.

This is the case _without_ your patch.

When I enabled the entries in /etc/hosts, the following error showed up in the 
logs:
{noformat}
2018-02-21 14:37:07,178 [myid:1] - WARN 
[QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumCnxManager@663]
 - Cannot open channel to 2 at election address two.andor.org/127.0.0.1:3334
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:580)
at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:641)
at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:692)
at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:720)
at 
org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:919)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1171){noformat}
The error shows that DNS resolution was successful (127.0.0.1) and the 
connection issue is different (Connection refused) which might be related to my 
silly test environment (socket has not been created on the other side), but the 
key takeaway here is that [~abrahamfine] is probably right and the 
re-resolution happens properly.

I repeated the test with your patch too and the results are the same. No 
difference.

Maybe I'm missing something and the test might not be relevant at all, but at 
least it's a little bit confusing.

[~eronwright]Would you please attach logs running the same ensemble _without_ 
your patch too?

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
> Attachments: fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - 

[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-20 Thread Eron Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370735#comment-16370735
 ] 

Eron Wright  commented on ZOOKEEPER-2982:
-

Attached 'fixed.log' which demonstrates the behavior after the fix is applied.  
 Let me know if you also need to see the output from an unpatched cluster (I 
would prefer not to spend the time to get that).

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
> Attachments: fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-20 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370721#comment-16370721
 ] 

Flavio Junqueira commented on ZOOKEEPER-2982:
-

[~eronwright] could you upload some server logs so that we can have a look, 
please?

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-20 Thread Eron Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370712#comment-16370712
 ] 

Eron Wright  commented on ZOOKEEPER-2982:
-

The step-by-step instructions are:
1. configure a three node ensemble (0,1,2).
2. by whatever means, configure 0 such that it cannot resolve the DNS address 
of 1 and/or 2.   Do likewise for other servers.
3. Start the ensemble, observing the {{UnknownHostException}} as shown in the 
description.
4. While the servers are running, fix the DNS issue so that the addresses may 
be resolved.
5. Observe that the exception continues to occur.

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-20 Thread Abraham Fine (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370708#comment-16370708
 ] 

Abraham Fine commented on ZOOKEEPER-2982:
-

[~eronwright] I think `connectToLeader` uses the address from `findLeader` 
which should read the `QuorumVerifier` updated by `recreateSocketAddresses` 
called in `connectOne`.

I have a feeling it would be tough, but if you could come up with a test to 
reproduce the issue you are facing or give us step by step instructions to 
reproduce (ideally outside of K8s) that would help us confirm the problem.

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-20 Thread Eron Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370699#comment-16370699
 ] 

Eron Wright  commented on ZOOKEEPER-2982:
-

[~fpj] that is correct; without the patch, the ensemble never comes online.

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-20 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370695#comment-16370695
 ] 

Flavio Junqueira commented on ZOOKEEPER-2982:
-

[~eronwright] in your set up, once you get that exception, is it the case that 
the ensemble never recovers (it is never able to elect a leader)?

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-20 Thread Eron Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370694#comment-16370694
 ] 

Eron Wright  commented on ZOOKEEPER-2982:
-

Sorry I was just responding to [~abrahamfine] regarding why `connectOne` 
doesn't cover all cases.

Based on empirical evidence I think this patch is critical for ZK 3.5.

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-20 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370690#comment-16370690
 ] 

Flavio Junqueira commented on ZOOKEEPER-2982:
-

[~eronwright] a follower invokes followLeader after leader election completes. 
followLeader makes a call to connectToLeader.

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-20 Thread Eron Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370688#comment-16370688
 ] 

Eron Wright  commented on ZOOKEEPER-2982:
-

Looking at the code a bit, the learner makes a raw socket connection 
(`connectToLeader` -> `sockConnect`) not involving `QuorumCnxManager`.   Sorry 
I don't know all the details.   

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-20 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370682#comment-16370682
 ] 

Flavio Junqueira commented on ZOOKEEPER-2982:
-

I don't see a reason why it would be a problem to invoke 
\{recreateSocketAddresses} from \{Learner.findLeader}. That method is simply 
returning the \{QuorumServer} instance corresponding to the server it believes 
to be the leader.

The thing that is puzzling is how a server ended up voting for another server 
that it can't talk to (because the name doesn't resolve). Is it a race that 
eventually goes away?

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-20 Thread Eron Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370676#comment-16370676
 ] 

Eron Wright  commented on ZOOKEEPER-2982:
-

I can confirm that the fix is effective in my K8s environment.  I discovered 
the problem while using 3.5.3-beta.

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370670#comment-16370670
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2982:
---

Github user afine commented on the issue:

https://github.com/apache/zookeeper/pull/468
  
@EronWright I left a comment in the JIRA. 


> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-20 Thread Abraham Fine (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370631#comment-16370631
 ] 

Abraham Fine commented on ZOOKEEPER-2982:
-

I'm wondering if  [~rthille] can chime in on this.

It looks like the change this JIRA is talking about is referenced by 
https://issues.apache.org/jira/browse/ZOOKEEPER-1506?focusedCommentId=14711955=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14711955

Is there a reason why this change was left out of branch-3.5 (and master)?

> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369596#comment-16369596
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2982:
---

GitHub user EronWright opened a pull request:

https://github.com/apache/zookeeper/pull/468

ZOOKEEPER-2982: Re-try DNS hostname -> IP resolution if node connection 
fails

This PR ports a fix from the 3.4 to the 3.5 branch, that was originally 
made in ZOOKEEPER-1506.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/EronWright/zookeeper ZOOKEEPER-2982

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/zookeeper/pull/468.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #468


commit 4f8f3ce8074b878f2a6e32c15ec177f4dcd050e0
Author: Eron Wright 
Date:   2018-02-19T23:05:44Z

ZOOKEEPER-2982 Re-try DNS hostname -> IP resolution if node connection fails




> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
>The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-19 Thread Eron Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369441#comment-16369441
 ] 

Eron Wright  commented on ZOOKEEPER-2982:
-

Looking at `Learner` in 3.4 versus 3.5, a necessary call within `findLeader` to 
`recreateSocketAddresses` was added in 3.4 but not ported to 3.5.  The 
suggested fix is to add the call.

Note that other elements of ZOOKEEPR-1506 were already ported:
https://github.com/apache/zookeeper/commit/d2a49163b7bc7c9589140dbba7f60e591028f908


> Re-try DNS hostname -> IP resolution
> 
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.5.1, 3.5.3
>Reporter: Eron Wright 
>Priority: Blocker
> Fix For: 3.5.4
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address.For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  
> [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
>  - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)