[ https://issues.apache.org/jira/browse/CURATOR-638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zili Chen closed CURATOR-638.
-----------------------------

> Curator disconnect from zookeeper when IPs change
> -------------------------------------------------
>
>                 Key: CURATOR-638
>                 URL: https://issues.apache.org/jira/browse/CURATOR-638
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Client, Recipes
>    Affects Versions: 5.2.1
>         Environment: Docker or Kubernetes, docker example provided
>            Reporter: Francis Simon
>            Assignee: Zili Chen
>            Priority: Blocker
>             Fix For: 5.4.0
>
>         Attachments: zkissue.zip
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Blocking usage of ZooKeeper in production. We tested several versions; all 
> had the issue. It affects any recipe that uses ephemeral nodes. Example 
> attached.
> We use multiple Apache Curator recipes in our system, which runs in 
> Docker and Kubernetes. The behavior I am seeing is that Curator resolves 
> the configured DNS names to the containers' IP addresses rather than 
> staying tied to the names. I have seen old tickets on this, but the 
> behavior is reproducible on the latest release.
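> For context, the sample builds its client with the DNS names of the ZooKeeper 
> containers in its connection string, roughly like the following (a sketch; the 
> retry policy and port are placeholders, the real code is in the attached 
> zkissue.zip):
> {code:java}
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.CuratorFrameworkFactory;
> import org.apache.curator.retry.ExponentialBackoffRetry;
> 
> // The connection string uses the compose service names, not IPs.
> // The bug: the names seem to be resolved to container IPs once, and the
> // client never re-resolves them after the containers move.
> CuratorFramework client = CuratorFrameworkFactory.newClient(
>         "zookeeper1:2181,zookeeper2:2181,zookeeper3:2181",
>         new ExponentialBackoffRetry(1000, 3));
> client.start();
> {code}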
> We run ZooKeeper in containers on Kubernetes. In Kubernetes many things 
> can cause a container to move hosts; the pod disruption budget ensures 
> that a quorum is always present. But with this bug, if all nodes move for 
> any reason and get new IP addresses, clients disconnect when they 
> shouldn't. Disconnecting has the bad side effect that all ephemeral nodes 
> are lost. For us this affects coordination, distributed locking and 
> service discovery. It causes production downtime, so I marked it as a 
> Blocker.
> I have a simple sample which just uses the service discovery recipe to 
> register a bunch of services in ZooKeeper. I run the example with Docker 
> Compose. It is 100% reproducible.
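> Continuing from the client above, the registration itself looks roughly like 
> this (again a sketch; the base path and service name match the znode path in 
> the stack trace below, the address and port are placeholders):
> {code:java}
> import org.apache.curator.x.discovery.ServiceDiscovery;
> import org.apache.curator.x.discovery.ServiceDiscoveryBuilder;
> import org.apache.curator.x.discovery.ServiceInstance;
> 
> // The recipe stores each instance as an ephemeral node under
> // /myservices/test/<uuid>, so the registration vanishes if the
> // client's session is lost.
> ServiceInstance<Void> instance = ServiceInstance.<Void>builder()
>         .name("test")
>         .address("server-1")   // placeholder
>         .port(8080)            // placeholder
>         .build();
> ServiceDiscovery<Void> discovery = ServiceDiscoveryBuilder.builder(Void.class)
>         .client(client)
>         .basePath("/myservices")
>         .thisInstance(instance)
>         .build();
> discovery.start();
> {code}
> The reproduction steps: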
>  
> {code:bash}
> # Stand up zookeeper and wait for it to be healthy
> docker-compose up -d zookeeper1 zookeeper2 zookeeper3
> # Stand up a server and make sure it is connected and working as expected
> docker-compose up -d server1
> # Take down a single zookeeper node and stand up another agent.
> # The agent will grab the old zookeeper's IP address
> docker-compose rm -s zookeeper1
> docker-compose up -d server2
> # Bring the zookeeper node back up.
> # Wait for it to be healthy
> docker-compose up -d zookeeper1
> # Then take down the next zookeeper node and stand up another agent.
> # The agent will grab the old zookeeper's IP address
> docker-compose rm -s zookeeper2
> docker-compose up -d server3
> # Bring the zookeeper node back up.
> # Wait for it to be healthy
> docker-compose up -d zookeeper2
> # Then take down the next zookeeper node and stand up another agent.
> # The agent will grab the old zookeeper's IP address
> docker-compose rm -s zookeeper3
> docker-compose up -d server4
> # Bring the zookeeper node back up.
> # Wait for it to be healthy
> docker-compose up -d zookeeper3{code}
>  
> At the time the third ZooKeeper node is taken down, server1, the first 
> server that was stood up, receives a disconnected status, because the IPs 
> of all three nodes have now changed from the original addresses.
>  
> {code:java}
> server1_1     | Query instances for servicetest
> server1_1     | Exception in thread "main" java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /myservices/test/62e23a0b-dfdb-46f5-966f-8dc7a4978c70
> server1_1     |       at org.apache.curator.shaded.com.google.common.base.Throwables.propagate(Throwables.java:241)
> server1_1     |       at org.apache.curator.utils.ExceptionAccumulator.propagate(ExceptionAccumulator.java:38)
> server1_1     |       at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.close(ServiceDiscoveryImpl.java:171)
> server1_1     |       at org.apache.curator.shaded.com.google.common.io.Closeables.close(Closeables.java:78)
> server1_1     |       at org.apache.curator.utils.CloseableUtils.closeQuietly(CloseableUtils.java:59)
> server1_1     |       at zkissue.App.main(App.java:72)
> server1_1     | Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /myservices/test/62e23a0b-dfdb-46f5-966f-8dc7a4978c70
> server1_1     |       at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> server1_1     |       at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
> server1_1     |       at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001)
> server1_1     |       at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:274)
> server1_1     |       at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:268)
> server1_1     |       at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
> server1_1     |       at org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:265)
> server1_1     |       at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:249)
> server1_1     |       at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:34)
> server1_1     |       at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.internalUnregisterService(ServiceDiscoveryImpl.java:520)
> server1_1     |       at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.close(ServiceDiscoveryImpl.java:157)
> server1_1     |       ... 3 more
> {code}
>  
> This causes server1 to disconnect and lose its discovery state, which can 
> be seen from the other services.
> {code:java}
> server2_1     | Query instances for servicetest
> server2_1     | test
> server2_1     |       service description: http://server-4:57456
> server2_1     |       service description: http://server-3:37740
> server2_1     |       service description: http://server-2:40219{code}
>  
> I should mention that the ZooKeeper cluster itself is happy and healthy 
> the whole time. This is a client-side issue.
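> For anyone reproducing this, the client-side transitions are easy to watch 
> with a connection state listener on the same client (just a sketch, not part 
> of the attached sample):
> {code:java}
> // SUSPENDED fires when the last resolved IP stops answering; LOST means
> // the session expired and every ephemeral node, including the discovery
> // registrations above, is gone.
> client.getConnectionStateListenable().addListener(
>         (c, newState) -> System.out.println("Connection state: " + newState));
> {code}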
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
