Juan Ramos created GEODE-9402:
---------------------------------

             Summary: Automatic Reconnect Failure: Address already in use
                 Key: GEODE-9402
                 URL: https://issues.apache.org/jira/browse/GEODE-9402
             Project: Geode
          Issue Type: Bug
          Components: membership
            Reporter: Juan Ramos


The test runs 2 locators and 4 servers. Once they are all up and running, the 
test drops the network connectivity between all members to generate a full 
network partition, which causes all members to shut down and go into reconnect 
mode. Upon reaching that state, the test automatically restores the network 
connectivity and expects all members to come back up and re-form the 
distributed system.
This works fine most of the time, and we see every member successfully 
reconnecting to the distributed system:
{noformat}
[info 2021/06/23 15:58:12.981 GMT gemfire-cluster-locator-0 <ReconnectThread> tid=0x87] Reconnect completed.

[info 2021/06/23 15:58:14.726 GMT gemfire-cluster-locator-1 <ReconnectThread> tid=0x86] Reconnect completed.

[info 2021/06/23 15:58:46.702 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x94] Reconnect completed.

[info 2021/06/23 15:58:46.485 GMT gemfire-cluster-server-1 <ReconnectThread> tid=0x96] Reconnect completed.

[info 2021/06/23 15:58:46.273 GMT gemfire-cluster-server-2 <ReconnectThread> tid=0x97] Reconnect completed.

[info 2021/06/23 15:58:46.902 GMT gemfire-cluster-server-3 <ReconnectThread> tid=0x95] Reconnect completed.
{noformat}
On rare occasions, though, one of the servers fails during the reconnect 
phase with the following exception:
{noformat}
[error 2021/06/09 18:48:52.872 GMT gemfire-cluster-server-1 <ReconnectThread> tid=0x91] Cache initialization for GemFireCache[id = 575310555; isClosing = false; isShutDownAll = false; created = Wed Jun 09 18:46:49 GMT 2021; server = false; copyOnRead = false; lockLease = 120; lockTimeout = 60] failed because:
org.apache.geode.GemFireIOException: While starting cache server CacheServer on port=40404 client subscription config policy=none client subscription config capacity=1 client subscription config overflow directory=.
        at org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:800)
        at org.apache.geode.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:599)
        at org.apache.geode.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:339)
        at org.apache.geode.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4207)
        at org.apache.geode.internal.cache.ClusterConfigurationLoader.applyClusterXmlConfiguration(ClusterConfigurationLoader.java:197)
        at org.apache.geode.internal.cache.GemFireCacheImpl.applyJarAndXmlFromClusterConfig(GemFireCacheImpl.java:1497)
        at org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1449)
        at org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:191)
        at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2668)
        at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2426)
        at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1277)
        at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
        at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1183)
        at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1807)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.BindException: Address already in use (Bind failed)
        at java.base/java.net.PlainSocketImpl.socketBind(Native Method)
        at java.base/java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:436)
        at java.base/java.net.ServerSocket.bind(ServerSocket.java:395)
        at org.apache.geode.internal.net.SCClusterSocketCreator.createServerSocket(SCClusterSocketCreator.java:70)
        at org.apache.geode.internal.net.SocketCreator.createServerSocket(SocketCreator.java:529)
        at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.<init>(AcceptorImpl.java:573)
        at org.apache.geode.internal.cache.tier.sockets.AcceptorBuilder.create(AcceptorBuilder.java:291)
        at org.apache.geode.internal.cache.CacheServerImpl.createAcceptor(CacheServerImpl.java:420)
        at org.apache.geode.internal.cache.CacheServerImpl.start(CacheServerImpl.java:377)
        at org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:796)
        ... 14 more
{noformat}
It seems that the server tries to bind the port before the old instance has 
finished shutting down and releasing its resources. The failure causes the 
reconnect process to halt and stop retrying, leaving the cluster with one 
fewer member.
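
To illustrate the suspected race outside of Geode, here is a minimal sketch in 
plain Java. It is not Geode code and not a proposed patch: {{BindRetrySketch}} 
and {{bindWithRetry}} are hypothetical names, and the sketch assumes the 
{{BindException}} is transient, that is, the old instance releases the port 
shortly after the new one first tries to bind it. Under that assumption, 
retrying the bind with a short delay succeeds instead of aborting on the first 
failure.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class BindRetrySketch {

  /**
   * Hypothetical helper (not part of Geode): tries to bind a server socket to
   * the given port, retrying on BindException up to maxAttempts times and
   * sleeping delayMillis between attempts.
   */
  static ServerSocket bindWithRetry(int port, int maxAttempts, long delayMillis)
      throws IOException, InterruptedException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    BindException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      ServerSocket socket = new ServerSocket(); // created unbound
      try {
        socket.bind(new InetSocketAddress(port));
        return socket;
      } catch (BindException e) {
        // Port still held (assumed: by the instance that is shutting down).
        socket.close();
        last = e;
        if (attempt < maxAttempts) {
          Thread.sleep(delayMillis);
        }
      } catch (IOException e) {
        socket.close();
        throw e;
      }
    }
    throw last; // every attempt failed with BindException
  }

  public static void main(String[] args) throws Exception {
    // Simulate the old instance that still holds the port during shutdown.
    ServerSocket oldInstance = new ServerSocket(0);
    int port = oldInstance.getLocalPort();

    // Release the port a little later, as a slow shutdown would.
    Thread releaser = new Thread(() -> {
      try {
        Thread.sleep(300);
        oldInstance.close();
      } catch (Exception ignored) {
        // nothing to do in this sketch
      }
    });
    releaser.start();

    // A single bind attempt here would fail; the retry loop waits it out.
    try (ServerSocket fresh = bindWithRetry(port, 20, 100)) {
      System.out.println("bound=" + fresh.isBound() + " port=" + fresh.getLocalPort());
    }
    releaser.join();
  }
}
```

Whether a retry like this belongs in the reconnect path, or whether the old 
instance should instead be required to finish closing its acceptor before the 
reconnect starts, is an open question for the actual fix.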

We've been able to reproduce the problem only twice in the past few weeks. I've 
attached the two sets of artefacts to the ticket:
 - _*cluster_logs_pks_121*_: the member that throws the {{BindException}} 
during reconnect is {{gemfire-cluster-server-1}}.
 - _*cluster_logs_gke_latest_54*_: the member that throws the {{BindException}} 
during reconnect is {{gemfire-cluster-server-0}}.



