Juan Ramos created GEODE-9402:
---------------------------------
Summary: Automatic Reconnect Failure: Address already in use
Key: GEODE-9402
URL: https://issues.apache.org/jira/browse/GEODE-9402
Project: Geode
Issue Type: Bug
Components: membership
Reporter: Juan Ramos
There are 2 locators and 4 servers during the test, once they're all up and
running the test drops the network connectivity between all members to generate
a full network partition and cause all members to shutdown and go into
reconnect mode. Upon reaching the mentioned state, the test automatically
restores the network connectivity and expects all members to automatically go
up again and re-form the distributed system.
This works fine most of the time, and we see every member successfully
reconnecting to the distributed system:
{noformat}
[info 2021/06/23 15:58:12.981 GMT gemfire-cluster-locator-0 <ReconnectThread>
tid=0x87] Reconnect completed.
[info 2021/06/23 15:58:14.726 GMT gemfire-cluster-locator-1 <ReconnectThread>
tid=0x86] Reconnect completed.
[info 2021/06/23 15:58:46.702 GMT gemfire-cluster-server-0 <ReconnectThread>
tid=0x94] Reconnect completed.
[info 2021/06/23 15:58:46.485 GMT gemfire-cluster-server-1 <ReconnectThread>
tid=0x96] Reconnect completed.
[info 2021/06/23 15:58:46.273 GMT gemfire-cluster-server-2 <ReconnectThread>
tid=0x97] Reconnect completed.
[info 2021/06/23 15:58:46.902 GMT gemfire-cluster-server-3 <ReconnectThread>
tid=0x95] Reconnect completed.
{noformat}
In some rare occasions, though, one of the servers fails during the reconnect
phase with the following exception:
{noformat}
[error 2021/06/09 18:48:52.872 GMT gemfire-cluster-server-1 <ReconnectThread>
tid=0x91] Cache initialization for GemFireCache[id = 575310555; isClosing =
false; isShutDownAll = false; created = Wed Jun 09 18:46:49 GMT 2021; server =
false; copyOnRead = false; lockLease = 120; lockTimeout = 60] failed because:
org.apache.geode.GemFireIOException: While starting cache server CacheServer on
port=40404 client subscription config policy=none client subscription config
capacity=1 client subscription config overflow directory=.
at
org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:800)
at
org.apache.geode.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:599)
at
org.apache.geode.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:339)
at
org.apache.geode.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4207)
at
org.apache.geode.internal.cache.ClusterConfigurationLoader.applyClusterXmlConfiguration(ClusterConfigurationLoader.java:197)
at
org.apache.geode.internal.cache.GemFireCacheImpl.applyJarAndXmlFromClusterConfig(GemFireCacheImpl.java:1497)
at
org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1449)
at
org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:191)
at
org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2668)
at
org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2426)
at
org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1277)
at
org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
at
org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1183)
at
org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1807)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.BindException: Address already in use (Bind failed)
at java.base/java.net.PlainSocketImpl.socketBind(Native Method)
at
java.base/java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:436)
at java.base/java.net.ServerSocket.bind(ServerSocket.java:395)
at
org.apache.geode.internal.net.SCClusterSocketCreator.createServerSocket(SCClusterSocketCreator.java:70)
at
org.apache.geode.internal.net.SocketCreator.createServerSocket(SocketCreator.java:529)
at
org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.<init>(AcceptorImpl.java:573)
at
org.apache.geode.internal.cache.tier.sockets.AcceptorBuilder.create(AcceptorBuilder.java:291)
at
org.apache.geode.internal.cache.CacheServerImpl.createAcceptor(CacheServerImpl.java:420)
at
org.apache.geode.internal.cache.CacheServerImpl.start(CacheServerImpl.java:377)
at
org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:796)
... 14 more
{noformat}
It seems that the server is trying to bind the port before the old instance has
finished shutting down and cleaning up resources, causing the reconnect process
to halt and stop re-trying, and leaving the cluster with one less member.
We've been able to reproduce the problem only twice in the past few weeks, I've
attached the two set of artefacts to the ticket:
- _*cluster_logs_pks_121*_: the member that throws the {{BindException}}
during reconnect is {{gemfire-cluster-server-1}}.
- _*cluster_logs_gke_latest_54*_: the member that throws the {{BindException}}
during reconnect is {{gemfire-cluster-server-0}}.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)