[ 
https://issues.apache.org/jira/browse/GEODE-9402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440828#comment-17440828
 ] 

Bill Burcham commented on GEODE-9402:
-------------------------------------

h2. Summary

In each of the attached logs, we see the member that logged the BindException 
eventually joining the view (in 8 and 11 seconds respectively).

My suspicion is that what we see here is nondeterminism in the time it takes 
for a port to become available after it is unbound.
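
For illustration only, here is a minimal sketch of how a caller could tolerate that nondeterminism by retrying the bind for a bounded time. This is not Geode's code and not a proposed patch. The class name BindWithRetry and the maxAttempts and delayMillis parameters are made up for the example.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

final class BindWithRetry {

    /**
     * Tries to bind a server socket to the given port, retrying when the
     * port is still reported as in use. maxAttempts and delayMillis are
     * hypothetical knobs for this sketch, not Geode settings.
     */
    static ServerSocket bindWithRetry(int port, int maxAttempts, long delayMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        BindException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            ServerSocket socket = new ServerSocket(); // created unbound
            try {
                // SO_REUSEADDR has to be set before bind(); it lets the bind
                // succeed while an earlier connection is in a timeout state.
                socket.setReuseAddress(true);
                socket.bind(new InetSocketAddress(port));
                return socket;
            } catch (BindException e) {
                socket.close();
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis); // give the OS time to release the port
                }
            }
        }
        throw last; // every attempt failed with "Address already in use"
    }

    public static void main(String[] args) throws Exception {
        // Port 0 asks the OS for any free port, so this demo always binds.
        try (ServerSocket socket = bindWithRetry(0, 5, 200L)) {
            System.out.println("bound to port " + socket.getLocalPort());
        }
    }
}
```

In the attached logs the recovery happens at a coarser level: the member leaves the view and rejoins a few seconds later, at which point the port is available.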

Since the members in question do re-join the cluster successfully I don't think 
this is a bug. What do you think, [~jjramos]?

h2. Detailed Analysis of cluster_logs_gke_latest_54

In cluster_logs_gke_latest_54, the quorum loss is logged here:

[Entry id=4208, date=2021/06/23 15:55:48.119 GMT, level=fatal, thread=tid=0x92, 
emitter=Geode Membership View Creator, message=Possible loss of quorum due to 
the loss of 5 cache processes: 
[gemfire-cluster-server-3(gemfire-cluster-server-3:1)<v5>:41000, 
gemfire-cluster-server-1(gemfire-cluster-server-1:1)<v3>:41000, 
gemfire-cluster-locator-1(gemfire-cluster-locator-1:1:locator)<ec><v1>:41000, 
gemfire-cluster-server-2(gemfire-cluster-server-2:1)<v2>:41000, 
gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v0>:41000]
, Host=gemfire-cluster-server-0 , 
mergedFile=/Users/bburcham/Downloads/cluster_logs_gke_latest_54/gemfire-cluster-server-0/gemfire-cluster-server-0-01-01.log]

It takes about two minutes (roughly 130 seconds) for the network partition to 
heal and for a coordinator to be designated. It is still to be determined how 
much of that time was due to the test delaying the healing of the partition, 
and how much was spent re-forming the cluster after the partition was healed. 
Here's the coordinator thread starting:

[Entry id=4925, date=2021/06/23 15:57:57.671 GMT, level=info, thread=tid=0x87, 
emitter=ReconnectThread, message=This member is becoming the membership 
coordinator with address 
gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec>:41000
, Host=gemfire-cluster-locator-0 , 
mergedFile=/Users/bburcham/Downloads/cluster_logs_gke_latest_54/gemfire-cluster-locator-0/gemfire-cluster-locator-0.log]

That point in time corresponds to view 21 (the pre-partition view sequence 
ended at view 5):

[Entry id=4960, date=2021/06/23 15:57:58.009 GMT, level=info, thread=tid=0xad, 
emitter=Geode Membership View Creator, message=sending new view 
View[gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v0>:41000|21]
 members: 
[gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v0>:41000, 
gemfire-cluster-server-0(gemfire-cluster-server-0:1)<v1>:41000\{lead}, 
gemfire-cluster-server-1(gemfire-cluster-server-1:1)<v1>:41000, 
gemfire-cluster-server-3(gemfire-cluster-server-3:1)<v1>:41000, 
gemfire-cluster-server-2(gemfire-cluster-server-2:1)<v1>:41000, 
gemfire-cluster-locator-1(gemfire-cluster-locator-1:1:locator)<ec><v1>:41000]  
crashed: 
[gemfire-cluster-locator-1(gemfire-cluster-locator-1:1:locator)<ec><v1>:41000, 
gemfire-cluster-server-2(gemfire-cluster-server-2:1)<v2>:41000, 
gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v0>:41000]
, Host=gemfire-cluster-locator-0 , 
mergedFile=/Users/bburcham/Downloads/cluster_logs_gke_latest_54/gemfire-cluster-locator-0/gemfire-cluster-locator-0.log]

About 2 minutes, 33 seconds later, server-0 logs the BindException while 
reconnecting:

[Entry id=5536, date=2021/06/23 16:00:31.491 GMT, level=error, thread=tid=0x94, 
emitter=ReconnectThread, message=Cache initialization for GemFireCache[id = 
1795575589; isClosing = false; isShutDownAll = false; created = Wed Jun 23 
15:58:29 GMT 2021; server = false; copyOnRead = false; lockLease = 120; 
lockTimeout = 60] failed because:
org.apache.geode.GemFireIOException: While starting cache server CacheServer on 
port=40404 client subscription config policy=none client subscription config 
capacity=1 client subscription config overflow directory=.
    at 
org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:800)
    at 
org.apache.geode.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:599)
    at 
org.apache.geode.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:339)
    at 
org.apache.geode.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4207)
    at 
org.apache.geode.internal.cache.ClusterConfigurationLoader.applyClusterXmlConfiguration(ClusterConfigurationLoader.java:199)
    at 
org.apache.geode.internal.cache.GemFireCacheImpl.applyJarAndXmlFromClusterConfig(GemFireCacheImpl.java:1497)
    at 
org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1449)
    at 
org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:191)
    at 
org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2668)
    at 
org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2426)
    at 
org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1277)
    at 
org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2326)
    at 
org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1187)
    at 
org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1811)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.BindException: Address already in use (Bind failed)
    at java.base/java.net.PlainSocketImpl.socketBind(Native Method)
    at 
java.base/java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:436)
    at java.base/java.net.ServerSocket.bind(ServerSocket.java:395)
    at 
org.apache.geode.internal.net.SCClusterSocketCreator.createServerSocket(SCClusterSocketCreator.java:70)
    at 
org.apache.geode.internal.net.SocketCreator.createServerSocket(SocketCreator.java:529)
    at 
org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.<init>(AcceptorImpl.java:573)
    at 
org.apache.geode.internal.cache.tier.sockets.AcceptorBuilder.create(AcceptorBuilder.java:291)
    at 
org.apache.geode.internal.cache.CacheServerImpl.createAcceptor(CacheServerImpl.java:420)
    at 
org.apache.geode.internal.cache.CacheServerImpl.start(CacheServerImpl.java:377)
    at 
org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:796)
    ... 14 more
, Host=gemfire-cluster-server-0 , 
mergedFile=/Users/bburcham/Downloads/cluster_logs_gke_latest_54/gemfire-cluster-server-0/gemfire-cluster-server-0.log]

And server-0 is removed from the view (with designation "shutdown") within a 
second of that:

[Entry id=5585, date=2021/06/23 16:00:32.438 GMT, level=info, thread=tid=0xad, 
emitter=Geode Membership View Creator, message=sending new view 
View[gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v0>:41000|22]
 members: 
[gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v0>:41000, 
gemfire-cluster-server-1(gemfire-cluster-server-1:1)<v1>:41000\{lead}, 
gemfire-cluster-server-3(gemfire-cluster-server-3:1)<v1>:41000, 
gemfire-cluster-server-2(gemfire-cluster-server-2:1)<v1>:41000, 
gemfire-cluster-locator-1(gemfire-cluster-locator-1:1:locator)<ec><v1>:41000]  
shutdown: [gemfire-cluster-server-0(gemfire-cluster-server-0:1)<v1>:41000]
, Host=gemfire-cluster-locator-0 , 
mergedFile=/Users/bburcham/Downloads/cluster_logs_gke_latest_54/gemfire-cluster-locator-0/gemfire-cluster-locator-0.log]

But we see server-0 right back in the next view within 8 seconds:

[Entry id=5601, date=2021/06/23 16:00:39.844 GMT, level=info, thread=tid=0xad, 
emitter=Geode Membership View Creator, message=sending new view 
View[gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v0>:41000|23]
 members: 
[gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v0>:41000, 
gemfire-cluster-server-1(gemfire-cluster-server-1:1)<v1>:41000\{lead}, 
gemfire-cluster-server-3(gemfire-cluster-server-3:1)<v1>:41000, 
gemfire-cluster-server-2(gemfire-cluster-server-2:1)<v1>:41000, 
gemfire-cluster-locator-1(gemfire-cluster-locator-1:1:locator)<ec><v1>:41000, 
gemfire-cluster-server-0(gemfire-cluster-server-0:1)<v23>:41000]
, Host=gemfire-cluster-locator-0 , 
mergedFile=/Users/bburcham/Downloads/cluster_logs_gke_latest_54/gemfire-cluster-locator-0/gemfire-cluster-locator-0.log]
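
When comparing views like the ones above, it can help to pull the <vN> tag out of each member ID mechanically. The sketch below is a throwaway log-reading helper, not part of Geode. It assumes the member-ID layout seen in these log lines and assumes that the number in <vN> is the view in which the member joined. The class and method names are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ViewMembers {

    // Member IDs in these logs look like name(name:1)<v23>:41000, optionally
    // with ":locator" inside the parentheses and an "<ec>" flag before <vN>.
    private static final Pattern MEMBER =
        Pattern.compile("([\\w-]+)\\([^)]*\\)(?:<ec>)?<v(\\d+)>:\\d+");

    /** Maps each member name to the number in its <vN> tag, in order of appearance. */
    static Map<String, Integer> joinViews(String memberList) {
        Map<String, Integer> result = new LinkedHashMap<>();
        Matcher matcher = MEMBER.matcher(memberList);
        while (matcher.find()) {
            result.put(matcher.group(1), Integer.parseInt(matcher.group(2)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two members copied from the view 23 entry above.
        String members =
            "gemfire-cluster-server-1(gemfire-cluster-server-1:1)<v1>:41000, "
            + "gemfire-cluster-server-0(gemfire-cluster-server-0:1)<v23>:41000";
        System.out.println(joinViews(members));
    }
}
```

Under that assumption, server-0 carrying <v23> in view 23 is consistent with it having rejoined as a fresh member rather than having stayed in the view.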

The net effect is that server-0 does eventually join the cluster.

h2. Detailed Analysis of cluster_logs_pks_121

Turning to cluster_logs_pks_121, we see the quorum loss here:

[Entry id=4178, date=2021/06/09 18:44:32.157 GMT, level=fatal, thread=tid=0x81, 
emitter=Geode Membership View Creator, message=Possible loss of quorum due to 
the loss of 5 cache processes: 
[gemfire-cluster-server-3(gemfire-cluster-server-3:1)<v3>:41000, 
gemfire-cluster-server-0(gemfire-cluster-server-0:1)<v5>:41000, 
gemfire-cluster-server-2(gemfire-cluster-server-2:1)<v2>:41000, 
gemfire-cluster-server-1(gemfire-cluster-server-1:1)<v4>:41000, 
gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v0>:41000]
, Host=gemfire-cluster-locator-1 , 
mergedFile=/Users/bburcham/Downloads/cluster_logs_pks_121/gemfire-cluster-locator-1/gemfire-cluster-locator-1-01-01.log]

It takes about 96 seconds for the network partition to heal and for a 
coordinator to be designated:

[Entry id=4731, date=2021/06/09 18:46:08.286 GMT, level=info, thread=tid=0x8e, 
emitter=ReconnectThread, message=This member is becoming the membership 
coordinator with address 
gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec>:41000
, Host=gemfire-cluster-locator-0 , 
mergedFile=/Users/bburcham/Downloads/cluster_logs_pks_121/gemfire-cluster-locator-0/gemfire-cluster-locator-0.log]

That point in time corresponds to view 21 (the pre-partition view sequence 
ended at view 5):

[Entry id=4781, date=2021/06/09 18:46:08.640 GMT, level=info, thread=tid=0xaa, 
emitter=Geode Membership View Creator, message=sending new view 
View[gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v0>:41000|21]
 members: 
[gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v0>:41000, 
gemfire-cluster-server-1(gemfire-cluster-server-1:1)<v1>:41000\{lead}, 
gemfire-cluster-server-3(gemfire-cluster-server-3:1)<v1>:41000]  crashed: 
[gemfire-cluster-locator-1(gemfire-cluster-locator-1:1:locator)<ec><v1>:41000, 
gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v0>:41000]
, Host=gemfire-cluster-locator-0 , 
mergedFile=/Users/bburcham/Downloads/cluster_logs_pks_121/gemfire-cluster-locator-0/gemfire-cluster-locator-0.log]

Then about 2 minutes, 44 seconds later, we see server-1 log the BindException:

[Entry id=5370, date=2021/06/09 18:48:52.872 GMT, level=error, thread=tid=0x91, 
emitter=ReconnectThread, message=Cache initialization for GemFireCache[id = 
575310555; isClosing = false; isShutDownAll = false; created = Wed Jun 09 
18:46:49 GMT 2021; server = false; copyOnRead = false; lockLease = 120; 
lockTimeout = 60] failed because:
org.apache.geode.GemFireIOException: While starting cache server CacheServer on 
port=40404 client subscription config policy=none client subscription config 
capacity=1 client subscription config overflow directory=.
    at 
org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:800)
    at 
org.apache.geode.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:599)
    at 
org.apache.geode.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:339)
    at 
org.apache.geode.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4207)
    at 
org.apache.geode.internal.cache.ClusterConfigurationLoader.applyClusterXmlConfiguration(ClusterConfigurationLoader.java:197)
    at 
org.apache.geode.internal.cache.GemFireCacheImpl.applyJarAndXmlFromClusterConfig(GemFireCacheImpl.java:1497)
    at 
org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1449)
    at 
org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:191)
    at 
org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2668)
    at 
org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2426)
    at 
org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1277)
    at 
org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
    at 
org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1183)
    at 
org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1807)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.BindException: Address already in use (Bind failed)
    at java.base/java.net.PlainSocketImpl.socketBind(Native Method)
    at 
java.base/java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:436)
    at java.base/java.net.ServerSocket.bind(ServerSocket.java:395)
    at 
org.apache.geode.internal.net.SCClusterSocketCreator.createServerSocket(SCClusterSocketCreator.java:70)
    at 
org.apache.geode.internal.net.SocketCreator.createServerSocket(SocketCreator.java:529)
    at 
org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.<init>(AcceptorImpl.java:573)
    at 
org.apache.geode.internal.cache.tier.sockets.AcceptorBuilder.create(AcceptorBuilder.java:291)
    at 
org.apache.geode.internal.cache.CacheServerImpl.createAcceptor(CacheServerImpl.java:420)
    at 
org.apache.geode.internal.cache.CacheServerImpl.start(CacheServerImpl.java:377)
    at 
org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:796)
    ... 14 more
, Host=gemfire-cluster-server-1 , 
mergedFile=/Users/bburcham/Downloads/cluster_logs_pks_121/gemfire-cluster-server-1/gemfire-cluster-server-1.log]


And server-1 is dropped from the view about a second later (with designation 
"shutdown"):

[Entry id=5418, date=2021/06/09 18:48:53.883 GMT, level=info, thread=tid=0xaa, 
emitter=Geode Membership View Creator, message=sending new view 
View[gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v0>:41000|24]
 members: 
[gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v0>:41000, 
gemfire-cluster-server-3(gemfire-cluster-server-3:1)<v1>:41000\{lead}, 
gemfire-cluster-server-0(gemfire-cluster-server-0:1)<v22>:41000, 
gemfire-cluster-server-2(gemfire-cluster-server-2:1)<v22>:41000, 
gemfire-cluster-locator-1(gemfire-cluster-locator-1:1:locator)<ec><v23>:41000]  
shutdown: [gemfire-cluster-server-1(gemfire-cluster-server-1:1)<v1>:41000]
, Host=gemfire-cluster-locator-0 , 
mergedFile=/Users/bburcham/Downloads/cluster_logs_pks_121/gemfire-cluster-locator-0/gemfire-cluster-locator-0.log]

But we see server-1 right back in the next view within 11 seconds:

[Entry id=5434, date=2021/06/09 18:49:04.692 GMT, level=info, thread=tid=0xaa, 
emitter=Geode Membership View Creator, message=sending new view 
View[gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v0>:41000|25]
 members: 
[gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v0>:41000, 
gemfire-cluster-server-3(gemfire-cluster-server-3:1)<v1>:41000\{lead}, 
gemfire-cluster-server-0(gemfire-cluster-server-0:1)<v22>:41000, 
gemfire-cluster-server-2(gemfire-cluster-server-2:1)<v22>:41000, 
gemfire-cluster-locator-1(gemfire-cluster-locator-1:1:locator)<ec><v23>:41000, 
gemfire-cluster-server-1(gemfire-cluster-server-1:1)<v25>:41000]
, Host=gemfire-cluster-locator-0 , 
mergedFile=/Users/bburcham/Downloads/cluster_logs_pks_121/gemfire-cluster-locator-0/gemfire-cluster-locator-0.log]

 

In both logs, the member that received the BindException eventually did join 
the view.
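
The rejoin times quoted above can be checked directly from the log timestamps. The snippet below is a small sketch for that arithmetic. The class name LogGaps is made up, and the timestamps are copied from the "sending new view" entries in the two analyses (the "shutdown" view and the next view that contains the member again).

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class LogGaps {

    // Matches the date field of the log entries, e.g. "2021/06/23 16:00:31.491".
    // All entries are in GMT, so no zone handling is needed for differences.
    private static final DateTimeFormatter LOG_TIME =
        DateTimeFormatter.ofPattern("uuuu/MM/dd HH:mm:ss.SSS");

    /** Returns the number of milliseconds from the earlier to the later timestamp. */
    static long millisBetween(String earlier, String later) {
        LocalDateTime start = LocalDateTime.parse(earlier, LOG_TIME);
        LocalDateTime end = LocalDateTime.parse(later, LOG_TIME);
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        // cluster_logs_gke_latest_54: view 22 (server-0 shutdown) to view 23
        System.out.println(
            millisBetween("2021/06/23 16:00:32.438", "2021/06/23 16:00:39.844")); // 7406
        // cluster_logs_pks_121: view 24 (server-1 shutdown) to view 25
        System.out.println(
            millisBetween("2021/06/09 18:48:53.883", "2021/06/09 18:49:04.692")); // 10809
    }
}
```

That gives about 7.4 and 10.8 seconds between removal and rejoin, which matches the "within 8 seconds" and "within 11 seconds" figures.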

> Automatic Reconnect Failure: Address already in use
> ---------------------------------------------------
>
>                 Key: GEODE-9402
>                 URL: https://issues.apache.org/jira/browse/GEODE-9402
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>            Reporter: Juan Ramos
>            Assignee: Bill Burcham
>            Priority: Major
>         Attachments: cluster_logs_gke_latest_54.zip, cluster_logs_pks_121.zip
>
>
> The test runs 2 locators and 4 servers. Once they're all up and running, 
> the test drops the network connectivity between all members to generate a 
> full network partition and cause all members to shut down and go into 
> reconnect mode. Upon reaching that state, the test automatically restores 
> the network connectivity and expects all members to come back up and 
> re-form the distributed system.
>  This works fine most of the time, and we see every member successfully 
> reconnecting to the distributed system:
> {noformat}
> [info 2021/06/23 15:58:12.981 GMT gemfire-cluster-locator-0 <ReconnectThread> 
> tid=0x87] Reconnect completed.
> [info 2021/06/23 15:58:14.726 GMT gemfire-cluster-locator-1 <ReconnectThread> 
> tid=0x86] Reconnect completed.
> [info 2021/06/23 15:58:46.702 GMT gemfire-cluster-server-0 <ReconnectThread> 
> tid=0x94] Reconnect completed.
> [info 2021/06/23 15:58:46.485 GMT gemfire-cluster-server-1 <ReconnectThread> 
> tid=0x96] Reconnect completed.
> [info 2021/06/23 15:58:46.273 GMT gemfire-cluster-server-2 <ReconnectThread> 
> tid=0x97] Reconnect completed.
> [info 2021/06/23 15:58:46.902 GMT gemfire-cluster-server-3 <ReconnectThread> 
> tid=0x95] Reconnect completed.
> {noformat}
> On rare occasions, though, one of the servers fails during the reconnect 
> phase with the following exception:
> {noformat}
> [error 2021/06/09 18:48:52.872 GMT gemfire-cluster-server-1 <ReconnectThread> 
> tid=0x91] Cache initialization for GemFireCache[id = 575310555; isClosing = 
> false; isShutDownAll = false; created = Wed Jun 09 18:46:49 GMT 2021; server 
> = false; copyOnRead = false; lockLease = 120; lockTimeout = 60] failed 
> because:
> org.apache.geode.GemFireIOException: While starting cache server CacheServer 
> on port=40404 client subscription config policy=none client subscription 
> config capacity=1 client subscription config overflow directory=.
>       at 
> org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:800)
>       at 
> org.apache.geode.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:599)
>       at 
> org.apache.geode.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:339)
>       at 
> org.apache.geode.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4207)
>       at 
> org.apache.geode.internal.cache.ClusterConfigurationLoader.applyClusterXmlConfiguration(ClusterConfigurationLoader.java:197)
>       at 
> org.apache.geode.internal.cache.GemFireCacheImpl.applyJarAndXmlFromClusterConfig(GemFireCacheImpl.java:1497)
>       at 
> org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1449)
>       at 
> org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:191)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2668)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2426)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1277)
>       at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
>       at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1183)
>       at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1807)
>       at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.net.BindException: Address already in use (Bind failed)
>       at java.base/java.net.PlainSocketImpl.socketBind(Native Method)
>       at 
> java.base/java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:436)
>       at java.base/java.net.ServerSocket.bind(ServerSocket.java:395)
>       at 
> org.apache.geode.internal.net.SCClusterSocketCreator.createServerSocket(SCClusterSocketCreator.java:70)
>       at 
> org.apache.geode.internal.net.SocketCreator.createServerSocket(SocketCreator.java:529)
>       at 
> org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.<init>(AcceptorImpl.java:573)
>       at 
> org.apache.geode.internal.cache.tier.sockets.AcceptorBuilder.create(AcceptorBuilder.java:291)
>       at 
> org.apache.geode.internal.cache.CacheServerImpl.createAcceptor(CacheServerImpl.java:420)
>       at 
> org.apache.geode.internal.cache.CacheServerImpl.start(CacheServerImpl.java:377)
>       at 
> org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:796)
>       ... 14 more
> {noformat}
> It seems that the server is trying to bind the port before the old instance 
> has finished shutting down and cleaning up resources, causing the reconnect 
> process to halt and stop re-trying, and leaving the cluster with one less 
> member.
> We've been able to reproduce the problem only twice in the past few weeks. 
> I've attached the two sets of artefacts to the ticket:
>  - _*cluster_logs_pks_121*_: the member that throws the {{BindException}} 
> during reconnect is {{gemfire-cluster-server-1}}.
>  - _*cluster_logs_gke_latest_54*_: the member that throws the 
> {{BindException}} during reconnect is {{gemfire-cluster-server-0}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
