Barrett Oglesby created GEODE-7750:
--------------------------------------

             Summary: If a member is force disconnected from the distributed 
system repeatedly, it reconnects each time, but after the second force 
disconnect, it shuts down after reconnecting
                 Key: GEODE-7750
                 URL: https://issues.apache.org/jira/browse/GEODE-7750
             Project: Geode
          Issue Type: Bug
          Components: membership
            Reporter: Barrett Oglesby


If I start a server using gfsh and repeatedly invoke the 
MembershipManagerHelper crashDistributedSystem method to force it to disconnect 
from the distributed system, it reconnects each time, but after the second 
force disconnect, the server shuts down after reconnecting.

Here is truncated logging for the first disconnect:
{noformat}
[info 2020/01/29 11:41:07.613 PST <Function Execution Processor3> tid=0x44] 
crashing distributed system: Connected "server1" (id=76d0ecd7) to distributed 
system using locators "localhost[23456]" logging to cacheserver.log  started at 
Wed Jan 29 11:32:37 PST 2020

[info 2020/01/29 11:41:07.613 PST <Function Execution Processor3> tid=0x44] 
GroupMembershipService.beSick invoked for 
192.168.99.1(server1:51825)<ec><v3>:41001 - simulating sickness

[info 2020/01/29 11:41:07.613 PST <Function Execution Processor3> tid=0x44] 
GroupMembershipService.playDead invoked for 
192.168.99.1(server1:51825)<ec><v3>:41001

[fatal 2020/01/29 11:41:07.613 PST <Function Execution Processor3> tid=0x44] 
Membership service failure: for testing
org.apache.geode.distributed.internal.membership.api.MemberDisconnectedException:
 for testing
        at 
org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.forceDisconnect(GMSMembership.java:2004)
        at 
org.apache.geode.distributed.internal.membership.api.MembershipManagerHelper.crashDistributedSystem(MembershipManagerHelper.java:66)
        at 
MembershipCycleHealthFunction.execute(MembershipCycleHealthFunction.java:20)
        at 
org.apache.geode.internal.cache.execute.AbstractExecution.executeFunctionLocally(AbstractExecution.java:328)
        at 
org.apache.geode.internal.cache.execute.AbstractExecution.lambda$executeFunctionOnLocalNode$1(AbstractExecution.java:299)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at 
org.apache.geode.distributed.internal.ClusterOperationExecutors.runUntilShutdown(ClusterOperationExecutors.java:449)
        at 
org.apache.geode.distributed.internal.ClusterOperationExecutors.doFunctionExecutionThread(ClusterOperationExecutors.java:379)
        at 
org.apache.geode.logging.internal.executors.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:119)
        at java.lang.Thread.run(Thread.java:745)

[info 2020/01/29 11:41:07.616 PST <Function Execution Processor3> tid=0x44] 
CacheServer configuration saved

[info 2020/01/29 11:41:07.626 PST <Geode Failure Detection Server thread 1> 
tid=0x1f] GMSHealthMonitor server thread exiting

[info 2020/01/29 11:41:07.626 PST <DisconnectThread> tid=0x47] Stopping 
membership services

[info 2020/01/29 11:41:07.630 PST <ReconnectThread> tid=0x47] Disconnecting old 
DistributedSystem to prepare for a reconnect attempt

[info 2020/01/29 11:41:07.632 PST <ReconnectThread> tid=0x47] GemFireCache[id = 
101500074; isClosing = true; isShutDownAll = false; created = Wed Jan 29 
11:32:38 PST 2020; server = true; copyOnRead = false; lockLease = 120; 
lockTimeout = 60]: Now closing.

[info 2020/01/29 11:41:07.633 PST <ReconnectThread> tid=0x47] Cache server on 
port 40401 is shutting down.

[info 2020/01/29 11:41:07.657 PST <ReconnectThread> tid=0x47] Shutting down 
DistributionManager 192.168.99.1(server1:51825)<ec><v3>:41001. At least one 
Exception occurred.

[info 2020/01/29 11:41:07.759 PST <ReconnectThread> tid=0x47] Now closing 
distribution for 192.168.99.1(server1:51825)<ec><v3>:41001

[info 2020/01/29 11:41:07.760 PST <ReconnectThread> tid=0x47] 
DistributionManager stopped in 103ms.

[info 2020/01/29 11:41:07.760 PST <ReconnectThread> tid=0x47] Marking 
DistributionManager 192.168.99.1(server1:51825)<ec><v3>:41001 as closed.
{noformat}
Here is truncated logging for the first reconnect:
{noformat}
[info 2020/01/29 11:42:07.755 PST <ReconnectThread> tid=0x47] Attempting to 
reconnect to the distributed system.  This is attempt #1.

[info 2020/01/29 11:42:07.775 PST <ReconnectThread> tid=0x47] Starting 
membership services

[info 2020/01/29 11:42:07.778 PST <ReconnectThread> tid=0x47] Established local 
address 192.168.99.1(server1:51825):41001

[info 2020/01/29 11:42:07.779 PST <ReconnectThread> tid=0x47] GemFire P2P 
Listener started on /192.168.99.1:50431

[info 2020/01/29 11:42:07.793 PST <ReconnectThread> tid=0x47] Attempting to 
join the distributed system through coordinator 
192.168.99.1(locator:51077:locator)<ec><v0>:41000 using address 
192.168.99.1(server1:51825)<ec>:41001

[info 2020/01/29 11:42:08.104 PST <unicast receiver,boglesbymac-63966> 
tid=0x1b] received new view: 
View[192.168.99.1(locator:51077:locator)<ec><v0>:41000|5] members: 
[192.168.99.1(locator:51077:locator)<ec><v0>:41000, 
192.168.99.1(server2:51826)<ec><v3>:41002{lead}, 
192.168.99.1(server1:51825)<ec><v5>:41001]
old view is: null

[info 2020/01/29 11:42:08.105 PST <ReconnectThread> tid=0x47] Starting 
DistributionManager 192.168.99.1(server1:51825)<ec><v5>:41001.  (took 331 ms)

[info 2020/01/29 11:42:08.105 PST <ReconnectThread> tid=0x47] Initial 
(distribution manager) view, 
View[192.168.99.1(locator:51077:locator)<ec><v0>:41000|5] members: 
[192.168.99.1(locator:51077:locator)<ec><v0>:41000, 
192.168.99.1(server2:51826)<ec><v3>:41002{lead}, 
192.168.99.1(server1:51825)<ec><v5>:41001]

[info 2020/01/29 11:42:08.111 PST <P2P message reader for 
192.168.99.1(locator:51077:locator)<ec><v0>:41000 shared unordered uid=4 
port=64693> tid=0x5c] Member 192.168.99.1(locator:51077:locator)<ec><v0>:41000 
is  equivalent or in the same redundancy zone.

[info 2020/01/29 11:42:08.113 PST <P2P message reader for 
192.168.99.1(server2:51826)<ec><v3>:41002 shared unordered uid=5 port=64696> 
tid=0x61] Member 192.168.99.1(server2:51826)<ec><v3>:41002 is  equivalent or in 
the same redundancy zone.

[info 2020/01/29 11:42:08.113 PST <ReconnectThread> tid=0x47] 
DistributionManager 192.168.99.1(server1:51825)<ec><v5>:41001 started on 
localhost[23456]. There were 2 other DMs. others: 
[192.168.99.1(locator:51077:locator)<ec><v0>:41000, 
192.168.99.1(server2:51826)<ec><v3>:41002]  (took 343 ms) 

[info 2020/01/29 11:42:08.392 PST <ReconnectThread> tid=0x47] Cache server 
connection listener bound to address 192.168.99.1-0.0.0.0/0.0.0.0:40401 with 
backlog 1280.

[info 2020/01/29 11:42:08.395 PST <ReconnectThread> tid=0x47] CacheServer 
Configuration:  port=40401 max-connections=800 max-threads=0 
notify-by-subscription=true socket-buffer-size=32768 
maximum-time-between-pings=60000 maximum-message-count=230000 
message-time-to-live=180 eviction-policy=none capacity=1 overflow directory=. 
groups=[] loadProbe=ConnectionCountProbe loadPollInterval=5000 tcpNoDelay=true

[info 2020/01/29 11:42:23.399 PST <ReconnectThread> tid=0x47] Reconnect 
completed.
New DistributedSystem is Connected "server1" (id=17253324) to distributed 
system using locators "localhost[23456]" logging to cacheserver.log  started at 
Wed Jan 29 11:42:07 PST 2020
New Cache is GemFireCache[id = 1930372133; isClosing = false; isShutDownAll = 
false; created = Wed Jan 29 11:42:08 PST 2020; server = false; copyOnRead = 
false; lockLease = 120; lockTimeout = 60]
{noformat}
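For reference, here is a hedged sketch of how application code could observe such a reconnect through the public cache reconnect API (isReconnecting, waitUntilReconnected, getReconnectedCache); the class name and timeout are illustrative only.
{noformat}
import java.util.concurrent.TimeUnit;
import org.apache.geode.cache.Cache;

// Illustrative only: wait for a forced disconnect/reconnect cycle to finish
// and pick up the new Cache instance created by the ReconnectThread.
public final class ReconnectWatcher {

  public static Cache awaitReconnect(Cache oldCache) throws InterruptedException {
    if (oldCache.isReconnecting()) {
      // Blocks until the reconnect attempt completes or the timeout elapses.
      oldCache.waitUntilReconnected(5, TimeUnit.MINUTES);
    }
    Cache reconnected = oldCache.getReconnectedCache();
    return reconnected != null ? reconnected : oldCache;
  }
}
{noformat}
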
The second disconnect looks similar to the first.

The second reconnect also looks similar to the first, except for a key difference.
{noformat}
[info 2020/01/29 11:44:54.243 PST <ReconnectThread> tid=0x78] Attempting to 
reconnect to the distributed system.  This is attempt #1.

[info 2020/01/29 11:44:54.259 PST <ReconnectThread> tid=0x78] Starting 
membership services

[info 2020/01/29 11:44:54.260 PST <ReconnectThread> tid=0x78] Established local 
address 192.168.99.1(server1:51825):41001

[info 2020/01/29 11:44:54.261 PST <ReconnectThread> tid=0x78] GemFire P2P 
Listener started on /192.168.99.1:57338

[info 2020/01/29 11:44:54.273 PST <ReconnectThread> tid=0x78] Attempting to 
join the distributed system through coordinator 
192.168.99.1(locator:51077:locator)<ec><v0>:41000 using address 
192.168.99.1(server1:51825)<ec>:41001

[info 2020/01/29 11:44:54.580 PST <unicast receiver,boglesbymac-63966> 
tid=0x1b] received new view: 
View[192.168.99.1(locator:51077:locator)<ec><v0>:41000|7] members: 
[192.168.99.1(locator:51077:locator)<ec><v0>:41000, 
192.168.99.1(server2:51826)<ec><v3>:41002{lead}, 
192.168.99.1(server1:51825)<ec><v7>:41001]
old view is: null

[info 2020/01/29 11:44:54.581 PST <ReconnectThread> tid=0x78] Starting 
DistributionManager 192.168.99.1(server1:51825)<ec><v7>:41001.  (took 324 ms)

[info 2020/01/29 11:44:54.581 PST <ReconnectThread> tid=0x78] Initial 
(distribution manager) view, 
View[192.168.99.1(locator:51077:locator)<ec><v0>:41000|7] members: 
[192.168.99.1(locator:51077:locator)<ec><v0>:41000, 
192.168.99.1(server2:51826)<ec><v3>:41002{lead}, 
192.168.99.1(server1:51825)<ec><v7>:41001]

[info 2020/01/29 11:44:54.690 PST <P2P message reader for 
192.168.99.1(locator:51077:locator)<ec><v0>:41000 shared unordered uid=5 
port=65360> tid=0x8d] Member 192.168.99.1(locator:51077:locator)<ec><v0>:41000 
is  equivalent or in the same redundancy zone.

[info 2020/01/29 11:44:54.691 PST <P2P message reader for 
192.168.99.1(server2:51826)<ec><v3>:41002 shared unordered uid=6 port=65363> 
tid=0x92] Member 192.168.99.1(server2:51826)<ec><v3>:41002 is  equivalent or in 
the same redundancy zone.

[info 2020/01/29 11:44:54.692 PST <ReconnectThread> tid=0x78] 
DistributionManager 192.168.99.1(server1:51825)<ec><v7>:41001 started on 
localhost[23456]. There were 2 other DMs. others: 
[192.168.99.1(locator:51077:locator)<ec><v0>:41000, 
192.168.99.1(server2:51826)<ec><v3>:41002]  (took 439 ms) 

[info 2020/01/29 11:45:09.931 PST <ReconnectThread> tid=0x78] Reconnect 
completed.
New DistributedSystem is Connected "server1" (id=ed24822) to distributed system 
using locators "localhost[23456]" logging to cacheserver.log  started at Wed 
Jan 29 11:44:54 PST 2020
New Cache is GemFireCache[id = 198638695; isClosing = false; isShutDownAll = 
false; created = Wed Jan 29 11:44:54 PST 2020; server = false; copyOnRead = 
false; lockLease = 120; lockTimeout = 60]
{noformat}
As soon as the ReconnectThread completes this second reconnect, the Distributed 
system shutdown hook is invoked, and the member shuts down:
{noformat}
[info 2020/01/29 11:45:09.932 PST <Distributed system shutdown hook> tid=0xd] 
VM is exiting - shutting down distributed system

[info 2020/01/29 11:45:09.940 PST <Distributed system shutdown hook> tid=0xd] 
GemFireCache[id = 198638695; isClosing = true; isShutDownAll = false; created = 
Wed Jan 29 11:44:54 PST 2020; server = false; copyOnRead = false; lockLease = 
120; lockTimeout = 60]: Now closing.

[info 2020/01/29 11:45:09.948 PST <Distributed system shutdown hook> tid=0xd] 
Shutting down DistributionManager 192.168.99.1(server1:51825)<ec><v7>:41001. 

[info 2020/01/29 11:45:10.054 PST <Distributed system shutdown hook> tid=0xd] 
Now closing distribution for 192.168.99.1(server1:51825)<ec><v7>:41001

[info 2020/01/29 11:45:10.054 PST <Distributed system shutdown hook> tid=0xd] 
Stopping membership services

[info 2020/01/29 11:45:10.056 PST <Geode Failure Detection Server thread 1> 
tid=0x8a] GMSHealthMonitor server thread exiting

[info 2020/01/29 11:45:10.061 PST <Distributed system shutdown hook> tid=0xd] 
DistributionManager stopped in 113ms.

[info 2020/01/29 11:45:10.061 PST <Distributed system shutdown hook> tid=0xd] 
Marking DistributionManager 192.168.99.1(server1:51825)<ec><v7>:41001 as closed.
{noformat}
That is because the Cache Server Acceptor thread is not restarted during the 
second reconnect. The 'Cache server connection listener bound to address ...' 
log message does not appear in the second reconnect.
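
As a hedged workaround sketch (not a fix for the underlying bug), application code could check the reconnected cache's servers and restart any whose acceptor is not running. The class and method names are illustrative, and if the second reconnect leaves the cache with no CacheServer at all, one would also need to be re-added from the saved configuration, which is omitted here.
{noformat}
import java.io.IOException;
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.server.CacheServer;

// Illustrative workaround: after a reconnect completes, restart any
// CacheServer whose acceptor was not restarted (e.g. port 40401 above).
public final class CacheServerRestarter {

  public static void restartStoppedServers(Cache reconnectedCache) throws IOException {
    for (CacheServer server : reconnectedCache.getCacheServers()) {
      if (!server.isRunning()) {
        server.start(); // re-binds the cache server connection listener
      }
    }
  }
}
{noformat}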

A thread like this exists in the member after the first reconnect, but not 
after the second:
{noformat}
"Cache Server Acceptor 0.0.0.0/0.0.0.0:40402 local port: 40402" #63 prio=5 
os_prio=31 tid=0x00007f9142bb3800 nid=0xac03 runnable [0x000070000a7cf000]
   java.lang.Thread.State: RUNNABLE
        at java.net.PlainSocketImpl.socketAccept(Native Method)
        at 
java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
        at java.net.ServerSocket.implAccept(ServerSocket.java:545)
        at java.net.ServerSocket.accept(ServerSocket.java:513)
        at 
org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.accept(AcceptorImpl.java:1263)
        at 
org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.run(AcceptorImpl.java:1202)
        at java.lang.Thread.run(Thread.java:745)
{noformat}
Here is a stack trace showing the code path that starts the cache server 
acceptor during the first reconnect. This code path is not executed during the 
second reconnect.
{noformat}
[info 2020/01/29 12:34:01.956 PST <ReconnectThread> tid=0x45] Cache server 
connection listener bound to address 192.168.99.1-0.0.0.0/0.0.0.0:40401 with 
backlog 1280.
java.lang.Exception
        at 
org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.<init>(AcceptorImpl.java:605)
        at 
org.apache.geode.internal.cache.tier.sockets.AcceptorBuilder.create(AcceptorBuilder.java:303)
        at 
org.apache.geode.internal.cache.CacheServerImpl.createAcceptor(CacheServerImpl.java:429)
        at 
org.apache.geode.internal.cache.CacheServerImpl.start(CacheServerImpl.java:386)
        at 
org.apache.geode.distributed.internal.InternalDistributedSystem.createAndStartCacheServers(InternalDistributedSystem.java:2750)
        at 
org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2634)
        at 
org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2390)
        at 
org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1242)
        at 
org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2311)
        at 
org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1283)
        at 
org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:2027)
        at java.lang.Thread.run(Thread.java:745)
{noformat}


