Barrett Oglesby created GEODE-7750:
--------------------------------------
Summary: If a member is force disconnected from the distributed
system repeatedly, it reconnects each time, but after the second force
disconnect, it shuts down after reconnecting
Key: GEODE-7750
URL: https://issues.apache.org/jira/browse/GEODE-7750
Project: Geode
Issue Type: Bug
Components: membership
Reporter: Barrett Oglesby
If I start a server using gfsh and repeatedly invoke the
MembershipManagerHelper.crashDistributedSystem method to force it to disconnect
from the distributed system, the server reconnects each time, but after the
second force disconnect, it shuts down immediately after reconnecting.
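For reference, here is a minimal sketch of the kind of function used to force
the disconnect (named MembershipCycleHealthFunction in the stack trace below).
The class body is an assumption reconstructed from that stack trace, not the
actual test code:
{noformat}
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.execute.Function;
import org.apache.geode.cache.execute.FunctionContext;
import org.apache.geode.distributed.DistributedSystem;
import org.apache.geode.distributed.internal.membership.api.MembershipManagerHelper;

// Hypothetical reconstruction of the function invoked in this test; the real
// MembershipCycleHealthFunction may differ.
public class MembershipCycleHealthFunction implements Function<Object> {

  @Override
  public void execute(FunctionContext<Object> context) {
    // Force this member out of the distributed system; auto-reconnect is
    // expected to bring it back into the cluster.
    DistributedSystem system = CacheFactory.getAnyInstance().getDistributedSystem();
    MembershipManagerHelper.crashDistributedSystem(system);
  }

  @Override
  public boolean hasResult() {
    return false;
  }

  @Override
  public String getId() {
    return "MembershipCycleHealthFunction";
  }
}
{noformat}
The function is executed repeatedly on the server (e.g. via gfsh's execute
function command targeted at server1) to cycle it through force disconnect and
reconnect.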
Here is truncated logging for the first disconnect:
{noformat}
[info 2020/01/29 11:41:07.613 PST <Function Execution Processor3> tid=0x44]
crashing distributed system: Connected "server1" (id=76d0ecd7) to distributed
system using locators "localhost[23456]" logging to cacheserver.log started at
Wed Jan 29 11:32:37 PST 2020
[info 2020/01/29 11:41:07.613 PST <Function Execution Processor3> tid=0x44]
GroupMembershipService.beSick invoked for
192.168.99.1(server1:51825)<ec><v3>:41001 - simulating sickness
[info 2020/01/29 11:41:07.613 PST <Function Execution Processor3> tid=0x44]
GroupMembershipService.playDead invoked for
192.168.99.1(server1:51825)<ec><v3>:41001
[fatal 2020/01/29 11:41:07.613 PST <Function Execution Processor3> tid=0x44]
Membership service failure: for testing
org.apache.geode.distributed.internal.membership.api.MemberDisconnectedException:
for testing
at
org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.forceDisconnect(GMSMembership.java:2004)
at
org.apache.geode.distributed.internal.membership.api.MembershipManagerHelper.crashDistributedSystem(MembershipManagerHelper.java:66)
at
MembershipCycleHealthFunction.execute(MembershipCycleHealthFunction.java:20)
at
org.apache.geode.internal.cache.execute.AbstractExecution.executeFunctionLocally(AbstractExecution.java:328)
at
org.apache.geode.internal.cache.execute.AbstractExecution.lambda$executeFunctionOnLocalNode$1(AbstractExecution.java:299)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at
org.apache.geode.distributed.internal.ClusterOperationExecutors.runUntilShutdown(ClusterOperationExecutors.java:449)
at
org.apache.geode.distributed.internal.ClusterOperationExecutors.doFunctionExecutionThread(ClusterOperationExecutors.java:379)
at
org.apache.geode.logging.internal.executors.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:119)
at java.lang.Thread.run(Thread.java:745)
[info 2020/01/29 11:41:07.616 PST <Function Execution Processor3> tid=0x44]
CacheServer configuration saved
[info 2020/01/29 11:41:07.626 PST <Geode Failure Detection Server thread 1>
tid=0x1f] GMSHealthMonitor server thread exiting
[info 2020/01/29 11:41:07.626 PST <DisconnectThread> tid=0x47] Stopping
membership services
[info 2020/01/29 11:41:07.630 PST <ReconnectThread> tid=0x47] Disconnecting old
DistributedSystem to prepare for a reconnect attempt
[info 2020/01/29 11:41:07.632 PST <ReconnectThread> tid=0x47] GemFireCache[id =
101500074; isClosing = true; isShutDownAll = false; created = Wed Jan 29
11:32:38 PST 2020; server = true; copyOnRead = false; lockLease = 120;
lockTimeout = 60]: Now closing.
[info 2020/01/29 11:41:07.633 PST <ReconnectThread> tid=0x47] Cache server on
port 40401 is shutting down.
[info 2020/01/29 11:41:07.657 PST <ReconnectThread> tid=0x47] Shutting down
DistributionManager 192.168.99.1(server1:51825)<ec><v3>:41001. At least one
Exception occurred.
[info 2020/01/29 11:41:07.759 PST <ReconnectThread> tid=0x47] Now closing
distribution for 192.168.99.1(server1:51825)<ec><v3>:41001
[info 2020/01/29 11:41:07.760 PST <ReconnectThread> tid=0x47]
DistributionManager stopped in 103ms.
[info 2020/01/29 11:41:07.760 PST <ReconnectThread> tid=0x47] Marking
DistributionManager 192.168.99.1(server1:51825)<ec><v3>:41001 as closed.
{noformat}
Here is truncated logging for the first reconnect:
{noformat}
[info 2020/01/29 11:42:07.755 PST <ReconnectThread> tid=0x47] Attempting to
reconnect to the distributed system. This is attempt #1.
[info 2020/01/29 11:42:07.775 PST <ReconnectThread> tid=0x47] Starting
membership services
[info 2020/01/29 11:42:07.778 PST <ReconnectThread> tid=0x47] Established local
address 192.168.99.1(server1:51825):41001
[info 2020/01/29 11:42:07.779 PST <ReconnectThread> tid=0x47] GemFire P2P
Listener started on /192.168.99.1:50431
[info 2020/01/29 11:42:07.793 PST <ReconnectThread> tid=0x47] Attempting to
join the distributed system through coordinator
192.168.99.1(locator:51077:locator)<ec><v0>:41000 using address
192.168.99.1(server1:51825)<ec>:41001
[info 2020/01/29 11:42:08.104 PST <unicast receiver,boglesbymac-63966>
tid=0x1b] received new view:
View[192.168.99.1(locator:51077:locator)<ec><v0>:41000|5] members:
[192.168.99.1(locator:51077:locator)<ec><v0>:41000,
192.168.99.1(server2:51826)<ec><v3>:41002{lead},
192.168.99.1(server1:51825)<ec><v5>:41001]
old view is: null
[info 2020/01/29 11:42:08.105 PST <ReconnectThread> tid=0x47] Starting
DistributionManager 192.168.99.1(server1:51825)<ec><v5>:41001. (took 331 ms)
[info 2020/01/29 11:42:08.105 PST <ReconnectThread> tid=0x47] Initial
(distribution manager) view,
View[192.168.99.1(locator:51077:locator)<ec><v0>:41000|5] members:
[192.168.99.1(locator:51077:locator)<ec><v0>:41000,
192.168.99.1(server2:51826)<ec><v3>:41002{lead},
192.168.99.1(server1:51825)<ec><v5>:41001]
[info 2020/01/29 11:42:08.111 PST <P2P message reader for
192.168.99.1(locator:51077:locator)<ec><v0>:41000 shared unordered uid=4
port=64693> tid=0x5c] Member 192.168.99.1(locator:51077:locator)<ec><v0>:41000
is equivalent or in the same redundancy zone.
[info 2020/01/29 11:42:08.113 PST <P2P message reader for
192.168.99.1(server2:51826)<ec><v3>:41002 shared unordered uid=5 port=64696>
tid=0x61] Member 192.168.99.1(server2:51826)<ec><v3>:41002 is equivalent or in
the same redundancy zone.
[info 2020/01/29 11:42:08.113 PST <ReconnectThread> tid=0x47]
DistributionManager 192.168.99.1(server1:51825)<ec><v5>:41001 started on
localhost[23456]. There were 2 other DMs. others:
[192.168.99.1(locator:51077:locator)<ec><v0>:41000,
192.168.99.1(server2:51826)<ec><v3>:41002] (took 343 ms)
[info 2020/01/29 11:42:08.392 PST <ReconnectThread> tid=0x47] Cache server
connection listener bound to address 192.168.99.1-0.0.0.0/0.0.0.0:40401 with
backlog 1280.
[info 2020/01/29 11:42:08.395 PST <ReconnectThread> tid=0x47] CacheServer
Configuration: port=40401 max-connections=800 max-threads=0
notify-by-subscription=true socket-buffer-size=32768
maximum-time-between-pings=60000 maximum-message-count=230000
message-time-to-live=180 eviction-policy=none capacity=1 overflow directory=.
groups=[] loadProbe=ConnectionCountProbe loadPollInterval=5000 tcpNoDelay=true
[info 2020/01/29 11:42:23.399 PST <ReconnectThread> tid=0x47] Reconnect
completed.
New DistributedSystem is Connected "server1" (id=17253324) to distributed
system using locators "localhost[23456]" logging to cacheserver.log started at
Wed Jan 29 11:42:07 PST 2020
New Cache is GemFireCache[id = 1930372133; isClosing = false; isShutDownAll =
false; created = Wed Jan 29 11:42:08 PST 2020; server = false; copyOnRead =
false; lockLease = 120; lockTimeout = 60]
{noformat}
The second disconnect looks similar to the first.
The second reconnect also looks similar to the first, except for one key
difference.
{noformat}
[info 2020/01/29 11:44:54.243 PST <ReconnectThread> tid=0x78] Attempting to
reconnect to the distributed system. This is attempt #1.
[info 2020/01/29 11:44:54.259 PST <ReconnectThread> tid=0x78] Starting
membership services
[info 2020/01/29 11:44:54.260 PST <ReconnectThread> tid=0x78] Established local
address 192.168.99.1(server1:51825):41001
[info 2020/01/29 11:44:54.261 PST <ReconnectThread> tid=0x78] GemFire P2P
Listener started on /192.168.99.1:57338
[info 2020/01/29 11:44:54.273 PST <ReconnectThread> tid=0x78] Attempting to
join the distributed system through coordinator
192.168.99.1(locator:51077:locator)<ec><v0>:41000 using address
192.168.99.1(server1:51825)<ec>:41001
[info 2020/01/29 11:44:54.580 PST <unicast receiver,boglesbymac-63966>
tid=0x1b] received new view:
View[192.168.99.1(locator:51077:locator)<ec><v0>:41000|7] members:
[192.168.99.1(locator:51077:locator)<ec><v0>:41000,
192.168.99.1(server2:51826)<ec><v3>:41002{lead},
192.168.99.1(server1:51825)<ec><v7>:41001]
old view is: null
[info 2020/01/29 11:44:54.581 PST <ReconnectThread> tid=0x78] Starting
DistributionManager 192.168.99.1(server1:51825)<ec><v7>:41001. (took 324 ms)
[info 2020/01/29 11:44:54.581 PST <ReconnectThread> tid=0x78] Initial
(distribution manager) view,
View[192.168.99.1(locator:51077:locator)<ec><v0>:41000|7] members:
[192.168.99.1(locator:51077:locator)<ec><v0>:41000,
192.168.99.1(server2:51826)<ec><v3>:41002{lead},
192.168.99.1(server1:51825)<ec><v7>:41001]
[info 2020/01/29 11:44:54.690 PST <P2P message reader for
192.168.99.1(locator:51077:locator)<ec><v0>:41000 shared unordered uid=5
port=65360> tid=0x8d] Member 192.168.99.1(locator:51077:locator)<ec><v0>:41000
is equivalent or in the same redundancy zone.
[info 2020/01/29 11:44:54.691 PST <P2P message reader for
192.168.99.1(server2:51826)<ec><v3>:41002 shared unordered uid=6 port=65363>
tid=0x92] Member 192.168.99.1(server2:51826)<ec><v3>:41002 is equivalent or in
the same redundancy zone.
[info 2020/01/29 11:44:54.692 PST <ReconnectThread> tid=0x78]
DistributionManager 192.168.99.1(server1:51825)<ec><v7>:41001 started on
localhost[23456]. There were 2 other DMs. others:
[192.168.99.1(locator:51077:locator)<ec><v0>:41000,
192.168.99.1(server2:51826)<ec><v3>:41002] (took 439 ms)
[info 2020/01/29 11:45:09.931 PST <ReconnectThread> tid=0x78] Reconnect
completed.
New DistributedSystem is Connected "server1" (id=ed24822) to distributed system
using locators "localhost[23456]" logging to cacheserver.log started at Wed
Jan 29 11:44:54 PST 2020
New Cache is GemFireCache[id = 198638695; isClosing = false; isShutDownAll =
false; created = Wed Jan 29 11:44:54 PST 2020; server = false; copyOnRead =
false; lockLease = 120; lockTimeout = 60]
{noformat}
As soon as the ReconnectThread completes, the Distributed system shutdown hook
is invoked, and the member shuts down:
{noformat}
[info 2020/01/29 11:45:09.932 PST <Distributed system shutdown hook> tid=0xd]
VM is exiting - shutting down distributed system
[info 2020/01/29 11:45:09.940 PST <Distributed system shutdown hook> tid=0xd]
GemFireCache[id = 198638695; isClosing = true; isShutDownAll = false; created =
Wed Jan 29 11:44:54 PST 2020; server = false; copyOnRead = false; lockLease =
120; lockTimeout = 60]: Now closing.
[info 2020/01/29 11:45:09.948 PST <Distributed system shutdown hook> tid=0xd]
Shutting down DistributionManager 192.168.99.1(server1:51825)<ec><v7>:41001.
[info 2020/01/29 11:45:10.054 PST <Distributed system shutdown hook> tid=0xd]
Now closing distribution for 192.168.99.1(server1:51825)<ec><v7>:41001
[info 2020/01/29 11:45:10.054 PST <Distributed system shutdown hook> tid=0xd]
Stopping membership services
[info 2020/01/29 11:45:10.056 PST <Geode Failure Detection Server thread 1>
tid=0x8a] GMSHealthMonitor server thread exiting
[info 2020/01/29 11:45:10.061 PST <Distributed system shutdown hook> tid=0xd]
DistributionManager stopped in 113ms.
[info 2020/01/29 11:45:10.061 PST <Distributed system shutdown hook> tid=0xd]
Marking DistributionManager 192.168.99.1(server1:51825)<ec><v7>:41001 as closed.
{noformat}
That is because the Cache Server Acceptor thread is not restarted during the
second reconnect. The 'Cache server connection listener bound to
address ...' log message does not appear during the second reconnect.
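One quick way to see this is to dump the cache server state on the member
after each reconnect. A minimal sketch of such a diagnostic (the class name is
hypothetical and this is not part of the test), assuming the reconnected cache
exposes its cache servers through getCacheServers():
{noformat}
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.server.CacheServer;

// Hypothetical diagnostic: dump the state of each CacheServer on this member.
// After the first reconnect a running CacheServer should be listed; after the
// second reconnect no running CacheServer is expected, because the acceptor
// was never recreated.
public class CacheServerStateDump {
  public static void dump() {
    Cache cache = CacheFactory.getAnyInstance();
    for (CacheServer server : cache.getCacheServers()) {
      System.out.println("CacheServer port=" + server.getPort()
          + " running=" + server.isRunning());
    }
  }
}
{noformat}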
A thread like this exists in the member after the first reconnect, but not the
second:
{noformat}
"Cache Server Acceptor 0.0.0.0/0.0.0.0:40402 local port: 40402" #63 prio=5
os_prio=31 tid=0x00007f9142bb3800 nid=0xac03 runnable [0x000070000a7cf000]
java.lang.Thread.State: RUNNABLE
at java.net.PlainSocketImpl.socketAccept(Native Method)
at
java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at
org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.accept(AcceptorImpl.java:1263)
at
org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.run(AcceptorImpl.java:1202)
at java.lang.Thread.run(Thread.java:745)
{noformat}
Here is a stack trace showing the code path that creates the AcceptorImpl
during the first reconnect. This code path is not executed during the second
reconnect.
{noformat}
[info 2020/01/29 12:34:01.956 PST <ReconnectThread> tid=0x45] Cache server
connection listener bound to address 192.168.99.1-0.0.0.0/0.0.0.0:40401 with
backlog 1280.
java.lang.Exception
at
org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.<init>(AcceptorImpl.java:605)
at
org.apache.geode.internal.cache.tier.sockets.AcceptorBuilder.create(AcceptorBuilder.java:303)
at
org.apache.geode.internal.cache.CacheServerImpl.createAcceptor(CacheServerImpl.java:429)
at
org.apache.geode.internal.cache.CacheServerImpl.start(CacheServerImpl.java:386)
at
org.apache.geode.distributed.internal.InternalDistributedSystem.createAndStartCacheServers(InternalDistributedSystem.java:2750)
at
org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2634)
at
org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2390)
at
org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1242)
at
org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2311)
at
org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1283)
at
org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:2027)
at java.lang.Thread.run(Thread.java:745)
{noformat}