[
https://issues.apache.org/jira/browse/GEODE-6904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dick Cavender closed GEODE-6904.
--------------------------------
> Reconnecting locator has many hung threads, causing members to startup
> without cluster configuration
> ----------------------------------------------------------------------------------------------------
>
> Key: GEODE-6904
> URL: https://issues.apache.org/jira/browse/GEODE-6904
> Project: Geode
> Issue Type: Bug
> Components: configuration, membership
> Reporter: Dan Smith
> Priority: Major
> Fix For: 1.10.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> With the following steps, a locator can get into a state where it is stuck in
> the middle of reconnecting. It allows members to join the system, but they
> time out sending it startup messages and start up without cluster
> configuration, resulting in not being able to restart the cluster.
> # Start 2 locators and some number of servers
> # Kill one locator and trigger a force disconnect in the remaining locators
> and servers at the same time
> # Have one of the members take a little bit of time before reconnecting, to
> let the locator get to recovering the _ConfigurationRegion before that
> remaining member joins.
> When this happens, the remaining locator hangs while trying to reconnect the
> system, waiting in the initialization of _ConfigurationRegion for persistent
> data from the missing locator.
> {noformat}
> "Location services restart thread" #98 daemon prio=5 os_prio=31 tid=0x00007fa382944800 nid=0x9a07 in Object.wait() [0x0000700008943000]
> java.lang.Thread.State: TIMED_WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at org.apache.geode.internal.cache.persistence.MembershipChangeListener.waitForChange(MembershipChangeListener.java:62)
> - locked <0x00000007be285800> (a org.apache.geode.internal.cache.persistence.MembershipChangeListener)
> at org.apache.geode.internal.cache.persistence.PersistenceInitialImageAdvisor.waitForMembershipChangeForMissingDiskStores(PersistenceInitialImageAdvisor.java:218)
> at org.apache.geode.internal.cache.persistence.PersistenceInitialImageAdvisor.getAdvice(PersistenceInitialImageAdvisor.java:118)
> at org.apache.geode.internal.cache.persistence.PersistenceAdvisorImpl.getInitialImageAdvice(PersistenceAdvisorImpl.java:835)
> at org.apache.geode.internal.cache.persistence.CreatePersistentRegionProcessor.getInitialImageAdvice(CreatePersistentRegionProcessor.java:52)
> at org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1189)
> at org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1074)
> at org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3002)
> at org.apache.geode.distributed.internal.InternalConfigurationPersistenceService.getConfigurationRegion(InternalConfigurationPersistenceService.java:840)
> at org.apache.geode.distributed.internal.InternalConfigurationPersistenceService.initSharedConfiguration(InternalConfigurationPersistenceService.java:487)
> at org.apache.geode.distributed.internal.InternalLocator.startConfigurationPersistenceService(InternalLocator.java:1465)
> at org.apache.geode.distributed.internal.InternalLocator.startClusterManagementService(InternalLocator.java:687)
> at org.apache.geode.distributed.internal.InternalLocator.restartWithDS(InternalLocator.java:1126)
> - locked <0x00000007a0c313b8> (a java.lang.Object)
> at org.apache.geode.distributed.internal.InternalLocator.attemptReconnect(InternalLocator.java:1065)
> at org.apache.geode.distributed.internal.InternalLocator.lambda$launchRestartThread$1(InternalLocator.java:986)
> at org.apache.geode.distributed.internal.InternalLocator$$Lambda$195/681333823.run(Unknown Source)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The above thread holds a static lock, which causes many of the messages that
> get sent to the locator to hang.
> One of these messages is a StartupMessage. If that message hangs, the member
> that sent the message will time out after 15 seconds and then start up without
> cluster configuration.
> {noformat}
> [vm4] [warn 2019/06/24 16:26:58.742 PDT <RMI TCP Connection(8)-10.118.20.154> tid=0x15] Membership: startup timed out after waiting 15000 milliseconds for responses from [10.118.20.154(locator-1:66321:locator)<ec><v0>:41002]
> ---This message is logged because by not waiting for the reply from the startup message, we do not discover that the locator has cluster configuration.
> [vm4] [info 2019/06/24 16:26:58.827 PDT <RMI TCP Connection(8)-10.118.20.154> tid=0x15] No locator(s) found with cluster configuration service
> {noformat}
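
The failure mode described in the issue (a restart thread that keeps holding a
lock while it waits, a startup-message handler that needs the same lock, and a
sender that gives up after a timeout) can be illustrated with a minimal,
self-contained Java sketch. This is not Geode code. The class
StartupHangSketch, the lock RESTART_LOCK, and the method memberReceivesReply
are hypothetical names introduced only for illustration, and the timeout is
shortened from the 15 seconds mentioned above so the example finishes quickly.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch only; none of these names exist in Geode.
final class StartupHangSketch {
  // Hypothetical stand-in for the static lock held by the restart thread.
  private static final Object RESTART_LOCK = new Object();

  /**
   * Simulates a member sending a startup message to a locator.
   * Returns true if a reply arrives within timeoutMillis, false if the
   * member times out (and would start without cluster configuration).
   */
  static boolean memberReceivesReply(boolean restartStuck, long timeoutMillis)
      throws InterruptedException {
    CountDownLatch lockHeld = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);

    // "Restart thread": takes the lock and, if stuck, keeps holding it while
    // it waits (standing in for the wait on the missing locator's data).
    Thread restart = new Thread(() -> {
      synchronized (RESTART_LOCK) {
        lockHeld.countDown();
        if (restartStuck) {
          try {
            release.await();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }
      }
    }, "restart-thread-sketch");
    restart.setDaemon(true);
    restart.start();
    lockHeld.await(); // make sure the restart thread owns the lock first

    // "Message handler": needs the same lock before it can reply.
    CompletableFuture<String> reply = new CompletableFuture<>();
    Thread handler = new Thread(() -> {
      synchronized (RESTART_LOCK) {
        reply.complete("cluster configuration available");
      }
    }, "startup-handler-sketch");
    handler.setDaemon(true);
    handler.start();

    try {
      // The sending member waits a bounded time for the reply.
      reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      return false; // member proceeds without learning about the configuration
    } catch (ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      release.countDown(); // unblock the simulated restart thread
      restart.join();
      handler.join();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("stuck restart, reply received: "
        + memberReceivesReply(true, 200));
    System.out.println("normal restart, reply received: "
        + memberReceivesReply(false, 5000));
  }
}
```

In the sketch, the sender's timeout protects it from hanging forever, but it
cannot tell "no configuration exists" apart from "the reply was blocked", which
is the same ambiguity that leads a member to start without cluster
configuration in the scenario above.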