[jira] [Closed] (GEODE-6904) Reconnecting locator has many hung threads, causing members to startup without cluster configuration

Dick Cavender (Jira) Thu, 26 Sep 2019 11:06:31 -0700


     [ 
https://issues.apache.org/jira/browse/GEODE-6904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Dick Cavender closed GEODE-6904.
--------------------------------

> Reconnecting locator has many hung threads, causing members to startup 
> without cluster configuration
> ----------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-6904
>                 URL: https://issues.apache.org/jira/browse/GEODE-6904
>             Project: Geode
>          Issue Type: Bug
>          Components: configuration, membership
>            Reporter: Dan Smith
>            Priority: Major
>             Fix For: 1.10.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> With the following steps, a locator can get into a state where it is stuck in 
> the middle of reconnecting. It allows members to join the system, but they 
> time out sending it startup messages and start up without cluster 
> configuration, resulting in not being able to restart the cluster.
> # Start 2 locators and some number of servers
> # Kill one locator and trigger a force disconnect in the remaining locators 
> and servers at the same time
> # Have one of the members take a little bit of time before reconnecting, to 
> let the locator get to recovering the _ConfigurationRegion before that 
> remaining member joins.
>  When this happens, the remaining locator gets hung trying to reconnect the 
> system, waiting in initialization of _ConfigurationRegion for persistent data 
> from the missing locator.
> {noformat}
> "Location services restart thread" #98 daemon prio=5 os_prio=31 
> tid=0x00007fa382944800 nid=0x9a07 in Object.wait() [0x0000700008943000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>       at java.lang.Object.wait(Native Method)
>       at 
> org.apache.geode.internal.cache.persistence.MembershipChangeListener.waitForChange(MembershipChangeListener.java:62)
>       - locked <0x00000007be285800> (a 
> org.apache.geode.internal.cache.persistence.MembershipChangeListener)
>       at 
> org.apache.geode.internal.cache.persistence.PersistenceInitialImageAdvisor.waitForMembershipChangeForMissingDiskStores(PersistenceInitialImageAdvisor.java:218)
>       at 
> org.apache.geode.internal.cache.persistence.PersistenceInitialImageAdvisor.getAdvice(PersistenceInitialImageAdvisor.java:118)
>       at 
> org.apache.geode.internal.cache.persistence.PersistenceAdvisorImpl.getInitialImageAdvice(PersistenceAdvisorImpl.java:835)
>       at 
> org.apache.geode.internal.cache.persistence.CreatePersistentRegionProcessor.getInitialImageAdvice(CreatePersistentRegionProcessor.java:52)
>       at 
> org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1189)
>       at 
> org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1074)
>       at 
> org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3002)
>       at 
> org.apache.geode.distributed.internal.InternalConfigurationPersistenceService.getConfigurationRegion(InternalConfigurationPersistenceService.java:840)
>       at 
> org.apache.geode.distributed.internal.InternalConfigurationPersistenceService.initSharedConfiguration(InternalConfigurationPersistenceService.java:487)
>       at 
> org.apache.geode.distributed.internal.InternalLocator.startConfigurationPersistenceService(InternalLocator.java:1465)
>       at 
> org.apache.geode.distributed.internal.InternalLocator.startClusterManagementService(InternalLocator.java:687)
>       at 
> org.apache.geode.distributed.internal.InternalLocator.restartWithDS(InternalLocator.java:1126)
>       - locked <0x00000007a0c313b8> (a java.lang.Object)
>       at 
> org.apache.geode.distributed.internal.InternalLocator.attemptReconnect(InternalLocator.java:1065)
>       at 
> org.apache.geode.distributed.internal.InternalLocator.lambda$launchRestartThread$1(InternalLocator.java:986)
>       at 
> org.apache.geode.distributed.internal.InternalLocator$$Lambda$195/681333823.run(Unknown Source)
>       at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The above thread holds a static lock, which causes many of the messages that 
> get sent to the locator to hang.
> One of these messages is a StartupMessage. If that message hangs, the member 
> that sent the message will time out after 15 seconds and then start up without 
> cluster configuration.
> {noformat}
> [vm4] [warn 2019/06/24 16:26:58.742 PDT <RMI TCP Connection(8)-10.118.20.154> 
> tid=0x15] Membership: startup timed out after waiting 15000 milliseconds for 
> responses from [10.118.20.154(locator-1:66321:locator)<ec><v0>:41002]
> ---This message is logged because, by not waiting for the reply to the 
> startup message, we do not discover that the locator has cluster 
> configuration.
> [vm4] [info 2019/06/24 16:26:58.827 PDT <RMI TCP Connection(8)-10.118.20.154> 
> tid=0x15] No locator(s) found with cluster configuration service
> {noformat}
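
The hang described above can be illustrated with a minimal, self-contained Java sketch. It is not Geode code: the class name, the lock, and the 200 ms timeout are all hypothetical stand-ins. One thread takes a shared static lock and then waits for an event that does not arrive (as the restart thread does while waiting for the missing locator), a second thread needs the same lock to answer a startup request, and the caller gives up after a timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class HungLocatorSketch {
    // Hypothetical stand-in for the static lock held by the restart thread.
    private static final Object RESTART_LOCK = new Object();

    /** Returns true if the simulated startup reply arrives within timeoutMillis. */
    static boolean startupReplyArrives(long timeoutMillis) {
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch missingMemberBack = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // "Restart thread": takes the lock, then waits for a membership
            // change that does not happen while the caller is waiting.
            pool.submit(() -> {
                synchronized (RESTART_LOCK) {
                    lockHeld.countDown();
                    try {
                        missingMemberBack.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            lockHeld.await();

            // "Message handler": needs the same lock before it can reply.
            Future<String> reply = pool.submit(() -> {
                synchronized (RESTART_LOCK) {
                    return "locator has cluster configuration";
                }
            });

            // "Joining member": waits a bounded time for the reply.
            reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // the member would now start without cluster configuration
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            missingMemberBack.countDown(); // let the simulated restart thread finish
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("reply arrived: " + startupReplyArrives(200));
    }
}
```

In this sketch the reply can never arrive before the timeout, because the lock is released only after the caller has stopped waiting. That mirrors the sequence in the log excerpt: the startup wait times out first, and only then does the member conclude that no locator offers cluster configuration.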



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
