[ 
https://issues.apache.org/jira/browse/GEODE-6904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882072#comment-16882072
 ] 

ASF subversion and git services commented on GEODE-6904:
--------------------------------------------------------

Commit 00ed2f3ca9025beb200a0851c530d041575bbd20 in geode's branch 
refs/heads/develop from Ernie Burghardt
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=00ed2f3 ]

GEODE-6904: Fix hang on locator when restarting due to lock (#3777)

Only hold the locatorLock while reading the locator variable in
restartWithDS. Holding this lock while recovering cluster configuration
was causing other messages to the locator, including the StartupMessage,
to block waiting for this lock.
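
A minimal sketch of the lock-scope change described above, not the actual Geode code. The class, the method names, and the 500 ms probe timeout are hypothetical; only the name locatorLock and the idea of holding it just for the read come from the commit message. It contrasts running a slow recovery step inside the lock with running it after the lock is released, and checks whether a second thread that needs the same lock can proceed in the meantime.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative stand-in only: these names are hypothetical, not Geode's code.
class LockScopeSketch {
  private static final Object locatorLock = new Object();
  private static Object locator = new Object();

  // Behaviour described as the problem: slow recovery runs while the lock is held.
  static void restartHoldingLock(Runnable slowRecovery) {
    synchronized (locatorLock) {
      if (locator != null) {
        slowRecovery.run();
      }
    }
  }

  // Behaviour described as the fix: the lock only guards the read of the variable.
  static void restartWithNarrowLock(Runnable slowRecovery) {
    Object current;
    synchronized (locatorLock) {
      current = locator;
    }
    if (current != null) {
      slowRecovery.run();
    }
  }

  // Returns true if a "message handler" thread could take the lock
  // while the recovery step was still in progress.
  static boolean messageHandledDuringRecovery(boolean narrowLock) {
    CountDownLatch recoveryStarted = new CountDownLatch(1);
    CountDownLatch finishRecovery = new CountDownLatch(1);
    CountDownLatch messageHandled = new CountDownLatch(1);

    Runnable slowRecovery = () -> {
      recoveryStarted.countDown();
      awaitQuietly(finishRecovery); // simulates recovery that does not finish on its own
    };
    Thread restart = new Thread(() -> {
      if (narrowLock) {
        restartWithNarrowLock(slowRecovery);
      } else {
        restartHoldingLock(slowRecovery);
      }
    });
    Thread handler = new Thread(() -> {
      synchronized (locatorLock) {
        messageHandled.countDown();
      }
    });

    try {
      restart.start();
      recoveryStarted.await();
      handler.start();
      boolean handled = messageHandled.await(500, TimeUnit.MILLISECONDS);
      finishRecovery.countDown(); // let recovery end so both threads can exit
      restart.join();
      handler.join();
      return handled;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  private static void awaitQuietly(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    System.out.println("lock held during recovery, handled: "
        + messageHandledDuringRecovery(false));
    System.out.println("lock held only for the read, handled: "
        + messageHandledDuringRecovery(true));
  }
}
```

In this sketch the handler thread stands in for any incoming message, such as a StartupMessage, that needs the same lock.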

In addition, wait forever for StartupMessage responses.
StartupOperation had a 15 second timeout, after which it would stop
waiting. That resulted in not having correct metadata about other
members in the system. We were also not waking up the thread in
waitForReplies when a StartupResponseMessage came in with a
rejectionMessage.
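
The waiting behaviour described above can be sketched roughly as follows. This is an assumption-laden illustration, not Geode's StartupOperation or reply processor: the class ReplyWaiterSketch, its methods, the member names, and the choice to return the rejection text to the caller are all hypothetical. It shows a waiter that blocks with no timeout until every member has replied, and that is also woken when a reply carries a rejection message.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative stand-in only: these names are hypothetical, not Geode's code.
class ReplyWaiterSketch {
  private final Set<String> pending = new HashSet<>();
  private String rejection; // guarded by this

  ReplyWaiterSketch(Set<String> members) {
    pending.addAll(members);
  }

  // Called when a reply arrives; rejectionMessage is null for a normal reply.
  synchronized void onReply(String member, String rejectionMessage) {
    pending.remove(member);
    if (rejectionMessage != null) {
      rejection = rejectionMessage;
    }
    notifyAll(); // wake the waiter on every reply, including rejections
  }

  // Blocks with no timeout until all members replied or a rejection arrived.
  // Returns the rejection message, or null if every member replied normally.
  synchronized String waitForReplies() throws InterruptedException {
    while (!pending.isEmpty() && rejection == null) {
      wait();
    }
    return rejection;
  }

  // Small driver: a responder thread delivers the replies.
  static String demo(boolean reject) {
    ReplyWaiterSketch waiter =
        new ReplyWaiterSketch(new HashSet<>(Arrays.asList("locator-1", "server-1")));
    Thread responder = new Thread(() -> {
      waiter.onReply("locator-1", reject ? "rejected" : null);
      if (!reject) {
        waiter.onReply("server-1", null);
      }
    });
    responder.start();
    try {
      String result = waiter.waitForReplies();
      responder.join();
      return result;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("all replied: " + demo(false));
    System.out.println("one rejected: " + demo(true));
  }
}
```

Without the notifyAll() call on the rejection path, the waiter in this sketch would stay blocked once the timeout is removed, which is the kind of missed wake-up the commit message mentions.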

Co-Authored-By: Dan Smith <[email protected]>


> Reconnecting locator has many hung threads, causing members to startup 
> without cluster configuration
> ----------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-6904
>                 URL: https://issues.apache.org/jira/browse/GEODE-6904
>             Project: Geode
>          Issue Type: Bug
>          Components: configuration, membership
>            Reporter: Dan Smith
>            Priority: Major
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> With the following steps, a locator can get into a state where it is stuck in 
> the middle of reconnecting. It allows members to join the system, but they 
> time out sending it startup messages and start up without cluster 
> configuration, which makes it impossible to restart the cluster.
> # Start 2 locators and some number of servers
> # Kill one locator and trigger a force disconnect in the remaining locator 
> and servers at the same time
> # Have one of the members take a little bit of time before reconnecting, to 
> let the locator get to recovering the _ConfigurationRegion before that 
> remaining member joins.
>  When this happens, the remaining locator gets hung trying to reconnect the 
> system, waiting in initialization of _ConfigurationRegion for persistent data 
> from the missing locator.
> {noformat}
> "Location services restart thread" #98 daemon prio=5 os_prio=31 tid=0x00007fa382944800 nid=0x9a07 in Object.wait() [0x0000700008943000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>       at java.lang.Object.wait(Native Method)
>       at org.apache.geode.internal.cache.persistence.MembershipChangeListener.waitForChange(MembershipChangeListener.java:62)
>       - locked <0x00000007be285800> (a org.apache.geode.internal.cache.persistence.MembershipChangeListener)
>       at org.apache.geode.internal.cache.persistence.PersistenceInitialImageAdvisor.waitForMembershipChangeForMissingDiskStores(PersistenceInitialImageAdvisor.java:218)
>       at org.apache.geode.internal.cache.persistence.PersistenceInitialImageAdvisor.getAdvice(PersistenceInitialImageAdvisor.java:118)
>       at org.apache.geode.internal.cache.persistence.PersistenceAdvisorImpl.getInitialImageAdvice(PersistenceAdvisorImpl.java:835)
>       at org.apache.geode.internal.cache.persistence.CreatePersistentRegionProcessor.getInitialImageAdvice(CreatePersistentRegionProcessor.java:52)
>       at org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1189)
>       at org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1074)
>       at org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3002)
>       at org.apache.geode.distributed.internal.InternalConfigurationPersistenceService.getConfigurationRegion(InternalConfigurationPersistenceService.java:840)
>       at org.apache.geode.distributed.internal.InternalConfigurationPersistenceService.initSharedConfiguration(InternalConfigurationPersistenceService.java:487)
>       at org.apache.geode.distributed.internal.InternalLocator.startConfigurationPersistenceService(InternalLocator.java:1465)
>       at org.apache.geode.distributed.internal.InternalLocator.startClusterManagementService(InternalLocator.java:687)
>       at org.apache.geode.distributed.internal.InternalLocator.restartWithDS(InternalLocator.java:1126)
>       - locked <0x00000007a0c313b8> (a java.lang.Object)
>       at org.apache.geode.distributed.internal.InternalLocator.attemptReconnect(InternalLocator.java:1065)
>       at org.apache.geode.distributed.internal.InternalLocator.lambda$launchRestartThread$1(InternalLocator.java:986)
>       at org.apache.geode.distributed.internal.InternalLocator$$Lambda$195/681333823.run(Unknown Source)
>       at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The above thread holds a static lock, which causes many of the messages that 
> get sent to the locator to hang.
> One of these messages is a StartupMessage. If that message hangs, the member 
> that sent the message will time out after 15 seconds and then start up without 
> cluster configuration.
> {noformat}
> [vm4] [warn 2019/06/24 16:26:58.742 PDT <RMI TCP Connection(8)-10.118.20.154> tid=0x15] Membership: startup timed out after waiting 15000 milliseconds for responses from [10.118.20.154(locator-1:66321:locator)<ec><v0>:41002]
> ---This message is logged because by not waiting for the reply from the startup message, we do not discover that the locator has cluster configuration.
> [vm4] [info 2019/06/24 16:26:58.827 PDT <RMI TCP Connection(8)-10.118.20.154> tid=0x15] No locator(s) found with cluster configuration service
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)