[
https://issues.apache.org/jira/browse/GEODE-6823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16853381#comment-16853381
]
Bruce Schuchardt commented on GEODE-6823:
-----------------------------------------
This seems to have been introduced during refactoring of Elder management, so
it's present in 1.8 and 1.9.
A check for the state of the variable
ClusterDistributionManager.isCloseInProgress was changed to use the method
isCloseInProgress(), which invokes InternalDistributedSystem.isDisconnecting().
That method is flawed in that it will return true before the
InternalDistributedSystem's "dm" variable has been initialized, which is the
case during startup.
> Hang in ElderInitProcessor.init()
> ---------------------------------
>
> Key: GEODE-6823
> URL: https://issues.apache.org/jira/browse/GEODE-6823
> Project: Geode
> Issue Type: Bug
> Components: distributed lock service
> Affects Versions: 1.8.0, 1.9.0
> Reporter: Bruce Schuchardt
> Assignee: Bruce Schuchardt
> Priority: Major
>
> A locator and a server were spinning up at the same time and the locator
> became stuck trying to initialize a distributed lock service. Extra logging
> showed that the server received an ElderInitMessage that it decided to ignore
> because it thought it was shutting down.
>
> {noformat}
> gemfire2_2430/system.log: [info 2019/05/29 11:00:34.230 PDT
> <vm_2_thr_40_client2_host1_2430> tid=0x24] Initial (distribution manager)
> view,
> View[rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000|2]
> members:
> [rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000,
> rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2416:2416)<ec><v1>:41001{lead},
>
> rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2420:2420)<ec><v2>:41002,
>
> rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2430:2430)<ec><v2>:41003]
> gemfire2_2430/system.log: [debug 2019/05/29 11:00:34.463 PDT <unicast
> receiver,rs-GEM-2316-0906a2i3large-hydra-client-11-10705> tid=0x46] Received
> message 'ElderInitMessage (processorId='1)' from
> <rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000>
> gemfire2_2430/system.log: [debug 2019/05/29 11:00:34.574 PDT <Pooled Message
> Processor 2> tid=0x4d] Waiting for Elder to change. Expecting Elder to be
> rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000, is
> rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000.
> gemfire2_2430/system.log: [info 2019/05/29 11:00:34.575 PDT <Pooled Message
> Processor 2> tid=0x4d] ElderInitMessage (processorId='1): disregarding
> request from departed member.
> gemfire2_2430/system.log: [info 2019/05/29 11:00:35.238 PDT <unicast
> receiver,rs-GEM-2316-0906a2i3large-hydra-client-11-10705> tid=0x46] received
> new view:
> View[rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000|3]
> members:
> [rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000,
> rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2416:2416)<ec><v1>:41001{lead},
>
> rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2420:2420)<ec><v2>:41002,
>
> rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2430:2430)<ec><v2>:41003,
>
> rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2437:2437)<ec><v3>:41004]
> locator_ds_2354/system.log: [warn 2019/05/29 11:00:50.430 PDT <Pooled Message
> Processor 2> tid=0x38] 15 seconds have elapsed while waiting for replies:
> <ElderInitProcessor 1 waiting for 1 replies from
> [rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2430:2430)<ec><v2>:41003]>
> on rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000
> whose current membership list is:
> [[rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2437:2437)<ec><v3>:41004,
>
> rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2430:2430)<ec><v2>:41003,
> rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000,
> rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2420:2420)<ec><v2>:41002,
>
> rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2416:2416)<ec><v1>:41001]]
> [Stack #1 from bgexec15197_2354.log line 2]
> "Pooled Message Processor 2" #56 daemon prio=5 os_prio=0
> tid=0x000000000194e800 nid=0xae3 waiting on condition [0x00007f5c94dce000]
> java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x0000000775ff4f08> (a
> java.util.concurrent.CountDownLatch$Sync)
> at
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
> at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
> at
> org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:71)
> at
> org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:716)
> at
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:787)
> at
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:764)
> at
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:850)
> at
> org.apache.geode.distributed.internal.locks.ElderInitProcessor.init(ElderInitProcessor.java:69)
> at
> org.apache.geode.distributed.internal.locks.ElderState.<init>(ElderState.java:53)
> at
> org.apache.geode.distributed.internal.ClusterElderManager.lambda$new$0(ClusterElderManager.java:41)
> at
> org.apache.geode.distributed.internal.ClusterElderManager$$Lambda$64/1182435120.get(Unknown
> Source)
> at
> org.apache.geode.distributed.internal.ClusterElderManager.initializeElderState(ClusterElderManager.java:107)
> at
> org.apache.geode.distributed.internal.ClusterElderManager.getElderState(ClusterElderManager.java:98)
> at
> org.apache.geode.distributed.internal.ClusterDistributionManager.getElderState(ClusterDistributionManager.java:2885)
> at
> org.apache.geode.distributed.internal.locks.GrantorRequestProcessor$GrantorRequestMessage.basicProcess(GrantorRequestProcessor.java:494)
> at
> org.apache.geode.distributed.internal.locks.GrantorRequestProcessor$GrantorRequestMessage.process(GrantorRequestProcessor.java:489)
> at
> org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:369)
> at
> org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:435)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at
> org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:959)
> at
> org.apache.geode.distributed.internal.ClusterDistributionManager.doProcessingThread(ClusterDistributionManager.java:825)
> at
> org.apache.geode.distributed.internal.ClusterDistributionManager$$Lambda$77/482261794.invoke(Unknown
> Source)
> at
> org.apache.geode.internal.logging.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:121)
> at
> org.apache.geode.internal.logging.LoggingThreadFactory$$Lambda$78/2015475584.run(Unknown
> Source)
> at java.lang.Thread.run(Thread.java:748)
> so 2430 did not respond to the ElderInitMessage. That would happen if 2430
> was shutting down, if 2354 was no longer a member or if 2430 became the elder.
> {noformat}
> I added more logging to ClusterElderManager.waitForElder(), which is
> returning false and causing this "disregarding" message to be logged. What I
> see is that the server doing this thinks it's shutting down but it most
> definitely is not:
> {noformat}
> [debug 2019/05/29 15:58:48.488 PDT <unicast
> receiver,rs-GEM-2316-1539a0i3large-hydra-client-4-39597> tid=0x46] Received
> message 'ElderInitMessage (processorId=1)' from
> <rs-GEM-2316-1539a0i3large-hydra-client-4(15224:locator)<ec><v0>:41000>
> [debug 2019/05/29 15:58:48.510 PDT <Pooled Message Processor 2> tid=0x4d]
> Wait for elder terminated because we're shutting down
> [info 2019/05/29 15:58:48.511 PDT <Pooled Message Processor 2> tid=0x4d]
> ElderInitMessage (processorId=1): disregarding request from departed member.
> [debug 2019/05/29 15:58:48.599 PDT <unicast
> receiver,rs-GEM-2316-1539a0i3large-hydra-client-4-39597> tid=0x46] processing
> InstallViewMessage(type=PREPARE; Current ViewID=3; Previous View ID=0;
> View[rs-GEM-2316-1539a0i3large-hydra-client-4(15224:locator)<ec><v0>:41000|3]
> members:
> [rs-GEM-2316-1539a0i3large-hydra-client-4(15224:locator)<ec><v0>:41000,
> rs-GEM-2316-1539a0i3large-hydra-client-4(gemfire1_host1_15283:15283)<ec><v1>:41001{lead},
>
> rs-GEM-2316-1539a0i3large-hydra-client-4(gemfire1_host1_15290:15290)<ec><v2>:41002,
>
> rs-GEM-2316-1539a0i3large-hydra-client-4(gemfire2_host1_15295:15295)<ec><v3>:41004,
>
> rs-GEM-2316-1539a0i3large-hydra-client-4(gemfire2_host1_15304:15304)<ec><v3>:41003];
> cred=null){noformat}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)