[ https://issues.apache.org/jira/browse/GEODE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shelley Lynn Hughes-Godfrey updated GEODE-4892: ----------------------------------------------- Affects Version/s: 1.6.0 > CI failure: geode-wan distributedTest hung in > org.apache.geode.internal.cache.wan.wancommand.StatusGatewaySenderCommandDUnitTest > testGatewaySenderStatus_OnMember Ended > ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- > > Key: GEODE-4892 > URL: https://issues.apache.org/jira/browse/GEODE-4892 > Project: Geode > Issue Type: Bug > Components: distributed lock service, gfsh > Affects Versions: 1.6.0 > Reporter: Shelley Lynn Hughes-Godfrey > Priority: Major > > geode-wan distributed tests hung after 8 hours (usually takes 5-6 hours) > https://concourse.apachegeode-ci.info/teams/main/pipelines/develop/jobs/DistributedTest/builds/207 > The progress files show this DistributedTest run starting ~4 am ... but it > also shows this test completing at 9:05 ... which is only 5 hours. Are we > somehow stuck in cache close for 3 hours before the hang is declared while > trying to shut down the DUnit VM. > dunit-hangs.txt > ``` > Started @ 2018-03-17 03:57:50.567 +0000 > Ended @ 2018-03-17 03:59:28.268 +0000 > Started @ 2018-03-17 04:00:05.872 +0000 > Ended @ 2018-03-17 04:52:41.879 +0000 > Started @ 2018-03-17 04:53:00.358 +0000 > Ended @ 2018-03-17 04:55:07.622 +0000 > Started @ 2018-03-17 04:55:33.550 +0000 > Ended @ 2018-03-17 07:34:11.639 +0000 > Started @ 2018-03-17 07:34:51.022 +0000 > Ended @ 2018-03-17 08:08:34.849 +0000 > Started @ 2018-03-17 08:09:33.853 +0000 > Ended @ 2018-03-17 08:31:43.646 +0000 > Started @ 2018-03-17 08:31:54.175 +0000 > Ended @ 2018-03-17 08:32:02.523 +0000 > Started @ 2018-03-17 08:32:13.946 +0000 > Ended @ 2018-03-17 08:32:34.626 +0000 > Started @ 2018-03-17 08:33:20.100 +0000 > 2018-03-17 08:56:15.914 +0000 > org.apache.geode.internal.cache.wan.wancommand.StatusGatewaySenderCommandDUnitTest > testGatewaySenderStatus_OnMember > Ended @ 2018-03-17 09:05:29.197 +0000 > ``` > Last distributedTest running: > ``` > Started @ 2018-03-17 08:33:20.100 +0000 > 2018-03-17 08:56:15.914 +0000 > org.apache.geode.internal.cache.wan.wancommand.StatusGatewaySenderCommandDUnitTest > testGatewaySenderStatus_OnMember > Ended @ 2018-03-17 09:05:29.197 +0000 > ``` > Stack dumps are taken at 5-10 second intervals and they all look the same ... > so it looks like 433 is stuck in waiting during cache close => > distributedLockServiceDestroys waiting for replies from 493 for ElderInit. > 493 is in ClusterDistributionManager.elderChangeWait(). > ********* Dumping stack for process 433: > 2018-03-17 10:55:59 > ``` > "RMI TCP Connection(1)-172.17.0.3" #30 daemon prio=5 os_prio=0 > tid=0x00007f9760001800 nid=0x1ec waiting on condition [0x00007f975bffb000] > java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00000000e08dd870> (a > java.util.concurrent.CountDownLatch$Sync) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) > at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) > at > org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:64) > at > org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:715) > at > org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:790) > at > org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:766) > at > org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:853) > at > org.apache.geode.distributed.internal.locks.ElderInitProcessor.init(ElderInitProcessor.java:72) > at > org.apache.geode.distributed.internal.locks.ElderState.<init>(ElderState.java:56) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.getElderStateWithTryLock(ClusterDistributionManager.java:3359) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.getElderState(ClusterDistributionManager.java:3309) > at > org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.startElderCall(GrantorRequestProcessor.java:238) > at > org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.basicOp(GrantorRequestProcessor.java:347) > at > org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.basicOp(GrantorRequestProcessor.java:327) > at > org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.clearGrantor(GrantorRequestProcessor.java:318) > at > org.apache.geode.distributed.internal.locks.DLockService.clearGrantor(DLockService.java:872) > at > org.apache.geode.distributed.internal.locks.DLockGrantor.destroy(DLockGrantor.java:1227) > - locked <0x00000000e09a8768> (a > org.apache.geode.distributed.internal.locks.DLockGrantor) > at > org.apache.geode.distributed.internal.locks.DLockService.nullLockGrantorId(DLockService.java:646) > at > org.apache.geode.distributed.internal.locks.DLockService.basicDestroy(DLockService.java:2358) > at > org.apache.geode.distributed.internal.locks.DLockService.destroyAndRemove(DLockService.java:2276) > - locked <0x00000000e040b400> (a java.lang.Object) > at > org.apache.geode.distributed.internal.locks.DLockService.destroyServiceNamed(DLockService.java:2214) > at > org.apache.geode.distributed.DistributedLockService.destroy(DistributedLockService.java:84) > at > org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderImpl.stop(SerialGatewaySenderImpl.java:136) > at > org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2169) > - locked <0x00000000e03ddf60> (a java.lang.Class for > org.apache.geode.internal.cache.GemFireCacheImpl) > at > org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1961) > at > org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1951) > at > org.apache.geode.test.junit.rules.ServerStarterRule.stopMember(ServerStarterRule.java:102) > at > org.apache.geode.test.junit.rules.MemberStarterRule.after(MemberStarterRule.java:83) > at > org.apache.geode.test.dunit.rules.ClusterStartupRule.stopElementInsideVM(ClusterStartupRule.java:326) > at > org.apache.geode.test.junit.rules.VMProvider.lambda$stopVM$fe0d42dc$1(VMProvider.java:35) > at > org.apache.geode.test.junit.rules.VMProvider$$Lambda$79/501183203.run(Unknown > Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > ``` > ********* Dumping stack for process 493: > 2018-03-17 10:56:02 > ``` > "Pooled Message Processor 1" #179 daemon prio=10 os_prio=0 > tid=0x00007f3640002800 nid=0x763 in Object.wait() [0x00007f35edd65000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:502) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.elderChangeWait(ClusterDistributionManager.java:3487) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.waitForElder(ClusterDistributionManager.java:3431) > - locked <0x00000000e069bde0> (a java.lang.Object) > at > org.apache.geode.distributed.internal.locks.ElderInitProcessor$ElderInitMessage.process(ElderInitProcessor.java:178) > at > org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:382) > at > org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:448) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1118) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109) > at > org.apache.geode.distributed.internal.ClusterDistributionManager$4$1.run(ClusterDistributionManager.java:789) > at java.lang.Thread.run(Thread.java:748) > Locked ownable synchronizers: > - <0x00000000e07073a8> (a > java.util.concurrent.ThreadPoolExecutor$Worker) > ``` -- This message was sent by Atlassian JIRA (v7.6.3#76005)