Bruce J Schuchardt created GEODE-8473:
-----------------------------------------

             Summary: Hang in ReplyProcessor21 when forced-disconnect does not 
establish a cancellation cause
                 Key: GEODE-8473
                 URL: https://issues.apache.org/jira/browse/GEODE-8473
             Project: Geode
          Issue Type: Bug
          Components: membership
    Affects Versions: 1.13.0
            Reporter: Bruce J Schuchardt


I suspect this is due to the recent Membership refactoring.  In a test that 
exposed GEODE-8467 I saw an application thread from before the 
forced-disconnect still hanging around waiting for a response.
{noformat}
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000000ea5c43c0> (a java.util.concurrent.CountDownLatch$Sync)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
    at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
    at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)
    at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)
    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)
    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)
    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)
    at org.apache.geode.internal.cache.partitioned.SizeMessage$SizeResponse.waitBucketSizes(SizeMessage.java:344)
    at org.apache.geode.internal.cache.PartitionedRegion.getSizeRemotely(PartitionedRegion.java:6752)
    at org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6703)
    at org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6685)
    at org.apache.geode.internal.cache.PartitionedRegion.getRegionSize(PartitionedRegion.java:6657)
    at org.apache.geode.internal.cache.LocalRegionDataView.entryCount(LocalRegionDataView.java:99)
    at org.apache.geode.internal.cache.LocalRegion.entryCount(LocalRegion.java:2078)
    at org.apache.geode.internal.cache.LocalRegion.size(LocalRegion.java:8288)
    at util.TestHelper.getRegionStr(TestHelper.java:1669)
    at util.TestHelper.regionHierarchyToString(TestHelper.java:1654)
    at util.TestHelper.logRegionHierarchy(TestHelper.java:1639)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at hydra.MethExecutor.execute(MethExecutor.java:173)
    at hydra.MethExecutor.execute(MethExecutor.java:141)
    at hydra.TestTask.execute(TestTask.java:197)
    at hydra.RemoteTestModule$1.run(RemoteTestModule.java:213)
{noformat}
ReplyProcessor21 uses a StoppableCountDownLatch to wait for a response.  This 
latch loops waiting for the count to reach zero, but on each pass it also checks 
ClusterDistributionManager's CancelCriterion to see if the system is shutting 
down.  If so, it stops waiting for a response.
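
The wait loop described above can be sketched roughly as follows. This is a 
simplified illustration, not Geode's actual code: CancelCheck, 
RootCauseCancelCheck and SimpleStoppableLatch are hypothetical stand-ins for 
CancelCriterion and StoppableCountDownLatch, and the retry interval is an 
assumed parameter.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class StoppableLatchSketch {

  /** Hypothetical stand-in for a cancel criterion. */
  interface CancelCheck {
    /** Throws if a shutdown cause has been established; otherwise returns. */
    void checkCancelInProgress();
  }

  /** Cancel check that only trips once some thread sets a root cause. */
  static final class RootCauseCancelCheck implements CancelCheck {
    private final AtomicReference<Throwable> rootCause = new AtomicReference<>();

    void setRootCause(Throwable cause) {
      rootCause.compareAndSet(null, cause); // first cause wins
    }

    @Override
    public void checkCancelInProgress() {
      Throwable cause = rootCause.get();
      if (cause != null) {
        throw new IllegalStateException("system is shutting down", cause);
      }
    }
  }

  /** Latch whose await() re-checks the cancel check between timed waits. */
  static final class SimpleStoppableLatch {
    private final CountDownLatch latch;
    private final CancelCheck stopper;
    private final long retryIntervalMillis;

    SimpleStoppableLatch(int count, CancelCheck stopper, long retryIntervalMillis) {
      this.latch = new CountDownLatch(count);
      this.stopper = stopper;
      this.retryIntervalMillis = retryIntervalMillis;
    }

    void countDown() {
      latch.countDown();
    }

    void await() throws InterruptedException {
      do {
        // If no root cause is ever set, this never throws and the loop
        // keeps parking in timed waits, as in the thread dump above.
        stopper.checkCancelInProgress();
      } while (!latch.await(retryIntervalMillis, TimeUnit.MILLISECONDS));
    }
  }

  public static void main(String[] args) throws Exception {
    RootCauseCancelCheck stopper = new RootCauseCancelCheck();
    SimpleStoppableLatch latch = new SimpleStoppableLatch(1, stopper, 50);

    // Simulates the shutdown thread that establishes the root cause.
    Thread shutdown =
        new Thread(() -> stopper.setRootCause(new RuntimeException("forced disconnect")));
    shutdown.start();

    try {
      latch.await();
      System.out.println("reply received");
    } catch (IllegalStateException e) {
      System.out.println("wait abandoned: " + e.getCause().getMessage());
    }
  }
}
```

In this sketch, if the shutdown thread is never started, await() has no way to 
exit until a reply arrives, which matches the TIMED_WAITING state seen in the 
dump.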

Due to GEODE-8467 the thread that sets the CancelCriterion's shutdown 
"rootCause" is never started.  Either Membership needs to ensure that this 
upward notification happens or ClusterDistributionManager's CancelCriterion 
needs to check with the Services.Stopper in GMSMembership to see if a 
"rootCause" has been established there.

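The second option could look something like the sketch below, where the 
distribution-level cancel check falls back to the membership-level stopper 
when its own "rootCause" has not been set. All class and method names here 
(MembershipStopper, DistributionStopper, getShutdownCause, cancellationCause) 
are hypothetical; the real Services.Stopper and CancelCriterion APIs may 
differ.

```java
import java.util.concurrent.atomic.AtomicReference;

public class DelegatingCancelSketch {

  /** Hypothetical stand-in for the Services.Stopper in GMSMembership. */
  static final class MembershipStopper {
    private final AtomicReference<Throwable> shutdownCause = new AtomicReference<>();

    void setShutdownCause(Throwable cause) {
      shutdownCause.compareAndSet(null, cause);
    }

    Throwable getShutdownCause() {
      return shutdownCause.get();
    }
  }

  /** Hypothetical stand-in for ClusterDistributionManager's CancelCriterion. */
  static final class DistributionStopper {
    private final AtomicReference<Throwable> rootCause = new AtomicReference<>();
    private final MembershipStopper membershipStopper;

    DistributionStopper(MembershipStopper membershipStopper) {
      this.membershipStopper = membershipStopper;
    }

    void setRootCause(Throwable cause) {
      rootCause.compareAndSet(null, cause);
    }

    /** Returns the local root cause, or else the cause known to membership. */
    Throwable cancellationCause() {
      Throwable own = rootCause.get();
      return own != null ? own : membershipStopper.getShutdownCause();
    }

    boolean isCancelInProgress() {
      return cancellationCause() != null;
    }
  }

  public static void main(String[] args) {
    MembershipStopper membership = new MembershipStopper();
    DistributionStopper distribution = new DistributionStopper(membership);

    // The upward notification never happens, but membership knows the cause.
    membership.setShutdownCause(new RuntimeException("forced disconnect"));

    System.out.println("cancel in progress: " + distribution.isCancelInProgress());
  }
}
```

With this arrangement a waiting thread would see the cancellation even if the 
notification thread is never started, at the cost of coupling the two stoppers.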

