Alberto Gomez created GEODE-10403:
-------------------------------------

             Summary: Distributed deadlock when stopping gateway sender
                 Key: GEODE-10403
                 URL: https://issues.apache.org/jira/browse/GEODE-10403
             Project: Geode
          Issue Type: Bug
          Components: wan
    Affects Versions: 1.15.0, 1.14.4, 1.13.8, 1.12.9
            Reporter: Alberto Gomez


A distributed deadlock was found while testing a Geode system with WAN 
replication. It occurs when the gateway sender is stopped while a fair number 
of operations are being sent to the servers.

The deadlock manifests as the gateway sender stop command hanging forever and 
all normal Geode operations from clients (gets, puts, ...) never receiving a 
response.
The situation is triggered by the gateway sender stop command, which first 
takes the lifecycle lock and then, at a given point, tries to retrieve the 
size of the gateway sender queue. This operation, which requires communication 
with the other peers, never finishes, probably because the response from one 
of the peers is never received.
Another thread is blocked trying to acquire the lifecycle lock in 
AbstractGatewaySender.distribute().
Finally, many threads handling Geode operations (get, put, ...) are blocked in 
the DistributedCacheOperation._distribute() call, waiting for a response from 
another peer.
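
The lock pattern described above can be imitated in miniature inside a single 
JVM. What follows is only a sketch under stated assumptions, not Geode code: 
the class LifecycleLockDeadlockSketch, the method simulateStop and the thread 
name "message-reader" are hypothetical, the remote reply is modelled as a 
CountDownLatch, the stop command is assumed to hold the write side of the 
lifecycle lock, and the thread that would deliver the reply is assumed to need 
the read side first (one plausible reading of the thread dumps, not something 
this report confirms). Timeouts are used only so that the demo terminates; the 
threads in the dumps stay blocked.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LifecycleLockDeadlockSketch {

  /** Result of one simulated stop attempt (hypothetical helper type). */
  public static final class Outcome {
    public final boolean replyArrived;
    public final boolean readerGotLock;

    Outcome(boolean replyArrived, boolean readerGotLock) {
      this.replyArrived = replyArrived;
      this.readerGotLock = readerGotLock;
    }
  }

  public static Outcome simulateStop(long timeoutMillis) {
    ReentrantReadWriteLock lifecycleLock = new ReentrantReadWriteLock();
    CountDownLatch reply = new CountDownLatch(1);           // stands in for the size reply
    CountDownLatch writerHoldsLock = new CountDownLatch(1); // orders the two threads
    AtomicBoolean readerGotLock = new AtomicBoolean(false);

    // Stands in for a message-reader thread: it can only "deliver the reply"
    // after passing through the read side of the lifecycle lock.
    Thread messageReader = new Thread(() -> {
      try {
        writerHoldsLock.await();
        if (lifecycleLock.readLock().tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
          try {
            readerGotLock.set(true);
            reply.countDown();
          } finally {
            lifecycleLock.readLock().unlock();
          }
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "message-reader");
    messageReader.start();

    // Stands in for the stop command: take the write lock, then wait for a reply.
    boolean replyArrived;
    lifecycleLock.writeLock().lock();
    try {
      writerHoldsLock.countDown();
      replyArrived = reply.await(timeoutMillis, TimeUnit.MILLISECONDS);
      // Keep the lock until the reader has given up, so the result is deterministic.
      messageReader.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while simulating stop", e);
    } finally {
      lifecycleLock.writeLock().unlock();
    }
    return new Outcome(replyArrived, readerGotLock.get());
  }

  public static void main(String[] args) {
    Outcome o = simulateStop(500);
    System.out.println("replyArrived=" + o.replyArrived
        + " readerGotLock=" + o.readerGotLock);
  }
}
```

In this sketch both fields come back false: the writer never sees the reply 
because the only thread able to produce it is parked on the lock the writer 
holds. Without the timeouts, neither thread would ever make progress.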

Thread dump of the blocked gateway sender stop command, in the call that gets 
the size of the queue:
"ConcurrentParallelGatewaySenderEventProcessor Stopper Thread4" #1319 daemon prio=10 os_prio=0 cpu=46.95ms elapsed=4152.76s tid=0x00007f92bc1bb000 nid=0x2157 waiting on condition  [0x00007f9179bd1000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
        - parking to wait for  <0x000000031ca2cbd8> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1079)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1369)
        at java.util.concurrent.CountDownLatch.await(java.base@11.0.11/CountDownLatch.java:278)
        at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)
        at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)
        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)
        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)
        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)
        at org.apache.geode.internal.cache.partitioned.SizeMessage$SizeResponse.waitBucketSizes(SizeMessage.java:344)
        at org.apache.geode.internal.cache.PartitionedRegion.getSizeRemotely(PartitionedRegion.java:6758)
        at org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6709)
        at org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6691)
        at org.apache.geode.internal.cache.PartitionedRegion.getRegionSize(PartitionedRegion.java:6663)
        at org.apache.geode.internal.cache.LocalRegionDataView.entryCount(LocalRegionDataView.java:99)
        at org.apache.geode.internal.cache.LocalRegion.entryCount(LocalRegion.java:2078)
        at org.apache.geode.internal.cache.LocalRegion.size(LocalRegion.java:8301)
        at org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.size(ParallelGatewaySenderQueue.java:1670)
        at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.closeProcessor(AbstractGatewaySenderEventProcessor.java:1259)
        at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.stopProcessing(AbstractGatewaySenderEventProcessor.java:1247)
        at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1399)
        at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1387)
        at java.util.concurrent.FutureTask.run(java.base@11.0.11/FutureTask.java:264)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)


Thread dump of the call to AbstractGatewaySender.distribute() that is blocked 
trying to acquire the lifecycle lock:
"P2P message reader for 192.168.78.164(eric-data-kvdb-ag-server-0:1)<v31>:41000 shared ordered uid=6 local port=60360 remote port=57246" #56 daemon prio=10 os_prio=0 cpu=462104.83ms elapsed=7095.02s tid=0x00007f93a8007800 nid=0x50 waiting on condition  [0x00007f93e59d0000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
        - parking to wait for  <0x00000000ed9cb9f0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.11/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.11/AbstractQueuedSynchronizer.java:885)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(java.base@11.0.11/AbstractQueuedSynchronizer.java:1009)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(java.base@11.0.11/AbstractQueuedSynchronizer.java:1324)
        at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(java.base@11.0.11/ReentrantReadWriteLock.java:738)
        at org.apache.geode.internal.cache.wan.AbstractGatewaySender.distribute(AbstractGatewaySender.java:1104)
        at org.apache.geode.internal.cache.LocalRegion.notifyGatewaySender(LocalRegion.java:6144)
        at org.apache.geode.internal.cache.LocalRegion.notifyGatewaySender(LocalRegion.java:6108)
        at org.apache.geode.internal.cache.BucketRegion.notifyGatewaySender(BucketRegion.java:719)
        at org.apache.geode.internal.cache.LocalRegion.basicPutPart2(LocalRegion.java:5775)
        at org.apache.geode.internal.cache.BucketRegion.basicPutPart2(BucketRegion.java:704)
        at org.apache.geode.internal.cache.map.RegionMapPut.doBeforeCompletionActions(RegionMapPut.java:282)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutAndDeliverEvent(AbstractRegionMapPut.java:301)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$515/0x00000008006e0440.run(Unknown Source)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWithIndexUpdatingInProgress(AbstractRegionMapPut.java:308)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutIfPreconditionsSatisified(AbstractRegionMapPut.java:296)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnSynchronizedRegionEntry(AbstractRegionMapPut.java:282)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnRegionEntryInMap(AbstractRegionMapPut.java:273)
        - locked <0x0000000136123330> (a org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.addRegionEntryToMapAndDoPut(AbstractRegionMapPut.java:251)
        - locked <0x0000000136123330> (a org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutRetryingIfNeeded(AbstractRegionMapPut.java:216)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$514/0x00000008006e0040.run(Unknown Source)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doWithIndexInUpdateMode(AbstractRegionMapPut.java:198)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPut(AbstractRegionMapPut.java:180)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$513/0x00000008006ca440.run(Unknown Source)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWhileLockedForCacheModification(AbstractRegionMapPut.java:119)
        at org.apache.geode.internal.cache.map.RegionMapPut.runWhileLockedForCacheModification(RegionMapPut.java:161)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.put(AbstractRegionMapPut.java:169)
        at org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2033)
        at org.apache.geode.internal.cache.BucketRegion.virtualPut(BucketRegion.java:530)
        at org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:170)
        at org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5571)
        at org.apache.geode.internal.cache.AbstractUpdateOperation.doPutOrCreate(AbstractUpdateOperation.java:194)
        at org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.basicOperateOnRegion(AbstractUpdateOperation.java:307)
        at org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.operateOnRegion(AbstractUpdateOperation.java:278)
        at org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.basicProcess(DistributedCacheOperation.java:1208)
        at org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.process(DistributedCacheOperation.java:1110)
        at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:376)
        at org.apache.geode.distributed.internal.DistributionMessage.schedule(DistributionMessage.java:432)
        at org.apache.geode.distributed.internal.ClusterDistributionManager.scheduleIncomingMessage(ClusterDistributionManager.java:2060)
        at org.apache.geode.distributed.internal.ClusterDistributionManager.handleIncomingDMsg(ClusterDistributionManager.java:1826)
        at org.apache.geode.distributed.internal.ClusterDistributionManager$$Lambda$178/0x0000000800380440.messageReceived(Unknown Source)
        at org.apache.geode.distributed.internal.membership.gms.GMSMembership.dispatchMessage(GMSMembership.java:936)
        at org.apache.geode.distributed.internal.membership.gms.GMSMembership.handleOrDeferMessage(GMSMembership.java:867)
        at org.apache.geode.distributed.internal.membership.gms.GMSMembership.processMessage(GMSMembership.java:1209)
        at org.apache.geode.distributed.internal.DistributionImpl$MyDCReceiver.messageReceived(DistributionImpl.java:828)
        at org.apache.geode.distributed.internal.direct.DirectChannel.receive(DirectChannel.java:614)
        at org.apache.geode.internal.tcp.TCPConduit.messageReceived(TCPConduit.java:679)
        at org.apache.geode.internal.tcp.Connection.dispatchMessage(Connection.java:3261)
        at org.apache.geode.internal.tcp.Connection.readMessage(Connection.java:2988)
        at org.apache.geode.internal.tcp.Connection.processInputBuffer(Connection.java:2794)
        at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1648)
        at org.apache.geode.internal.tcp.Connection.run(Connection.java:1479)
        at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)


Thread dump of a call to DistributedCacheOperation._distribute() that is 
blocked waiting for a response from a remote peer:
"ServerConnection on port 40404 Thread 2" #88 daemon prio=5 os_prio=0 cpu=81268.62ms elapsed=7050.08s tid=0x00007f8160001800 nid=0x73 waiting on condition  [0x00007f8196f57000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
        - parking to wait for  <0x000000031befad38> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1079)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1369)
        at java.util.concurrent.CountDownLatch.await(java.base@11.0.11/CountDownLatch.java:278)
        at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)
        at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)
        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)
        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)
        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)
        at org.apache.geode.internal.cache.DistributedCacheOperation.waitForAckIfNeeded(DistributedCacheOperation.java:779)
        at org.apache.geode.internal.cache.DistributedCacheOperation._distribute(DistributedCacheOperation.java:676)
        at org.apache.geode.internal.cache.DistributedCacheOperation.startOperation(DistributedCacheOperation.java:277)
        at org.apache.geode.internal.cache.BucketRegion.basicPutPart2(BucketRegion.java:694)
        at org.apache.geode.internal.cache.map.RegionMapPut.doBeforeCompletionActions(RegionMapPut.java:282)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutAndDeliverEvent(AbstractRegionMapPut.java:301)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$514/0x00000008006c9c40.run(Unknown Source)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWithIndexUpdatingInProgress(AbstractRegionMapPut.java:308)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutIfPreconditionsSatisified(AbstractRegionMapPut.java:296)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnSynchronizedRegionEntry(AbstractRegionMapPut.java:282)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnRegionEntryInMap(AbstractRegionMapPut.java:273)
        - locked <0x00000001eb353770> (a org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.addRegionEntryToMapAndDoPut(AbstractRegionMapPut.java:251)
        - locked <0x00000001eb353770> (a org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutRetryingIfNeeded(AbstractRegionMapPut.java:216)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$513/0x00000008006c9840.run(Unknown Source)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doWithIndexInUpdateMode(AbstractRegionMapPut.java:198)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPut(AbstractRegionMapPut.java:180)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$512/0x00000008006ca440.run(Unknown Source)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWhileLockedForCacheModification(AbstractRegionMapPut.java:119)
        at org.apache.geode.internal.cache.map.RegionMapPut.runWhileLockedForCacheModification(RegionMapPut.java:161)
        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.put(AbstractRegionMapPut.java:169)
        at org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2033)
        at org.apache.geode.internal.cache.BucketRegion.virtualPut(BucketRegion.java:530)
        at org.apache.geode.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5578)
        at org.apache.geode.internal.cache.PartitionedRegionDataStore.putLocally(PartitionedRegionDataStore.java:1213)
        at org.apache.geode.internal.cache.PartitionedRegion.putInBucket(PartitionedRegion.java:3005)
        at org.apache.geode.internal.cache.PartitionedRegion.virtualPut(PartitionedRegion.java:2215)
        at org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:170)
        at org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5571)
        at org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5531)
        at org.apache.geode.internal.cache.LocalRegion.basicBridgePut(LocalRegion.java:5210)
        at org.apache.geode.internal.cache.tier.sockets.command.Put65.cmdExecute(Put65.java:411)
        at org.apache.geode.internal.cache.tier.sockets.BaseCommand.execute(BaseCommand.java:183)
        at org.apache.geode.internal.cache.tier.sockets.ServerConnection.doNormalMessage(ServerConnection.java:848)
        at org.apache.geode.internal.cache.tier.sockets.OriginalServerConnection.doOneMessage(OriginalServerConnection.java:72)
        at org.apache.geode.internal.cache.tier.sockets.ServerConnection.run(ServerConnection.java:1181)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
        at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.lambda$initializeServerConnectionThreadPool$3(AcceptorImpl.java:691)
        at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl$$Lambda$495/0x00000008006be440.invoke(Unknown Source)
        at org.apache.geode.logging.internal.executors.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:120)
        at org.apache.geode.logging.internal.executors.LoggingThreadFactory$$Lambda$166/0x000000080034c040.run(Unknown Source)
        at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)





--
This message was sent by Atlassian Jira
(v8.20.10#820010)
