[ 
https://issues.apache.org/jira/browse/GEODE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darrel Schneider resolved GEODE-3940.
-------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.4.0

> Backup can hang while trying to get a lock 
> -------------------------------------------
>
>                 Key: GEODE-3940
>                 URL: https://issues.apache.org/jira/browse/GEODE-3940
>             Project: Geode
>          Issue Type: Bug
>          Components: persistence
>            Reporter: Lynn Gallinat
>            Assignee: Darrel Schneider
>             Fix For: 1.4.0
>
>
> {noformat}
> Backup can hang when createKrf cannot get the compactor lock.
> "Pooled Message Processor 2" #196 daemon prio=10 os_prio=0 
> tid=0x00007fef9801e000 nid=0x3cf3 waiting on condition [0x00007ff094cd5000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00000000f1a72c60> (a 
> java.util.concurrent.locks.ReentrantLock$NonfairSync)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>         at 
> java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
>         at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
>         at 
> org.apache.geode.internal.cache.Oplog.lockCompactor(Oplog.java:6046)    
> <============ trying to get compactor lock
>         at org.apache.geode.internal.cache.Oplog.createKrf(Oplog.java:4157)   
>      <============ in createKrf
>         at org.apache.geode.internal.cache.Oplog.finishKrf(Oplog.java:7746)
>         at 
> org.apache.geode.internal.cache.BackupManager.backupOplog(BackupManager.java:580)
>         at 
> org.apache.geode.internal.cache.BackupManager.completeBackup(BackupManager.java:270)
>         at 
> org.apache.geode.internal.cache.BackupManager.doBackup(BackupManager.java:139)
>         at 
> org.apache.geode.admin.internal.FinishBackupRequest.createResponse(FinishBackupRequest.java:102)
>         at 
> org.apache.geode.internal.admin.remote.CliLegacyMessage.process(CliLegacyMessage.java:39)
>         at 
> org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:374)
>         at 
> org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:440)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at 
> org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:666)
>         at 
> org.apache.geode.distributed.internal.DistributionManager$4$1.run(DistributionManager.java:923)
>         at java.lang.Thread.run(Thread.java:748)
> The compactor thread already has the compactor lock.
> It is waiting for a DiskEntry lock.
> "Idle OplogCompactor" #379 daemon prio=10 os_prio=0 tid=0x00007ff020026000 
> nid=0x6916 waiting for monitor entry [0x00007ff08c7ce000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at 
> org.apache.geode.internal.cache.Oplog.writeOneKeyEntryForKRF(Oplog.java:3924)
>         - waiting to lock <0x00000000f1c55c70> (a 
> org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapStringKey2)
>   <====== waiting for DiskEntry lock
>         at org.apache.geode.internal.cache.Oplog.createKrf(Oplog.java:4201)   
>        <======= already in createKrf
>         - locked <0x00000000f1a72c30> (a 
> java.util.concurrent.atomic.AtomicBoolean)
>         at org.apache.geode.internal.cache.Oplog$2.run(Oplog.java:3875)
>         at 
> org.apache.geode.internal.cache.DiskStoreImpl$5.run(DiskStoreImpl.java:4386)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> The DiskEntry lock is held by this thread and is waiting for replies:
> "PartitionedRegion Message Processor26" #238 daemon prio=10 os_prio=0 
> tid=0x00007fef9c01f800 nid=0x3dd0 waiting on condition [0x00007ff08ebe9000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00000000f1e42200> (a 
> java.util.concurrent.CountDownLatch$Sync)
>         at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>         at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
>         at 
> org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:64)
>         at 
> org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:718)
>         at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:795)
>         at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:771)
>         at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:858)
>         at 
> org.apache.geode.internal.cache.DistributedCacheOperation.waitForAckIfNeeded(DistributedCacheOperation.java:761)
>         at 
> org.apache.geode.internal.cache.DistributedCacheOperation._distribute(DistributedCacheOperation.java:658)
>         at 
> org.apache.geode.internal.cache.DistributedCacheOperation.startOperation(DistributedCacheOperation.java:264)
>         at 
> org.apache.geode.internal.cache.BucketRegion.basicPutPart2(BucketRegion.java:665)
>         at 
> org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2839)
>         - locked <0x00000000f1c55c70> (a 
> org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapStringKey2)
>   <======= has the DiskEntryLock
>         at 
> org.apache.geode.internal.cache.BucketRegion.virtualPut(BucketRegion.java:502)
>         at 
> org.apache.geode.internal.cache.PartitionedRegionDataStore.putLocally(PartitionedRegionDataStore.java:1222)
>         at 
> org.apache.geode.internal.cache.PartitionedRegionDataStore.putLocally(PartitionedRegionDataStore.java:1205)
>         at 
> org.apache.geode.internal.cache.PartitionedRegionDataView.putEntryOnRemote(PartitionedRegionDataView.java:99)
>         at 
> org.apache.geode.internal.cache.partitioned.PutMessage.operateOnPartitionedRegion(PutMessage.java:744)
>         at 
> org.apache.geode.internal.cache.partitioned.PartitionMessage.process(PartitionMessage.java:333)
>         at 
> org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:374)
>         at 
> org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:440)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at 
> org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:666)
>         at 
> org.apache.geode.distributed.internal.DistributionManager$8$1.run(DistributionManager.java:1069)
>         at java.lang.Thread.run(Thread.java:748)
> The member replying can't because it's waiting for the backup lock.
> "P2P message reader for 
> rs-CommunicationsBTTest-2017-10-27-08-42-54-client-8(gemfire3_rs-CommunicationsBTTest-2017-10-27-08-42-54-client-8_14506:14506)<ec><v1>:1028
>  shared ordered uid=6 port=40404" #152 daemon prio=10 os_prio=0 
> tid=0x00007ff91800d800 nid=0x3bf3 waiting on condition [0x00007ffa29d4a000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00000000f0b63ad0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
>         at org.apache.geode.internal.cache.BackupLock.lock(BackupLock.java:95)
>         at org.apache.geode.internal.cache.Oplog.basicModify(Oplog.java:4612)
>         at org.apache.geode.internal.cache.Oplog.modify(Oplog.java:4450)
>         at 
> org.apache.geode.internal.cache.PersistentOplogSet.modify(PersistentOplogSet.java:188)
>         at 
> org.apache.geode.internal.cache.DiskStoreImpl.put(DiskStoreImpl.java:730)
>         at org.apache.geode.internal.cache.DiskRegion.put(DiskRegion.java:351)
>         at 
> org.apache.geode.internal.cache.entries.DiskEntry$Helper.writeBytesToDisk(DiskEntry.java:828)
>         at 
> org.apache.geode.internal.cache.entries.DiskEntry$Helper.basicUpdate(DiskEntry.java:932)
>         at 
> org.apache.geode.internal.cache.entries.DiskEntry$Helper.update(DiskEntry.java:859)
>         - locked <0x00000000f1610f70> (a 
> org.apache.geode.internal.cache.DiskId$PersistenceWithIntOffset)
>         at 
> org.apache.geode.internal.cache.entries.AbstractDiskRegionEntry.setValue(AbstractDiskRegionEntry.java:40)
>         at 
> org.apache.geode.internal.cache.entries.AbstractRegionEntry.setValueWithTombstoneCheck(AbstractRegionEntry.java:307)
>         at 
> org.apache.geode.internal.cache.EntryEventImpl.setNewValueInRegion(EntryEventImpl.java:1651)
>         at 
> org.apache.geode.internal.cache.EntryEventImpl.putExistingEntry(EntryEventImpl.java:1527)
>         at 
> org.apache.geode.internal.cache.AbstractRegionMap.updateEntry(AbstractRegionMap.java:2993)
>         at 
> org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2816)
>         - locked <0x00000000f1610f28> (a 
> org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapStringKey2)
>         at 
> org.apache.geode.internal.cache.BucketRegion.virtualPut(BucketRegion.java:502)
>         at 
> org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:152)
>         at 
> org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5584)
>         at 
> org.apache.geode.internal.cache.AbstractUpdateOperation.doPutOrCreate(AbstractUpdateOperation.java:165)
>         at 
> org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.basicOperateOnRegion(AbstractUpdateOperation.java:272)
>         at 
> org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.operateOnRegion(AbstractUpdateOperation.java:243)
>         at 
> org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.basicProcess(DistributedCacheOperation.java:1190)
>         at 
> org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.process(DistributedCacheOperation.java:1091)
>         at 
> org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:374)
>         at 
> org.apache.geode.distributed.internal.DistributionMessage.schedule(DistributionMessage.java:432)
>         at 
> org.apache.geode.distributed.internal.DistributionManager.scheduleIncomingMessage(DistributionManager.java:3552)
>         at 
> org.apache.geode.distributed.internal.DistributionManager.handleIncomingDMsg(DistributionManager.java:3186)
>         at 
> org.apache.geode.distributed.internal.DistributionManager$MyListener.messageReceived(DistributionManager.java:4361)
>         at 
> org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.dispatchMessage(GMSMembershipManager.java:1127)
>         at 
> org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.handleOrDeferMessage(GMSMembershipManager.java:1045)
>         at 
> org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager$MyDCReceiver.messageReceived(GMSMembershipManager.java:408)
>         at 
> org.apache.geode.distributed.internal.direct.DirectChannel.receive(DirectChannel.java:714)
>         at 
> org.apache.geode.internal.tcp.TCPConduit.messageReceived(TCPConduit.java:874)
>         at 
> org.apache.geode.internal.tcp.Connection.dispatchMessage(Connection.java:3966)
>         at 
> org.apache.geode.internal.tcp.Connection.processNIOBuffer(Connection.java:3552)
>         at 
> org.apache.geode.internal.tcp.Connection.runNioReader(Connection.java:1828)
>         at org.apache.geode.internal.tcp.Connection.run(Connection.java:1689)
>         at java.lang.Thread.run(Thread.java:748)
> Note that the replying member is also the member that initiated the backup:
> "vm_4_thr_43_client5_rs-CommunicationsBTTest-2017-10-27-08-42-54-client-8_14524"
>  #480 daemon prio=5 os_prio=0 tid=0x00007ff99c0b2800 nid=0x6911 waiting on 
> condition [0x00007ffa208cf000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00000000f214ef08> (a 
> java.util.concurrent.CountDownLatch$Sync)
>         at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>         at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
>         at 
> org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:77)
>         at 
> org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:694)
>         at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForReplies(ReplyProcessor21.java:644)
>         at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForReplies(ReplyProcessor21.java:624)
>         at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForReplies(ReplyProcessor21.java:519)
>         at 
> org.apache.geode.admin.internal.FinishBackupRequest.send(FinishBackupRequest.java:80)
>         at 
> org.apache.geode.admin.internal.BackupDataStoreHelper.backupAllMembers(BackupDataStoreHelper.java:47)
>         at 
> org.apache.geode.internal.cache.BackupUtil.backupAllMembers(BackupUtil.java:50)
>         at 
> org.apache.geode.admin.internal.AdminDistributedSystemImpl.backupAllMembers(AdminDistributedSystemImpl.java:2315)
>         at 
> org.apache.geode.admin.internal.AdminDistributedSystemImpl.backupAllMembers(AdminDistributedSystemImpl.java:2310)
>         at 
> org.apache.geode.admin.internal.AdminDistributedSystemImpl.backupAllMembers(AdminDistributedSystemImpl.java:2299)
>         at util.PersistenceUtil.doOnlineBackup(PersistenceUtil.java:627)
>         at parReg.ParRegTest.doConcOpsAndVerify(ParRegTest.java:1822)
>         at 
> parReg.ParRegTest.HydraTask_doConcOpsAndVerify(ParRegTest.java:1011)
>         at sun.reflect.GeneratedMethodAccessor401.invoke(Unknown Source)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at hydra.MethExecutor.execute(MethExecutor.java:182)
>         at hydra.MethExecutor.execute(MethExecutor.java:150)
>         at hydra.TestTask.execute(TestTask.java:191)
>         at hydra.RemoteTestModule$1.run(RemoteTestModule.java:212)
> The replying member is still holding the backup lock because it waits for all 
> other members to reply to the FinishBackupRequest message
> before it processes its backup and releases the lock:
>   From FinishBackupRequest.java
>   public static Map<DistributedMember, Set<PersistentID>> send(DM dm, Set 
> recipients,
>       File targetDir, File baselineDir, boolean abort) {
>     FinishBackupRequest request = new FinishBackupRequest(targetDir, 
> baselineDir, abort);
>     request.setRecipients(recipients);
>     FinishBackupReplyProcessor replyProcessor = new 
> FinishBackupReplyProcessor(dm, recipients);
>     request.msgId = replyProcessor.getProcessorId();
>     dm.putOutgoing(request);
>     try {
>       replyProcessor.waitForReplies();   <======= wait for all other members 
> (but they won't finish until this member does its work; this is where we are 
> stuck)
>     } catch (ReplyException e) {
>       if (!(e.getCause() instanceof CancelException)) {
>         throw e;
>       }
>     } catch (InterruptedException e) {
>       e.printStackTrace();
>     }
>     AdminResponse response = request.createResponse((DistributionManager) 
> dm);   <====== this and the following lines do this member's work
>     response.setSender(dm.getDistributionManagerId());
>     replyProcessor.process(response);
>     return replyProcessor.results;
>   }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to