[ https://issues.apache.org/jira/browse/GEODE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257545#comment-16257545 ]
ASF subversion and git services commented on GEODE-3940: -------------------------------------------------------- Commit d59c73030b0d5123158829e8a32acfe28f6b1a9d in geode's branch refs/heads/feature/GEODE-3781 from [~dschneider] [ https://gitbox.apache.org/repos/asf?p=geode.git;h=d59c730 ] Merge pull request #1053 from apache/feature/GEODE-3940 Feature/geode 3940 > Backup can hang while trying to get a lock > ------------------------------------------- > > Key: GEODE-3940 > URL: https://issues.apache.org/jira/browse/GEODE-3940 > Project: Geode > Issue Type: Bug > Components: persistence > Reporter: Lynn Gallinat > Assignee: Darrel Schneider > Fix For: 1.4.0 > > > {noformat} > Backup can hang when createKrf cannot get the compactor lock. > "Pooled Message Processor 2" #196 daemon prio=10 os_prio=0 > tid=0x00007fef9801e000 nid=0x3cf3 waiting on condition [0x00007ff094cd5000] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00000000f1a72c60> (a > java.util.concurrent.locks.ReentrantLock$NonfairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) > at > java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209) > at > java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285) > at > org.apache.geode.internal.cache.Oplog.lockCompactor(Oplog.java:6046) > <============ trying to get compactor lock > at org.apache.geode.internal.cache.Oplog.createKrf(Oplog.java:4157) > <============ in createKrf > at org.apache.geode.internal.cache.Oplog.finishKrf(Oplog.java:7746) > at > org.apache.geode.internal.cache.BackupManager.backupOplog(BackupManager.java:580) > at > org.apache.geode.internal.cache.BackupManager.completeBackup(BackupManager.java:270) > at > org.apache.geode.internal.cache.BackupManager.doBackup(BackupManager.java:139) > at > org.apache.geode.admin.internal.FinishBackupRequest.createResponse(FinishBackupRequest.java:102) > at > org.apache.geode.internal.admin.remote.CliLegacyMessage.process(CliLegacyMessage.java:39) > at > org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:374) > at > org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:440) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at > org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:666) > at > org.apache.geode.distributed.internal.DistributionManager$4$1.run(DistributionManager.java:923) > at java.lang.Thread.run(Thread.java:748) > The compactor thread already has the compactor lock. > It is waiting for a DiskEntry lock. > "Idle OplogCompactor" #379 daemon prio=10 os_prio=0 tid=0x00007ff020026000 > nid=0x6916 waiting for monitor entry [0x00007ff08c7ce000] > java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.geode.internal.cache.Oplog.writeOneKeyEntryForKRF(Oplog.java:3924) > - waiting to lock <0x00000000f1c55c70> (a > org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapStringKey2) > <====== waiting for DiskEntry lock > at org.apache.geode.internal.cache.Oplog.createKrf(Oplog.java:4201) > <======= already in createKrf > - locked <0x00000000f1a72c30> (a > java.util.concurrent.atomic.AtomicBoolean) > at org.apache.geode.internal.cache.Oplog$2.run(Oplog.java:3875) > at > org.apache.geode.internal.cache.DiskStoreImpl$5.run(DiskStoreImpl.java:4386) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > The DiskEntry lock is held by this thread and is waiting for replies: > "PartitionedRegion Message Processor26" #238 daemon prio=10 os_prio=0 > tid=0x00007fef9c01f800 nid=0x3dd0 waiting on condition [0x00007ff08ebe9000] > java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00000000f1e42200> (a > java.util.concurrent.CountDownLatch$Sync) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) > at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) > at > org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:64) > at > org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:718) > at > org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:795) > at > org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:771) > at > org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:858) > at > org.apache.geode.internal.cache.DistributedCacheOperation.waitForAckIfNeeded(DistributedCacheOperation.java:761) > at > org.apache.geode.internal.cache.DistributedCacheOperation._distribute(DistributedCacheOperation.java:658) > at > org.apache.geode.internal.cache.DistributedCacheOperation.startOperation(DistributedCacheOperation.java:264) > at > org.apache.geode.internal.cache.BucketRegion.basicPutPart2(BucketRegion.java:665) > at > org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2839) > - locked <0x00000000f1c55c70> (a > org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapStringKey2) > <======= has the DiskEntryLock > at > org.apache.geode.internal.cache.BucketRegion.virtualPut(BucketRegion.java:502) > at > org.apache.geode.internal.cache.PartitionedRegionDataStore.putLocally(PartitionedRegionDataStore.java:1222) > at > org.apache.geode.internal.cache.PartitionedRegionDataStore.putLocally(PartitionedRegionDataStore.java:1205) > at > org.apache.geode.internal.cache.PartitionedRegionDataView.putEntryOnRemote(PartitionedRegionDataView.java:99) > at > org.apache.geode.internal.cache.partitioned.PutMessage.operateOnPartitionedRegion(PutMessage.java:744) > at > org.apache.geode.internal.cache.partitioned.PartitionMessage.process(PartitionMessage.java:333) > at > org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:374) > at > org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:440) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at > org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:666) > at > org.apache.geode.distributed.internal.DistributionManager$8$1.run(DistributionManager.java:1069) > at java.lang.Thread.run(Thread.java:748) > The member replying can't because it's waiting for the backup lock. > "P2P message reader for > rs-CommunicationsBTTest-2017-10-27-08-42-54-client-8(gemfire3_rs-CommunicationsBTTest-2017-10-27-08-42-54-client-8_14506:14506)<ec><v1>:1028 > shared ordered uid=6 port=40404" #152 daemon prio=10 os_prio=0 > tid=0x00007ff91800d800 nid=0x3bf3 waiting on condition [0x00007ffa29d4a000] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00000000f0b63ad0> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976) > at org.apache.geode.internal.cache.BackupLock.lock(BackupLock.java:95) > at org.apache.geode.internal.cache.Oplog.basicModify(Oplog.java:4612) > at org.apache.geode.internal.cache.Oplog.modify(Oplog.java:4450) > at > org.apache.geode.internal.cache.PersistentOplogSet.modify(PersistentOplogSet.java:188) > at > org.apache.geode.internal.cache.DiskStoreImpl.put(DiskStoreImpl.java:730) > at org.apache.geode.internal.cache.DiskRegion.put(DiskRegion.java:351) > at > org.apache.geode.internal.cache.entries.DiskEntry$Helper.writeBytesToDisk(DiskEntry.java:828) > at > org.apache.geode.internal.cache.entries.DiskEntry$Helper.basicUpdate(DiskEntry.java:932) > at > org.apache.geode.internal.cache.entries.DiskEntry$Helper.update(DiskEntry.java:859) > - locked <0x00000000f1610f70> (a > org.apache.geode.internal.cache.DiskId$PersistenceWithIntOffset) > at > org.apache.geode.internal.cache.entries.AbstractDiskRegionEntry.setValue(AbstractDiskRegionEntry.java:40) > at > org.apache.geode.internal.cache.entries.AbstractRegionEntry.setValueWithTombstoneCheck(AbstractRegionEntry.java:307) > at > org.apache.geode.internal.cache.EntryEventImpl.setNewValueInRegion(EntryEventImpl.java:1651) > at > org.apache.geode.internal.cache.EntryEventImpl.putExistingEntry(EntryEventImpl.java:1527) > at > org.apache.geode.internal.cache.AbstractRegionMap.updateEntry(AbstractRegionMap.java:2993) > at > org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2816) > - locked <0x00000000f1610f28> (a > org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapStringKey2) > at > org.apache.geode.internal.cache.BucketRegion.virtualPut(BucketRegion.java:502) > at > org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:152) > at > org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5584) > at > org.apache.geode.internal.cache.AbstractUpdateOperation.doPutOrCreate(AbstractUpdateOperation.java:165) > at > org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.basicOperateOnRegion(AbstractUpdateOperation.java:272) > at > org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.operateOnRegion(AbstractUpdateOperation.java:243) > at > org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.basicProcess(DistributedCacheOperation.java:1190) > at > org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.process(DistributedCacheOperation.java:1091) > at > org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:374) > at > org.apache.geode.distributed.internal.DistributionMessage.schedule(DistributionMessage.java:432) > at > org.apache.geode.distributed.internal.DistributionManager.scheduleIncomingMessage(DistributionManager.java:3552) > at > org.apache.geode.distributed.internal.DistributionManager.handleIncomingDMsg(DistributionManager.java:3186) > at > org.apache.geode.distributed.internal.DistributionManager$MyListener.messageReceived(DistributionManager.java:4361) > at > org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.dispatchMessage(GMSMembershipManager.java:1127) > at > org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.handleOrDeferMessage(GMSMembershipManager.java:1045) > at > org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager$MyDCReceiver.messageReceived(GMSMembershipManager.java:408) > at > org.apache.geode.distributed.internal.direct.DirectChannel.receive(DirectChannel.java:714) > at > org.apache.geode.internal.tcp.TCPConduit.messageReceived(TCPConduit.java:874) > at > org.apache.geode.internal.tcp.Connection.dispatchMessage(Connection.java:3966) > at > org.apache.geode.internal.tcp.Connection.processNIOBuffer(Connection.java:3552) > at > org.apache.geode.internal.tcp.Connection.runNioReader(Connection.java:1828) > at org.apache.geode.internal.tcp.Connection.run(Connection.java:1689) > at java.lang.Thread.run(Thread.java:748) > Note that the replying member is also the member that initiated the backup: > "vm_4_thr_43_client5_rs-CommunicationsBTTest-2017-10-27-08-42-54-client-8_14524" > #480 daemon prio=5 os_prio=0 tid=0x00007ff99c0b2800 nid=0x6911 waiting on > condition [0x00007ffa208cf000] > java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00000000f214ef08> (a > java.util.concurrent.CountDownLatch$Sync) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) > at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) > at > org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:77) > at > org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:694) > at > org.apache.geode.distributed.internal.ReplyProcessor21.waitForReplies(ReplyProcessor21.java:644) > at > org.apache.geode.distributed.internal.ReplyProcessor21.waitForReplies(ReplyProcessor21.java:624) > at > org.apache.geode.distributed.internal.ReplyProcessor21.waitForReplies(ReplyProcessor21.java:519) > at > org.apache.geode.admin.internal.FinishBackupRequest.send(FinishBackupRequest.java:80) > at > org.apache.geode.admin.internal.BackupDataStoreHelper.backupAllMembers(BackupDataStoreHelper.java:47) > at > org.apache.geode.internal.cache.BackupUtil.backupAllMembers(BackupUtil.java:50) > at > org.apache.geode.admin.internal.AdminDistributedSystemImpl.backupAllMembers(AdminDistributedSystemImpl.java:2315) > at > org.apache.geode.admin.internal.AdminDistributedSystemImpl.backupAllMembers(AdminDistributedSystemImpl.java:2310) > at > org.apache.geode.admin.internal.AdminDistributedSystemImpl.backupAllMembers(AdminDistributedSystemImpl.java:2299) > at util.PersistenceUtil.doOnlineBackup(PersistenceUtil.java:627) > at parReg.ParRegTest.doConcOpsAndVerify(ParRegTest.java:1822) > at > parReg.ParRegTest.HydraTask_doConcOpsAndVerify(ParRegTest.java:1011) > at sun.reflect.GeneratedMethodAccessor401.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at hydra.MethExecutor.execute(MethExecutor.java:182) > at hydra.MethExecutor.execute(MethExecutor.java:150) > at hydra.TestTask.execute(TestTask.java:191) > at hydra.RemoteTestModule$1.run(RemoteTestModule.java:212) > The replying member is still holding the backup lock because it waits for all > other members to reply to the FinishBackupRequest message > before it processes its backup and releases the lock: > From FinishBackupRequest.java > public static Map<DistributedMember, Set<PersistentID>> send(DM dm, Set > recipients, > File targetDir, File baselineDir, boolean abort) { > FinishBackupRequest request = new FinishBackupRequest(targetDir, > baselineDir, abort); > request.setRecipients(recipients); > FinishBackupReplyProcessor replyProcessor = new > FinishBackupReplyProcessor(dm, recipients); > request.msgId = replyProcessor.getProcessorId(); > dm.putOutgoing(request); > try { > replyProcessor.waitForReplies(); <======= wait for all other members > (but they won't finish until this member does its work; this is where we are > stuck) > } catch (ReplyException e) { > if (!(e.getCause() instanceof CancelException)) { > throw e; > } > } catch (InterruptedException e) { > e.printStackTrace(); > } > AdminResponse response = request.createResponse((DistributionManager) > dm); <====== this and the following lines do this member's work > response.setSender(dm.getDistributionManagerId()); > replyProcessor.process(response); > return replyProcessor.results; > } > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)