[
https://issues.apache.org/jira/browse/GEODE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244535#comment-16244535
]
ASF GitHub Bot commented on GEODE-3940:
---------------------------------------
kirklund opened a new pull request #1038: GEODE-3940: fix deadlock in backup
messages
URL: https://github.com/apache/geode/pull/1038
There are 3 separate commits in this PR. 1) reintroduce Nick's commit for
GEODE-3801, 2) Lynn's fix for deadlock in backup messages, 3) prevent logger
warnings when sender processes the backup messages.
commit d55e6d6cae348fedb9f1341a9abde2096be3bead (HEAD -> feature/GEODE-3940,
origin/feature/GEODE-3940, kirklund-fork/feature/GEODE-3940)
Author: Kirk Lund <[email protected]>
Date: Wed Nov 8 10:25:07 2017 -0800
GEODE-3940: prevent warning when processing message by sender
Prevents logging warning when processing these message types by sender:
* FinishBackupRequest, FlushToDiskRequest, PrepareBackupRequest
commit 81ff81da9fc121607b35aca15d99e24cf0783e7f
Author: Lynn Gallinat <[email protected]>
Date: Fri Nov 3 17:09:42 2017 -0700
GEODE-3940: fix deadlock in backup messages
* introduce unit tests for backup messages
* add unit test for BackupLock
commit 080730b2df5c7ec0a17acd195a5b953dfc6a0d5c
Author: Nick Reich <[email protected]>
Date: Tue Oct 24 08:24:42 2017 -0700
GEODE-3801: Use hardlinks for backup oplog files (#963)
* Oplog files that are backed up are read-only. For this reason,
they can be transfered to the backup location through hard links,
instead of copying the file. This change improves speed of backups.
If the creation of a hard link fails, we revert to the existing copy
behavior.
* During backups, the copying of the oplog's krf file was being done
while
that file could still be in the process of writing. This change ensures
that if a krf is to be written, that it is finished and included in the
backup
* cleanup existing oplog tests
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Backup can hang while trying to get a lock
> -------------------------------------------
>
> Key: GEODE-3940
> URL: https://issues.apache.org/jira/browse/GEODE-3940
> Project: Geode
> Issue Type: Bug
> Components: persistence
> Reporter: Lynn Gallinat
> Assignee: Kirk Lund
> Priority: Trivial
>
> {noformat}
> Backup can hang when createKrf cannot get the compactor lock.
> "Pooled Message Processor 2" #196 daemon prio=10 os_prio=0
> tid=0x00007fef9801e000 nid=0x3cf3 waiting on condition [0x00007ff094cd5000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000000f1a72c60> (a
> java.util.concurrent.locks.ReentrantLock$NonfairSync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
> at
> java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
> at
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
> at
> org.apache.geode.internal.cache.Oplog.lockCompactor(Oplog.java:6046)
> <============ trying to get compactor lock
> at org.apache.geode.internal.cache.Oplog.createKrf(Oplog.java:4157)
> <============ in createKrf
> at org.apache.geode.internal.cache.Oplog.finishKrf(Oplog.java:7746)
> at
> org.apache.geode.internal.cache.BackupManager.backupOplog(BackupManager.java:580)
> at
> org.apache.geode.internal.cache.BackupManager.completeBackup(BackupManager.java:270)
> at
> org.apache.geode.internal.cache.BackupManager.doBackup(BackupManager.java:139)
> at
> org.apache.geode.admin.internal.FinishBackupRequest.createResponse(FinishBackupRequest.java:102)
> at
> org.apache.geode.internal.admin.remote.CliLegacyMessage.process(CliLegacyMessage.java:39)
> at
> org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:374)
> at
> org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:440)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at
> org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:666)
> at
> org.apache.geode.distributed.internal.DistributionManager$4$1.run(DistributionManager.java:923)
> at java.lang.Thread.run(Thread.java:748)
> The compactor thread already has the compactor lock.
> It is waiting for a DiskEntry lock.
> "Idle OplogCompactor" #379 daemon prio=10 os_prio=0 tid=0x00007ff020026000
> nid=0x6916 waiting for monitor entry [0x00007ff08c7ce000]
> java.lang.Thread.State: BLOCKED (on object monitor)
> at
> org.apache.geode.internal.cache.Oplog.writeOneKeyEntryForKRF(Oplog.java:3924)
> - waiting to lock <0x00000000f1c55c70> (a
> org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapStringKey2)
> <====== waiting for DiskEntry lock
> at org.apache.geode.internal.cache.Oplog.createKrf(Oplog.java:4201)
> <======= already in createKrf
> - locked <0x00000000f1a72c30> (a
> java.util.concurrent.atomic.AtomicBoolean)
> at org.apache.geode.internal.cache.Oplog$2.run(Oplog.java:3875)
> at
> org.apache.geode.internal.cache.DiskStoreImpl$5.run(DiskStoreImpl.java:4386)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> The DiskEntry lock is held by this thread and is waiting for replies:
> "PartitionedRegion Message Processor26" #238 daemon prio=10 os_prio=0
> tid=0x00007fef9c01f800 nid=0x3dd0 waiting on condition [0x00007ff08ebe9000]
> java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000000f1e42200> (a
> java.util.concurrent.CountDownLatch$Sync)
> at
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
> at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
> at
> org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:64)
> at
> org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:718)
> at
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:795)
> at
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:771)
> at
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:858)
> at
> org.apache.geode.internal.cache.DistributedCacheOperation.waitForAckIfNeeded(DistributedCacheOperation.java:761)
> at
> org.apache.geode.internal.cache.DistributedCacheOperation._distribute(DistributedCacheOperation.java:658)
> at
> org.apache.geode.internal.cache.DistributedCacheOperation.startOperation(DistributedCacheOperation.java:264)
> at
> org.apache.geode.internal.cache.BucketRegion.basicPutPart2(BucketRegion.java:665)
> at
> org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2839)
> - locked <0x00000000f1c55c70> (a
> org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapStringKey2)
> <======= has the DiskEntryLock
> at
> org.apache.geode.internal.cache.BucketRegion.virtualPut(BucketRegion.java:502)
> at
> org.apache.geode.internal.cache.PartitionedRegionDataStore.putLocally(PartitionedRegionDataStore.java:1222)
> at
> org.apache.geode.internal.cache.PartitionedRegionDataStore.putLocally(PartitionedRegionDataStore.java:1205)
> at
> org.apache.geode.internal.cache.PartitionedRegionDataView.putEntryOnRemote(PartitionedRegionDataView.java:99)
> at
> org.apache.geode.internal.cache.partitioned.PutMessage.operateOnPartitionedRegion(PutMessage.java:744)
> at
> org.apache.geode.internal.cache.partitioned.PartitionMessage.process(PartitionMessage.java:333)
> at
> org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:374)
> at
> org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:440)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at
> org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:666)
> at
> org.apache.geode.distributed.internal.DistributionManager$8$1.run(DistributionManager.java:1069)
> at java.lang.Thread.run(Thread.java:748)
> The member replying can't because it's waiting for the backup lock.
> "P2P message reader for
> rs-CommunicationsBTTest-2017-10-27-08-42-54-client-8(gemfire3_rs-CommunicationsBTTest-2017-10-27-08-42-54-client-8_14506:14506)<ec><v1>:1028
> shared ordered uid=6 port=40404" #152 daemon prio=10 os_prio=0
> tid=0x00007ff91800d800 nid=0x3bf3 waiting on condition [0x00007ffa29d4a000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000000f0b63ad0> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
> at org.apache.geode.internal.cache.BackupLock.lock(BackupLock.java:95)
> at org.apache.geode.internal.cache.Oplog.basicModify(Oplog.java:4612)
> at org.apache.geode.internal.cache.Oplog.modify(Oplog.java:4450)
> at
> org.apache.geode.internal.cache.PersistentOplogSet.modify(PersistentOplogSet.java:188)
> at
> org.apache.geode.internal.cache.DiskStoreImpl.put(DiskStoreImpl.java:730)
> at org.apache.geode.internal.cache.DiskRegion.put(DiskRegion.java:351)
> at
> org.apache.geode.internal.cache.entries.DiskEntry$Helper.writeBytesToDisk(DiskEntry.java:828)
> at
> org.apache.geode.internal.cache.entries.DiskEntry$Helper.basicUpdate(DiskEntry.java:932)
> at
> org.apache.geode.internal.cache.entries.DiskEntry$Helper.update(DiskEntry.java:859)
> - locked <0x00000000f1610f70> (a
> org.apache.geode.internal.cache.DiskId$PersistenceWithIntOffset)
> at
> org.apache.geode.internal.cache.entries.AbstractDiskRegionEntry.setValue(AbstractDiskRegionEntry.java:40)
> at
> org.apache.geode.internal.cache.entries.AbstractRegionEntry.setValueWithTombstoneCheck(AbstractRegionEntry.java:307)
> at
> org.apache.geode.internal.cache.EntryEventImpl.setNewValueInRegion(EntryEventImpl.java:1651)
> at
> org.apache.geode.internal.cache.EntryEventImpl.putExistingEntry(EntryEventImpl.java:1527)
> at
> org.apache.geode.internal.cache.AbstractRegionMap.updateEntry(AbstractRegionMap.java:2993)
> at
> org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2816)
> - locked <0x00000000f1610f28> (a
> org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapStringKey2)
> at
> org.apache.geode.internal.cache.BucketRegion.virtualPut(BucketRegion.java:502)
> at
> org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:152)
> at
> org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5584)
> at
> org.apache.geode.internal.cache.AbstractUpdateOperation.doPutOrCreate(AbstractUpdateOperation.java:165)
> at
> org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.basicOperateOnRegion(AbstractUpdateOperation.java:272)
> at
> org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.operateOnRegion(AbstractUpdateOperation.java:243)
> at
> org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.basicProcess(DistributedCacheOperation.java:1190)
> at
> org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.process(DistributedCacheOperation.java:1091)
> at
> org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:374)
> at
> org.apache.geode.distributed.internal.DistributionMessage.schedule(DistributionMessage.java:432)
> at
> org.apache.geode.distributed.internal.DistributionManager.scheduleIncomingMessage(DistributionManager.java:3552)
> at
> org.apache.geode.distributed.internal.DistributionManager.handleIncomingDMsg(DistributionManager.java:3186)
> at
> org.apache.geode.distributed.internal.DistributionManager$MyListener.messageReceived(DistributionManager.java:4361)
> at
> org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.dispatchMessage(GMSMembershipManager.java:1127)
> at
> org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.handleOrDeferMessage(GMSMembershipManager.java:1045)
> at
> org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager$MyDCReceiver.messageReceived(GMSMembershipManager.java:408)
> at
> org.apache.geode.distributed.internal.direct.DirectChannel.receive(DirectChannel.java:714)
> at
> org.apache.geode.internal.tcp.TCPConduit.messageReceived(TCPConduit.java:874)
> at
> org.apache.geode.internal.tcp.Connection.dispatchMessage(Connection.java:3966)
> at
> org.apache.geode.internal.tcp.Connection.processNIOBuffer(Connection.java:3552)
> at
> org.apache.geode.internal.tcp.Connection.runNioReader(Connection.java:1828)
> at org.apache.geode.internal.tcp.Connection.run(Connection.java:1689)
> at java.lang.Thread.run(Thread.java:748)
> Note that the replying member is also the member that initiated the backup:
> "vm_4_thr_43_client5_rs-CommunicationsBTTest-2017-10-27-08-42-54-client-8_14524"
> #480 daemon prio=5 os_prio=0 tid=0x00007ff99c0b2800 nid=0x6911 waiting on
> condition [0x00007ffa208cf000]
> java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000000f214ef08> (a
> java.util.concurrent.CountDownLatch$Sync)
> at
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
> at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
> at
> org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:77)
> at
> org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:694)
> at
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForReplies(ReplyProcessor21.java:644)
> at
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForReplies(ReplyProcessor21.java:624)
> at
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForReplies(ReplyProcessor21.java:519)
> at
> org.apache.geode.admin.internal.FinishBackupRequest.send(FinishBackupRequest.java:80)
> at
> org.apache.geode.admin.internal.BackupDataStoreHelper.backupAllMembers(BackupDataStoreHelper.java:47)
> at
> org.apache.geode.internal.cache.BackupUtil.backupAllMembers(BackupUtil.java:50)
> at
> org.apache.geode.admin.internal.AdminDistributedSystemImpl.backupAllMembers(AdminDistributedSystemImpl.java:2315)
> at
> org.apache.geode.admin.internal.AdminDistributedSystemImpl.backupAllMembers(AdminDistributedSystemImpl.java:2310)
> at
> org.apache.geode.admin.internal.AdminDistributedSystemImpl.backupAllMembers(AdminDistributedSystemImpl.java:2299)
> at util.PersistenceUtil.doOnlineBackup(PersistenceUtil.java:627)
> at parReg.ParRegTest.doConcOpsAndVerify(ParRegTest.java:1822)
> at
> parReg.ParRegTest.HydraTask_doConcOpsAndVerify(ParRegTest.java:1011)
> at sun.reflect.GeneratedMethodAccessor401.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at hydra.MethExecutor.execute(MethExecutor.java:182)
> at hydra.MethExecutor.execute(MethExecutor.java:150)
> at hydra.TestTask.execute(TestTask.java:191)
> at hydra.RemoteTestModule$1.run(RemoteTestModule.java:212)
> The replying member is still holding the backup lock because it waits for all
> other members to reply to the FinishBackupRequest message
> before it processes its backup and releases the lock:
> From FinishBackupRequest.java
> public static Map<DistributedMember, Set<PersistentID>> send(DM dm, Set
> recipients,
> File targetDir, File baselineDir, boolean abort) {
> FinishBackupRequest request = new FinishBackupRequest(targetDir,
> baselineDir, abort);
> request.setRecipients(recipients);
> FinishBackupReplyProcessor replyProcessor = new
> FinishBackupReplyProcessor(dm, recipients);
> request.msgId = replyProcessor.getProcessorId();
> dm.putOutgoing(request);
> try {
> replyProcessor.waitForReplies(); <======= wait for all other members
> (but they won't finish until this member does its work; this is where we are
> stuck)
> } catch (ReplyException e) {
> if (!(e.getCause() instanceof CancelException)) {
> throw e;
> }
> } catch (InterruptedException e) {
> e.printStackTrace();
> }
> AdminResponse response = request.createResponse((DistributionManager)
> dm); <====== this and the following lines do this member's work
> response.setSender(dm.getDistributionManagerId());
> replyProcessor.process(response);
> return replyProcessor.results;
> }
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)