Bruce Schuchardt created GEODE-5155:
---------------------------------------
Summary: hang recovering transaction state for crashed server
Key: GEODE-5155
URL: https://issues.apache.org/jira/browse/GEODE-5155
Project: Geode
Issue Type: New Feature
Components: distributed lock service, transactions
Reporter: Bruce Schuchardt

A concourse job failed in DlockAndTxlockRegressionTest.testDLockProtectsAgainstTransactionConflict with two threads stuck in this state:

{noformat}
[vm2] "Pooled Waiting Message Processor 2" tid=0x71
[vm2]     java.lang.Thread.State: WAITING
[vm2]     at java.lang.Object.wait(Native Method)
[vm2]     - waiting on org.apache.geode.internal.cache.TXCommitMessage@2105ce6
[vm2]     at java.lang.Object.wait(Object.java:502)
[vm2]     at org.apache.geode.internal.cache.TXFarSideCMTracker.waitToProcess(TXFarSideCMTracker.java:176)
[vm2]     at org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage.processTXOriginatorRecoveryMessage(TXOriginatorRecoveryProcessor.java:160)
[vm2]     at org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage$1.run(TXOriginatorRecoveryProcessor.java:144)
[vm2]     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[vm2]     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[vm2]     at org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1121)
[vm2]     at org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109)
[vm2]     at org.apache.geode.distributed.internal.ClusterDistributionManager$6$1.run(ClusterDistributionManager.java:865)
[vm2]     at java.lang.Thread.run(Thread.java:748)
{noformat}

I modified the test to tighten up its forceDisconnect and performOps methods to get transaction recovery to happen more reliably.
{code}
public void forceDisconnect() throws Exception {
  Cache existingCache = basicGetCache();
  synchronized(commitLock) {
    committing = false;
    while (!committing) {
      commitLock.wait();
    }
  }
  if (existingCache != null && !existingCache.isClosed()) {
    DistributedTestUtils.crashDistributedSystem(getCache().getDistributedSystem());
  }
}

public void performOps() {
  Cache cache = getCache();
  Region region = cache.getRegion("TestRegion");
  DistributedLockService dlockService = DistributedLockService.getServiceNamed("Bulldog");
  Random random = new Random();

  while (!cache.isClosed()) {
    boolean locked = false;
    try {
      locked = dlockService.lock("testDLock", 500, 60_000);
      if (!locked) {
        // this could happen if we're starved out for 30sec by other VMs
        continue;
      }

      cache.getCacheTransactionManager().begin();

      region.put("TestKey", "TestValue" + random.nextInt(100000));

      TXManagerImpl mgr = (TXManagerImpl) getCache().getCacheTransactionManager();
      TXStateProxyImpl txProxy = (TXStateProxyImpl) mgr.getTXState();
      TXState txState = (TXState) txProxy.getRealDeal(null, null);
      txState.setBeforeSend(() -> {
        synchronized(commitLock) {
          committing = true;
          commitLock.notifyAll();
        }});

      try {
        cache.getCacheTransactionManager().commit();
      } catch (CommitConflictException e) {
        throw new RuntimeException("dlock failed to prevent a transaction conflict", e);
      }

      int txCount = getBlackboard().getMailbox(TRANSACTION_COUNT);
      getBlackboard().setMailbox(TRANSACTION_COUNT, txCount + 1);

    } catch (CancelException | IllegalStateException e) {
      // okay to ignore
    } finally {
      if (locked) {
        try {
          dlockService.unlock("testDLock");
        } catch (CancelException | IllegalStateException e) {
          // shutting down
        }
      }
    }
  }
}
{code}

The problem is that the membership listener in TXCommit

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
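
For readers who want to see the commitLock handshake from the modified test in isolation, here is a minimal standalone sketch with Geode removed. It is an illustration only: the class name CommitHandshakeSketch and the methods awaitNextCommit, signalCommitting, and demo are hypothetical and are not part of the test or of Geode. The waiting side resets the flag and blocks until the next signal, as forceDisconnect does; the signalling side sets the flag and notifies, as the setBeforeSend callback does.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, Geode-free illustration of the wait/notify handshake
// used by forceDisconnect and the setBeforeSend callback in the test.
class CommitHandshakeSketch {
  private final Object commitLock = new Object();
  private boolean committing; // guarded by commitLock

  // Waiting side (mirrors forceDisconnect): block until the NEXT signal.
  void awaitNextCommit() throws InterruptedException {
    synchronized (commitLock) {
      committing = false;     // discard any signal that arrived earlier
      while (!committing) {   // loop guards against spurious wakeups
        commitLock.wait();
      }
    }
  }

  // Signalling side (mirrors the setBeforeSend callback).
  void signalCommitting() {
    synchronized (commitLock) {
      committing = true;
      commitLock.notifyAll();
    }
  }

  // Runs one waiter against a signaller that keeps signalling, the way
  // performOps keeps committing in a loop. Returns true if the waiter
  // was released by a signal.
  static boolean demo() {
    CommitHandshakeSketch sketch = new CommitHandshakeSketch();
    AtomicBoolean released = new AtomicBoolean(false);

    Thread waiter = new Thread(() -> {
      try {
        sketch.awaitNextCommit();
        released.set(true);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    waiter.setDaemon(true);
    waiter.start();

    try {
      // A signal sent before the waiter resets the flag is discarded,
      // so keep signalling until the waiter thread has finished.
      while (waiter.isAlive()) {
        sketch.signalCommitting();
        waiter.join(10);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return released.get();
  }

  public static void main(String[] args) {
    System.out.println("waiter released: " + demo());
  }
}
```

Because the waiter clears the flag before waiting, it only proceeds on a signal raised after it started waiting. In the test this appears intended to make the crash coincide with a commit that is about to be sent, rather than with one that has already completed.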