[jira] [Created] (GEODE-5155) hang recovering transaction state for crashed server

Bruce Schuchardt (JIRA) Mon, 30 Apr 2018 13:31:39 -0700

Bruce Schuchardt created GEODE-5155:
---------------------------------------


             Summary: hang recovering transaction state for crashed server
                 Key: GEODE-5155
                 URL: https://issues.apache.org/jira/browse/GEODE-5155
             Project: Geode
          Issue Type: New Feature
          Components: distributed lock service, transactions
            Reporter: Bruce Schuchardt


A Concourse job failed in 
DlockAndTxlockRegressionTest.testDLockProtectsAgainstTransactionConflict, with 
two threads stuck in this state:

{noformat}
[vm2] "Pooled Waiting Message Processor 2" tid=0x71
[vm2]     java.lang.Thread.State: WAITING
[vm2]   at java.lang.Object.wait(Native Method)
[vm2]   -  waiting on org.apache.geode.internal.cache.TXCommitMessage@2105ce6
[vm2]   at java.lang.Object.wait(Object.java:502)
[vm2]   at org.apache.geode.internal.cache.TXFarSideCMTracker.waitToProcess(TXFarSideCMTracker.java:176)
[vm2]   at org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage.processTXOriginatorRecoveryMessage(TXOriginatorRecoveryProcessor.java:160)
[vm2]   at org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage$1.run(TXOriginatorRecoveryProcessor.java:144)
[vm2]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[vm2]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[vm2]   at org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1121)
[vm2]   at org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109)
[vm2]   at org.apache.geode.distributed.internal.ClusterDistributionManager$6$1.run(ClusterDistributionManager.java:865)
[vm2]   at java.lang.Thread.run(Thread.java:748)
{noformat}

I modified the test to tighten up its forceDisconnect and performOps methods 
so that transaction recovery happens more reliably.

{code}
  public void forceDisconnect() throws Exception {
    Cache existingCache = basicGetCache();
    synchronized(commitLock) {
      committing = false;
      while (!committing) {
        commitLock.wait();
      }
    }
    if (existingCache != null && !existingCache.isClosed()) {
      DistributedTestUtils.crashDistributedSystem(getCache().getDistributedSystem());
    }
  }


  public void performOps() {
    Cache cache = getCache();
    Region region = cache.getRegion("TestRegion");
    DistributedLockService dlockService =
        DistributedLockService.getServiceNamed("Bulldog");
    Random random = new Random();

    while (!cache.isClosed()) {
      boolean locked = false;
      try {
        locked = dlockService.lock("testDLock", 500, 60_000);
        if (!locked) {
          // this could happen if other VMs starve us out for the whole 500 ms wait
          continue;
        }

        cache.getCacheTransactionManager().begin();

        region.put("TestKey", "TestValue" + random.nextInt(100000));

        TXManagerImpl mgr =
            (TXManagerImpl) getCache().getCacheTransactionManager();
        TXStateProxyImpl txProxy = (TXStateProxyImpl) mgr.getTXState();
        TXState txState = (TXState) txProxy.getRealDeal(null, null);
        txState.setBeforeSend(() -> {
          synchronized (commitLock) {
            committing = true;
            commitLock.notifyAll();
          }
        });

        try {
          cache.getCacheTransactionManager().commit();
        } catch (CommitConflictException e) {
          throw new RuntimeException(
              "dlock failed to prevent a transaction conflict", e);
        }

        int txCount = getBlackboard().getMailbox(TRANSACTION_COUNT);
        getBlackboard().setMailbox(TRANSACTION_COUNT, txCount + 1);

      } catch (CancelException | IllegalStateException e) {
        // okay to ignore
      } finally {
        if (locked) {
          try {
            dlockService.unlock("testDLock");
          } catch (CancelException | IllegalStateException e) {
            // shutting down
          }
        }
      }
    }
  }
{code}
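
The handshake between forceDisconnect and the beforeSend hook can be looked at in isolation. What follows is only a minimal sketch, not the test's code: the class CommitHandshakeSketch and its methods signalCommitStarted, awaitNextCommit and demo are hypothetical names, and it uses no Geode classes. It also adds a timeout to the wait, which the test code above does not have, so the sketch cannot block forever if no commit ever starts.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-alone model of the commitLock/committing handshake. */
public class CommitHandshakeSketch {
  private final Object commitLock = new Object();
  private boolean committing = false; // guarded by commitLock

  /** Stand-in for the beforeSend hook: announces that a commit is about to be sent. */
  public void signalCommitStarted() {
    synchronized (commitLock) {
      committing = true;
      commitLock.notifyAll();
    }
  }

  /**
   * Stand-in for the wait in forceDisconnect: blocks until a commit starts
   * after this call. Returns false if none starts within the timeout
   * (the timeout is an addition of this sketch).
   */
  public boolean awaitNextCommit(long timeoutMillis) throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    synchronized (commitLock) {
      committing = false; // ignore commits that began before this call
      while (!committing) {
        long remainingNanos = deadline - System.nanoTime();
        if (remainingNanos <= 0) {
          return false;
        }
        TimeUnit.NANOSECONDS.timedWait(commitLock, remainingNanos);
      }
      return true;
    }
  }

  /** Runs a committer thread that signals repeatedly, as the performOps loop would. */
  public static boolean demo() throws InterruptedException {
    CommitHandshakeSketch sketch = new CommitHandshakeSketch();
    Thread committer = new Thread(() -> {
      while (!Thread.currentThread().isInterrupted()) {
        sketch.signalCommitStarted();
        try {
          Thread.sleep(10);
        } catch (InterruptedException e) {
          return; // asked to stop
        }
      }
    });
    committer.setDaemon(true);
    committer.start();
    boolean observed = sketch.awaitNextCommit(5_000);
    committer.interrupt();
    committer.join();
    return observed;
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("commit observed: " + demo());
  }
}
```

Resetting the flag inside the synchronized block before waiting is what makes the waiter see only a commit that starts after it began waiting, so the crash lands while a commit is in flight.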

The problem is that the membership listener in TXCommit



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
