[
https://issues.apache.org/jira/browse/GEODE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bruce Schuchardt resolved GEODE-5155.
-------------------------------------
Resolution: Fixed
Fix Version/s: 1.7.0
> hang recovering transaction state for crashed server
> ----------------------------------------------------
>
> Key: GEODE-5155
> URL: https://issues.apache.org/jira/browse/GEODE-5155
> Project: Geode
> Issue Type: Bug
> Components: distributed lock service, transactions
> Affects Versions: 1.7.0
> Reporter: Bruce Schuchardt
> Assignee: Bruce Schuchardt
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.7.0
>
> Time Spent: 40m
> Remaining Estimate: 0h
>
> A concourse job failed in
> DlockAndTxlockRegressionTest.testDLockProtectsAgainstTransactionConflict with
> two threads stuck in this state:
> {noformat}
> [vm2] "Pooled Waiting Message Processor 2" tid=0x71
> [vm2] java.lang.Thread.State: WAITING
> [vm2] at java.lang.Object.wait(Native Method)
> [vm2] - waiting on org.apache.geode.internal.cache.TXCommitMessage@2105ce6
> [vm2] at java.lang.Object.wait(Object.java:502)
> [vm2] at org.apache.geode.internal.cache.TXFarSideCMTracker.waitToProcess(TXFarSideCMTracker.java:176)
> [vm2] at org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage.processTXOriginatorRecoveryMessage(TXOriginatorRecoveryProcessor.java:160)
> [vm2] at org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage$1.run(TXOriginatorRecoveryProcessor.java:144)
> [vm2] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [vm2] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [vm2] at org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1121)
> [vm2] at org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109)
> [vm2] at org.apache.geode.distributed.internal.ClusterDistributionManager$6$1.run(ClusterDistributionManager.java:865)
> [vm2] at java.lang.Thread.run(Thread.java:748)
> {noformat}
> I modified the test to tighten up its forceDisconnect and performOps methods
> so that transaction recovery happens more reliably.
> {code}
> public void forceDisconnect() throws Exception {
>   Cache existingCache = basicGetCache();
>   synchronized (commitLock) {
>     committing = false;
>     while (!committing) {
>       commitLock.wait();
>     }
>   }
>   if (existingCache != null && !existingCache.isClosed()) {
>     DistributedTestUtils.crashDistributedSystem(getCache().getDistributedSystem());
>   }
> }
>
> public void performOps() {
>   Cache cache = getCache();
>   Region region = cache.getRegion("TestRegion");
>   DistributedLockService dlockService = DistributedLockService.getServiceNamed("Bulldog");
>   Random random = new Random();
>   while (!cache.isClosed()) {
>     boolean locked = false;
>     try {
>       locked = dlockService.lock("testDLock", 500, 60_000);
>       if (!locked) {
>         // this could happen if we're starved out for 30sec by other VMs
>         continue;
>       }
>       cache.getCacheTransactionManager().begin();
>       region.put("TestKey", "TestValue" + random.nextInt(100000));
>       TXManagerImpl mgr = (TXManagerImpl) getCache().getCacheTransactionManager();
>       TXStateProxyImpl txProxy = (TXStateProxyImpl) mgr.getTXState();
>       TXState txState = (TXState) txProxy.getRealDeal(null, null);
>       txState.setBeforeSend(() -> {
>         synchronized (commitLock) {
>           committing = true;
>           commitLock.notifyAll();
>         }
>       });
>       try {
>         cache.getCacheTransactionManager().commit();
>       } catch (CommitConflictException e) {
>         throw new RuntimeException("dlock failed to prevent a transaction conflict", e);
>       }
>       int txCount = getBlackboard().getMailbox(TRANSACTION_COUNT);
>       getBlackboard().setMailbox(TRANSACTION_COUNT, txCount + 1);
>     } catch (CancelException | IllegalStateException e) {
>       // okay to ignore
>     } finally {
>       if (locked) {
>         try {
>           dlockService.unlock("testDLock");
>         } catch (CancelException | IllegalStateException e) {
>           // shutting down
>         }
>       }
>     }
>   }
> }
> {code}
> The problem is that the membership listener in TXCommitMessage removes the
> commit message from the transaction map in TXFarSideCMTracker without setting
> any state that the recovery message can check. The recovery method waits like
> this:
> {code}
> synchronized (this.txInProgress) {
>   mess = (TXCommitMessage) this.txInProgress.get(lk);
> }
> if (mess != null) {
>   synchronized (mess) {
>     // tx in progress, we must wait until its done
>     while (!mess.wasProcessed()) {
>       try {
>         mess.wait();
>       } catch (InterruptedException ie) {
>         Thread.currentThread().interrupt();
>         logger.error(LocalizedMessage.create(
>             LocalizedStrings.TxFarSideTracker_WAITING_TO_COMPLETE_ON_MESSAGE_0_CAUGHT_AN_INTERRUPTED_EXCEPTION,
>             mess), ie);
>         break;
>       }
>     }
>   }
> }
> {code}
> We could probably change this method to check whether the message is still in
> the map instead of relying only on wasProcessed().
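> Below is a minimal sketch of that idea, not the committed fix, assuming a
> hypothetical isStillTracked helper on TXFarSideCMTracker: the loop uses a
> bounded wait and exits once the tracker no longer maps the lock id to this
> message, which is the state left behind when the membership listener removes
> the message for a departed originator.
> {code}
> if (mess != null) {
>   synchronized (mess) {
>     // wait until the commit is applied, or until the tracker has dropped the
>     // message (e.g. the membership listener removed it for a departed member)
>     while (!mess.wasProcessed() && isStillTracked(lk, mess)) {
>       try {
>         mess.wait(100); // bounded wait so the map is re-checked periodically
>       } catch (InterruptedException ie) {
>         Thread.currentThread().interrupt();
>         break;
>       }
>     }
>   }
> }
>
> // hypothetical helper: true only while the tracker still maps lk to this message
> private boolean isStillTracked(Object lk, TXCommitMessage mess) {
>   synchronized (this.txInProgress) {
>     return this.txInProgress.get(lk) == mess;
>   }
> }
> {code}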
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)