skorotkov commented on code in PR #10178:
URL: https://github.com/apache/ignite/pull/10178#discussion_r940252931
##########
modules/core/src/test/java/org/apache/ignite/internal/processors/cache/transactions/TxRecoveryWithConcurrentRollbackTest.java:
##########
@@ -258,6 +263,120 @@ else if (g1Keys.contains(key))
assertEquals(s1, s2);
}
+ /**
+ * The test enforces the concurrent processing of the same prepared
transaction
+ * both in the tx recovery procedure started due to primary node left and
in the
+ * tx recovery request handler invoked by message from another backup node.
+ * <p>
+ * The idea is to have a 3-nodes cluster and a cache with 2 backups. So
there
+ * will be 2 backup nodes to execute the tx recovery in parallel if
primary one
+ * would fail. These backup nodes will send the tx recovery requests to
each
+ * other, so the tx recovery request handler will be invoked as well.
+ * <p>
+ * Use several attempts to reproduce the race condition.
+ * <p>
+ * Expected result: transaction is finished on both backup nodes and the
partition
+ * map exchange is completed as well.
+ */
+ @Test
+ public void testRecoveryNotDeadLockOnPrimaryFail() throws Exception {
+ backups = 2;
+ persistence = false;
+
+ for (int iter = 0; iter < 100; iter++) {
Review Comment:
It's a probabilistic one indeed.
We have a very specific interleaving situation here in the concurrent calls
of IgniteTxAdapter#markFinalizing. As described in the Jira ticket it is as
follows:
> The following steps are executed on the node1 in two threads ("procedure"
which is a system pool thread
> executing the tx recovery procedure and "handler" which is a striped pool
thread processing the tx
> recovery request sent from node0):
> - tx.finalization == NONE
> - "procedure": calls markFinalizing(RECOVERY_FINISH)
> - "handler": calls markFinalizing(RECOVERY_FINISH)
> - "procedure": gets old tx.finlalization - it's NONE
> - "handler": gets old tx.finalization - it's NONE
> - "handler": updates tx.finalization - now it's RECOVERY_FINISH
> - "procedure": tries to update tx.finalization via compareAndSet and fails
since compare fails.
> - "procedure": stops transaction processing and does not try to commit it.
> - Transaction remains not finished on node1.
So to have the problem reproduced we need to be lucky to:
1. Call markFinalizing at the same time from two threads. Which is not
precise, since both threads do some other work before calling the
markFinalizing. And I fialed to find any other sync points which may be
exploited.
2. Be lucky to have a particular order of operations in the markFinalizing.
***
Anyway I did some experiments trying to reduce number of repetitions needed
on weekend. I added several dummy threads to increase thread scheduling
interleavings following the hint suggested at
https://github.com/code-review-checklists/java-concurrency#test-workers-interleavings
Unfortunately it doesn't help.
In the experiment I perform 100 test runs in each configuration (with and
without dummy threads). And collect number of repetions needed to reproduce the
problem. It took 33 repetitions on average for both cases. And there are
attempts which fail to reproduce the problem even by 100 repetions (5 and 6
times respectevely). The detailed results are:
| No dummy threads | 8 dummy threads
-- | -- | --
Mean | 33.32 | 32.36
Median | 26.5 | 27
First Quartile | 11 | 13
Third Quartile | 47.25 | 45.25
Standard Deviation | 27.92 | 26.01
Range | 99 | 99
Minimum | 1 | 1
Maximum | 100 | 100
Sum | 3332 | 3236
Count | 100 | 100
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]