[GitHub] [ignite] skorotkov commented on a diff in pull request #10178: IGNITE-17457 Fix cluster lock after tx recovery

GitBox Mon, 08 Aug 2022 06:39:28 -0700


skorotkov commented on code in PR #10178:
URL: https://github.com/apache/ignite/pull/10178#discussion_r940252931



##########
modules/core/src/test/java/org/apache/ignite/internal/processors/cache/transactions/TxRecoveryWithConcurrentRollbackTest.java:
##########
@@ -258,6 +263,120 @@ else if (g1Keys.contains(key))
         assertEquals(s1, s2);
     }
 
+    /**
+     * The test enforces the concurrent processing of the same prepared 
transaction
+     * both in the tx recovery procedure started due to primary node left and 
in the
+     * tx recovery request handler invoked by message from another backup node.
+     * <p>
+     * The idea is to have a 3-nodes cluster and a cache with 2 backups. So 
there
+     * will be 2 backup nodes to execute the tx recovery in parallel if 
primary one
+     * would fail. These backup nodes will send the tx recovery requests to 
each
+     * other, so the tx recovery request handler will be invoked as well.
+     * <p>
+     * Use several attempts to reproduce the race condition.
+     * <p>
+     * Expected result: transaction is finished on both backup nodes and the 
partition
+     * map exchange is completed as well.
+     */
+    @Test
+    public void testRecoveryNotDeadLockOnPrimaryFail() throws Exception {
+        backups = 2;
+        persistence = false;
+
+        for (int iter = 0; iter < 100; iter++) {

Review Comment:
   It's a probabilistic one indeed. 
   
   We have a very specific interleaving situation here in the concurrent calls 
of IgniteTxAdapter#markFinalizing. As described in the Jira ticket it is as 
follows:
   
   > The following steps are executed on the node1 in two threads ("procedure" 
which is a system pool thread
   > executing the tx recovery procedure and "handler" which is a striped pool 
thread processing the tx
   > recovery request sent from node0):
   
   > - tx.finalization == NONE
   > - "procedure": calls markFinalizing(RECOVERY_FINISH)
   > - "handler": calls markFinalizing(RECOVERY_FINISH)
   > - "procedure": gets old tx.finlalization - it's NONE
   > - "handler": gets old tx.finalization - it's NONE
   > - "handler": updates tx.finalization - now it's RECOVERY_FINISH
   > - "procedure": tries to update tx.finalization via compareAndSet and fails 
since compare fails.
   > - "procedure": stops transaction processing and does not try to commit it.
   > - Transaction remains not finished on node1.
   
   So to have the problem reproduced we need to be lucky to:
   1. Call markFinalizing at the same time from two threads.  Which is not 
precise, since both threads do some other work before calling the 
markFinalizing. And I fialed to find any other  sync points which may be 
exploited.
   2. Be lucky to have a particular order of operations in the markFinalizing.
   
   ***
   Anyway I did some experiments trying to reduce number of repetitions needed  
on weekend.  I added several dummy threads to increase thread scheduling 
interleavings following the hint suggested at 
https://github.com/code-review-checklists/java-concurrency#test-workers-interleavings
   
   Unfortunately it doesn't help.
   
   In the experiment I perform 100 test runs in each configuration (with and 
without dummy threads). And collect number of repetions needed to reproduce the 
problem.  It took 33 repetitions on average for both cases. And there are 
attempts which fail to reproduce the problem even by 100 repetions (5 and 6 
times respectevely). The detailed results are:
   
   
     | No dummy threads | 8 dummy threads
   -- | -- | --
   Mean | 33.32 | 32.36
   Median | 26.5 | 27
   First Quartile | 11 | 13
   Third Quartile | 47.25 | 45.25
   Standard Deviation | 27.92 | 26.01
   Range | 99 | 99
   Minimum | 1 | 1
   Maximum | 100 | 100
   Sum | 3332 | 3236
   Count | 100 | 100
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [ignite] skorotkov commented on a diff in pull request #10178: IGNITE-17457 Fix cluster lock after tx recovery

Reply via email to