anton-vinogradov commented on code in PR #10178:
URL: https://github.com/apache/ignite/pull/10178#discussion_r940123476


##########
modules/core/src/test/java/org/apache/ignite/internal/processors/cache/transactions/TxRecoveryWithConcurrentRollbackTest.java:
##########
@@ -258,6 +263,120 @@ else if (g1Keys.contains(key))
         assertEquals(s1, s2);
     }
 
+    /**
+     * The test enforces the concurrent processing of the same prepared transaction
+     * both in the tx recovery procedure started due to primary node left and in the
+     * tx recovery request handler invoked by message from another backup node.
+     * <p>
+     * The idea is to have a 3-nodes cluster and a cache with 2 backups. So there
+     * will be 2 backup nodes to execute the tx recovery in parallel if primary one
+     * would fail. These backup nodes will send the tx recovery requests to each
+     * other, so the tx recovery request handler will be invoked as well.
+     * <p>
+     * Use several attempts to reproduce the race condition.
+     * <p>
+     * Expected result: transaction is finished on both backup nodes and the partition
+     * map exchange is completed as well.
+     */
+    @Test
+    public void testRecoveryNotDeadLockOnPrimaryFail() throws Exception {
+        backups = 2;
+        persistence = false;
+
+        for (int iter = 0; iter < 100; iter++) {
+            final IgniteEx grid0 = startGrid(0);
+
+            final IgniteEx grid1 = startGrid(1, (UnaryOperator<IgniteConfiguration>)cfg -> cfg
+                .setSystemThreadPoolSize(1).setStripedPoolSize(1));
+
+            final IgniteEx grid2 = startGrid(2);
+
+            grid0.cluster().state(ACTIVE);
+
+            final IgniteCache<Object, Object> cache = grid2.cache(DEFAULT_CACHE_NAME);
+
+            final Transaction tx = grid2.transactions().txStart(PESSIMISTIC, REPEATABLE_READ);
+
+            final Integer g2Key = primaryKeys(cache, 1, 0).get(0);
+
+            cache.put(g2Key, Boolean.TRUE);
+
+            final TransactionProxyImpl<?, ?> p = (TransactionProxyImpl<?, ?>)tx;
+
+            p.tx().prepare(true);
+
+            final List<IgniteInternalTx> txs0 = txs(grid0);
+            final List<IgniteInternalTx> txs1 = txs(grid1);
+            final List<IgniteInternalTx> txs2 = txs(grid2);
+
+            assertTrue(txs0.size() == 1);
+            assertTrue(txs1.size() == 1);
+            assertTrue(txs2.size() == 1);
+
+            final CountDownLatch grid1NodeLeftEventLatch = new CountDownLatch(1);
+
+            grid1.events().localListen(new PE() {
+                @Override public boolean apply(Event evt) {
+                    grid1NodeLeftEventLatch.countDown();
+
+                    return true;
+                }
+            }, EventType.EVT_NODE_LEFT);
+
+            final CountDownLatch grid1BlockLatch = new CountDownLatch(1);
+
+            // Block recovery procedure processing on grid1.
+            grid1.context().pools().getSystemExecutorService().execute(() -> U.awaitQuiet(grid1BlockLatch));
+
+            final int stripe = U.safeAbs(p.tx().xidVersion().hashCode());
+
+            // Block stripe tx recovery request processing on grid1.
+            grid1.context().pools().getStripedExecutorService().execute(stripe, () -> U.awaitQuiet(grid1BlockLatch));
+
+            // Prevent finish request processing on grid0.
+            spi(grid2).blockMessages(GridDhtTxFinishRequest.class, grid0.name());
+
+            // Prevent finish request processing on grid1.
+            spi(grid2).blockMessages(GridDhtTxFinishRequest.class, grid1.name());
+
+            runAsync(() -> {
+                grid2.close();
+
+                return null;
+            });
+
+            try {
+                tx.close();
+            }
+            catch (Exception ignored) {
+                // Don't bother if the transaction close throws in case grid2 appear to be stopping or stopped already
+                // for this thread.
+            }
+
+            // Wait until grid1 node detects primary node left.
+            grid1NodeLeftEventLatch.await();
+
+            // Wait until grid1 receives the tx recovery request and the corresponding processing task is added into
+            // the queue.
+            assertTrue("tx recovery request received on grid1", GridTestUtils.waitForCondition(() -> grid1.context()

Review Comment:
   The assert message is printed when the assert fails, so it should describe that failure situation rather than the expected state.
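   For illustration only (a hypothetical rewording, not part of the patch; the condition is a placeholder standing in for the original check on grid1):

   ```java
   // Sketch: the message is shown only when waitForCondition() returns false,
   // so it describes what did not happen rather than the expected state.
   assertTrue("tx recovery request was not received on grid1 in time",
       GridTestUtils.waitForCondition(recoveryRequestReceived::get, getTestTimeout()));
   ```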



##########
modules/core/src/test/java/org/apache/ignite/internal/processors/cache/transactions/TxRecoveryWithConcurrentRollbackTest.java:
##########
@@ -258,6 +263,120 @@ else if (g1Keys.contains(key))
         assertEquals(s1, s2);
     }
 
+    /**
+     * The test enforces the concurrent processing of the same prepared transaction
+     * both in the tx recovery procedure started due to primary node left and in the
+     * tx recovery request handler invoked by message from another backup node.
+     * <p>
+     * The idea is to have a 3-nodes cluster and a cache with 2 backups. So there
+     * will be 2 backup nodes to execute the tx recovery in parallel if primary one
+     * would fail. These backup nodes will send the tx recovery requests to each
+     * other, so the tx recovery request handler will be invoked as well.
+     * <p>
+     * Use several attempts to reproduce the race condition.
+     * <p>
+     * Expected result: transaction is finished on both backup nodes and the partition
+     * map exchange is completed as well.
+     */
+    @Test
+    public void testRecoveryNotDeadLockOnPrimaryFail() throws Exception {
+        backups = 2;
+        persistence = false;
+
+        for (int iter = 0; iter < 100; iter++) {
+            final IgniteEx grid0 = startGrid(0);
+
+            final IgniteEx grid1 = startGrid(1, (UnaryOperator<IgniteConfiguration>)cfg -> cfg
+                .setSystemThreadPoolSize(1).setStripedPoolSize(1));
+
+            final IgniteEx grid2 = startGrid(2);
+
+            grid0.cluster().state(ACTIVE);
+
+            final IgniteCache<Object, Object> cache = grid2.cache(DEFAULT_CACHE_NAME);
+
+            final Transaction tx = grid2.transactions().txStart(PESSIMISTIC, REPEATABLE_READ);
+
+            final Integer g2Key = primaryKeys(cache, 1, 0).get(0);

Review Comment:
   What is `g2Key`? Could it be renamed according to its role?
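   For example (an illustrative rename only, assuming the key's role is "primary on grid2, the node that is later stopped"):

   ```java
   // Illustrative rename: the key for which grid2 (the node to be stopped) is primary.
   final Integer keyPrimaryOnStoppedNode = primaryKeys(cache, 1, 0).get(0);

   cache.put(keyPrimaryOnStoppedNode, Boolean.TRUE);
   ```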



##########
modules/core/src/test/java/org/apache/ignite/internal/processors/cache/transactions/TxRecoveryWithConcurrentRollbackTest.java:
##########
@@ -258,6 +263,120 @@ else if (g1Keys.contains(key))
         assertEquals(s1, s2);
     }
 
+    /**
+     * The test enforces the concurrent processing of the same prepared transaction
+     * both in the tx recovery procedure started due to primary node left and in the
+     * tx recovery request handler invoked by message from another backup node.
+     * <p>
+     * The idea is to have a 3-nodes cluster and a cache with 2 backups. So there
+     * will be 2 backup nodes to execute the tx recovery in parallel if primary one
+     * would fail. These backup nodes will send the tx recovery requests to each
+     * other, so the tx recovery request handler will be invoked as well.
+     * <p>
+     * Use several attempts to reproduce the race condition.
+     * <p>
+     * Expected result: transaction is finished on both backup nodes and the partition
+     * map exchange is completed as well.
+     */
+    @Test
+    public void testRecoveryNotDeadLockOnPrimaryFail() throws Exception {
+        backups = 2;
+        persistence = false;
+
+        for (int iter = 0; iter < 100; iter++) {
+            final IgniteEx grid0 = startGrid(0);
+
+            final IgniteEx grid1 = startGrid(1, (UnaryOperator<IgniteConfiguration>)cfg -> cfg
+                .setSystemThreadPoolSize(1).setStripedPoolSize(1));
+
+            final IgniteEx grid2 = startGrid(2);
+
+            grid0.cluster().state(ACTIVE);
+
+            final IgniteCache<Object, Object> cache = grid2.cache(DEFAULT_CACHE_NAME);
+
+            final Transaction tx = grid2.transactions().txStart(PESSIMISTIC, REPEATABLE_READ);
+
+            final Integer g2Key = primaryKeys(cache, 1, 0).get(0);
+
+            cache.put(g2Key, Boolean.TRUE);
+
+            final TransactionProxyImpl<?, ?> p = (TransactionProxyImpl<?, ?>)tx;
+
+            p.tx().prepare(true);
+
+            final List<IgniteInternalTx> txs0 = txs(grid0);
+            final List<IgniteInternalTx> txs1 = txs(grid1);
+            final List<IgniteInternalTx> txs2 = txs(grid2);
+
+            assertTrue(txs0.size() == 1);
+            assertTrue(txs1.size() == 1);
+            assertTrue(txs2.size() == 1);
+
+            final CountDownLatch grid1NodeLeftEventLatch = new CountDownLatch(1);
+
+            grid1.events().localListen(new PE() {
+                @Override public boolean apply(Event evt) {
+                    grid1NodeLeftEventLatch.countDown();
+
+                    return true;
+                }
+            }, EventType.EVT_NODE_LEFT);
+
+            final CountDownLatch grid1BlockLatch = new CountDownLatch(1);
+
+            // Block recovery procedure processing on grid1.
+            grid1.context().pools().getSystemExecutorService().execute(() -> U.awaitQuiet(grid1BlockLatch));
+
+            final int stripe = U.safeAbs(p.tx().xidVersion().hashCode());
+
+            // Block stripe tx recovery request processing on grid1.
+            grid1.context().pools().getStripedExecutorService().execute(stripe, () -> U.awaitQuiet(grid1BlockLatch));
+
+            // Prevent finish request processing on grid0.
+            spi(grid2).blockMessages(GridDhtTxFinishRequest.class, grid0.name());
+
+            // Prevent finish request processing on grid1.
+            spi(grid2).blockMessages(GridDhtTxFinishRequest.class, grid1.name());
+
+            runAsync(() -> {
+                grid2.close();
+
+                return null;
+            });
+
+            try {
+                tx.close();
+            }
+            catch (Exception ignored) {
+                // Don't bother if the transaction close throws in case grid2 appear to be stopping or stopped already
+                // for this thread.
+            }

Review Comment:
   Could we avoid doing things we don't bother about?
   Do we really need to close the tx here?
   Could we refactor this somehow to avoid catching and ignoring the exception?
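   If closing the tx is kept, one hypothetical shape for this (a sketch only; the concrete exception cause is an assumption, not taken from the patch) is to tolerate only the node-stopping case and still fail on anything else:

   ```java
   try {
       tx.close();
   }
   catch (Exception e) {
       // Sketch: grid2 may already be stopping at this point; any other
       // failure of tx.close() should still fail the test.
       assertTrue("Unexpected exception on tx.close(): " + e,
           X.hasCause(e, NodeStoppingException.class));
   }
   ```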



##########
modules/core/src/test/java/org/apache/ignite/internal/processors/cache/transactions/TxRecoveryWithConcurrentRollbackTest.java:
##########
@@ -258,6 +263,120 @@ else if (g1Keys.contains(key))
         assertEquals(s1, s2);
     }
 
+    /**
+     * The test enforces the concurrent processing of the same prepared transaction
+     * both in the tx recovery procedure started due to primary node left and in the
+     * tx recovery request handler invoked by message from another backup node.
+     * <p>
+     * The idea is to have a 3-nodes cluster and a cache with 2 backups. So there
+     * will be 2 backup nodes to execute the tx recovery in parallel if primary one
+     * would fail. These backup nodes will send the tx recovery requests to each
+     * other, so the tx recovery request handler will be invoked as well.
+     * <p>
+     * Use several attempts to reproduce the race condition.
+     * <p>
+     * Expected result: transaction is finished on both backup nodes and the partition
+     * map exchange is completed as well.
+     */
+    @Test
+    public void testRecoveryNotDeadLockOnPrimaryFail() throws Exception {
+        backups = 2;
+        persistence = false;
+
+        for (int iter = 0; iter < 100; iter++) {
+            final IgniteEx grid0 = startGrid(0);
+
+            final IgniteEx grid1 = startGrid(1, (UnaryOperator<IgniteConfiguration>)cfg -> cfg
+                .setSystemThreadPoolSize(1).setStripedPoolSize(1));
+
+            final IgniteEx grid2 = startGrid(2);
+
+            grid0.cluster().state(ACTIVE);
+
+            final IgniteCache<Object, Object> cache = grid2.cache(DEFAULT_CACHE_NAME);
+
+            final Transaction tx = grid2.transactions().txStart(PESSIMISTIC, REPEATABLE_READ);
+
+            final Integer g2Key = primaryKeys(cache, 1, 0).get(0);
+
+            cache.put(g2Key, Boolean.TRUE);
+
+            final TransactionProxyImpl<?, ?> p = (TransactionProxyImpl<?, ?>)tx;
+
+            p.tx().prepare(true);
+
+            final List<IgniteInternalTx> txs0 = txs(grid0);
+            final List<IgniteInternalTx> txs1 = txs(grid1);
+            final List<IgniteInternalTx> txs2 = txs(grid2);
+
+            assertTrue(txs0.size() == 1);
+            assertTrue(txs1.size() == 1);
+            assertTrue(txs2.size() == 1);
+
+            final CountDownLatch grid1NodeLeftEventLatch = new CountDownLatch(1);
+
+            grid1.events().localListen(new PE() {
+                @Override public boolean apply(Event evt) {
+                    grid1NodeLeftEventLatch.countDown();
+
+                    return true;
+                }
+            }, EventType.EVT_NODE_LEFT);
+
+            final CountDownLatch grid1BlockLatch = new CountDownLatch(1);
+
+            // Block recovery procedure processing on grid1.
+            grid1.context().pools().getSystemExecutorService().execute(() -> U.awaitQuiet(grid1BlockLatch));
+
+            final int stripe = U.safeAbs(p.tx().xidVersion().hashCode());

Review Comment:
   Any chance to get this from the pool's code, to make it more refactoring-friendly?
   Will this always be 0 because of `setStripedPoolSize(1)`?
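   For reference, a sketch of the arithmetic behind the second question (assuming the striped pool dispatches a task with index `idx` to stripe `idx % poolSize`): the computed value itself is generally not 0, but with a single stripe every index maps to stripe 0, so the exact value does not change which stripe gets blocked.

   ```java
   // Sketch only: why any index lands on the same stripe when the pool has one stripe.
   int poolSize = 1; // from setStripedPoolSize(1)
   int idx = U.safeAbs(p.tx().xidVersion().hashCode());
   int targetStripe = idx % poolSize; // always 0 when poolSize == 1
   ```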



##########
modules/core/src/test/java/org/apache/ignite/internal/processors/cache/transactions/TxRecoveryWithConcurrentRollbackTest.java:
##########
@@ -258,6 +263,120 @@ else if (g1Keys.contains(key))
         assertEquals(s1, s2);
     }
 
+    /**
+     * The test enforces the concurrent processing of the same prepared transaction
+     * both in the tx recovery procedure started due to primary node left and in the
+     * tx recovery request handler invoked by message from another backup node.
+     * <p>
+     * The idea is to have a 3-nodes cluster and a cache with 2 backups. So there
+     * will be 2 backup nodes to execute the tx recovery in parallel if primary one
+     * would fail. These backup nodes will send the tx recovery requests to each
+     * other, so the tx recovery request handler will be invoked as well.
+     * <p>
+     * Use several attempts to reproduce the race condition.
+     * <p>
+     * Expected result: transaction is finished on both backup nodes and the partition
+     * map exchange is completed as well.
+     */
+    @Test
+    public void testRecoveryNotDeadLockOnPrimaryFail() throws Exception {
+        backups = 2;
+        persistence = false;
+
+        for (int iter = 0; iter < 100; iter++) {

Review Comment:
   Do we really need to do this 100 times?
   It looks like you are describing a precise situation, not a probabilistic one.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
