[
https://issues.apache.org/jira/browse/IGNITE-20303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vyacheslav Koptilin updated IGNITE-20303:
-----------------------------------------
Priority: Blocker (was: Major)
> "Raft group on the node is already started" exception when pending and
> planned assignment changed faster then rebalance
> -----------------------------------------------------------------------------------------------------------------------
>
> Key: IGNITE-20303
> URL: https://issues.apache.org/jira/browse/IGNITE-20303
> Project: Ignite
> Issue Type: Bug
> Reporter: Sergey Uttsel
> Assignee: Sergey Uttsel
> Priority: Blocker
> Labels: ignite-3
> Time Spent: 10m
> Remaining Estimate: 0h
>
> If many assignment changes happen quickly, the rebalance does not have time
> to complete for each change. In this case the following exception is thrown:
> {code:java}
> [2023-08-24T16:58:51,328][ERROR][%irdt_ttqr_20000%tableManager-io-10][WatchProcessor] Error occurred when processing a watch event
> org.apache.ignite.lang.IgniteInternalException: Raft group on the node is already started [nodeId=RaftNodeId [groupId=1_part_0, peer=Peer [consistentId=irdt_ttqr_20000, idx=0]]]
>     at org.apache.ignite.internal.raft.Loza.startRaftGroupNodeInternal(Loza.java:342) ~[main/:?]
>     at org.apache.ignite.internal.raft.Loza.startRaftGroupNode(Loza.java:230) ~[main/:?]
>     at org.apache.ignite.internal.raft.Loza.startRaftGroupNode(Loza.java:203) ~[main/:?]
>     at org.apache.ignite.internal.table.distributed.TableManager.startPartitionRaftGroupNode(TableManager.java:2361) ~[main/:?]
>     at org.apache.ignite.internal.table.distributed.TableManager.lambda$handleChangePendingAssignmentEvent$98(TableManager.java:2261) ~[main/:?]
>     at org.apache.ignite.internal.util.IgniteUtils.inBusyLock(IgniteUtils.java:922) ~[main/:?]
>     at org.apache.ignite.internal.table.distributed.TableManager.lambda$handleChangePendingAssignmentEvent$99(TableManager.java:2259) ~[main/:?]
>     at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736) ~[?:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
>     at java.lang.Thread.run(Thread.java:834) ~[?:?]
> {code}
> The reproducer is based on ItRebalanceDistributedTest#testThreeQueuedRebalances.
> See the exception in the test log:
> {code:java}
> @Test
> void testThreeQueuedRebalances() throws Exception {
>     Node node = getNode(0);
>
>     createZone(node, ZONE_NAME, 1, 1);
>     createTable(node, ZONE_NAME, TABLE_NAME);
>
>     assertTrue(waitForCondition(() -> getPartitionClusterNodes(node, 0).size() == 1, AWAIT_TIMEOUT_MILLIS));
>
>     alterZone(node, ZONE_NAME, 2);
>     alterZone(node, ZONE_NAME, 3);
>     alterZone(node, ZONE_NAME, 2);
>     alterZone(node, ZONE_NAME, 3);
>     alterZone(node, ZONE_NAME, 2);
>     alterZone(node, ZONE_NAME, 3);
>     alterZone(node, ZONE_NAME, 2);
>     alterZone(node, ZONE_NAME, 3);
>     alterZone(node, ZONE_NAME, 2);
>     alterZone(node, ZONE_NAME, 3);
>     alterZone(node, ZONE_NAME, 2);
>
>     waitPartitionAssignmentsSyncedToExpected(0, 2);
>     checkPartitionNodes(0, 2);
> }
> {code}
> We can fix it by checking whether the Raft node and the Replica have already been
> created before calling startPartitionRaftGroupNode and startReplicaWithNewListener in
> TableManager#handleChangePendingAssignmentEvent.
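> The general idea is to make the start idempotent: skip starting a Raft group node
> (and, analogously, a Replica) if it has already been started locally. Below is a
> minimal standalone sketch of such a guard, using a hypothetical IdempotentGroupStarter
> helper; it only illustrates the shape of the check and is not the actual
> TableManager/Loza API:
> {code:java}
> import java.util.Set;
> import java.util.concurrent.ConcurrentHashMap;
>
> // Hypothetical illustration of the "check before start" guard; the real fix
> // would live in TableManager#handleChangePendingAssignmentEvent.
> class IdempotentGroupStarter {
>     private final Set<String> startedGroups = ConcurrentHashMap.newKeySet();
>
>     /** Runs startAction only if the group has not been started on this node yet. */
>     void startIfAbsent(String raftNodeId, Runnable startAction) {
>         // add() returns false when the id is already present, so a repeated
>         // pending-assignment event does not start the same group twice.
>         if (startedGroups.add(raftNodeId)) {
>             startAction.run();
>         }
>     }
> }
> {code}
> The same check could also rely on the existing started-nodes bookkeeping instead of a
> separate set; the sketch only shows where the guard goes relative to the start calls.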
--
This message was sent by Atlassian Jira
(v8.20.10#820010)