[
https://issues.apache.org/jira/browse/IGNITE-20303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vyacheslav Koptilin updated IGNITE-20303:
-----------------------------------------
Priority: Blocker (was: Major)
> "Raft group on the node is already started" exception when pending and
> planned assignment changed faster then rebalance
> -----------------------------------------------------------------------------------------------------------------------
>
> Key: IGNITE-20303
> URL: https://issues.apache.org/jira/browse/IGNITE-20303
> Project: Ignite
> Issue Type: Bug
> Reporter: Sergey Uttsel
> Assignee: Sergey Uttsel
> Priority: Blocker
> Labels: ignite-3
> Time Spent: 10m
> Remaining Estimate: 0h
>
> If many assignment changes happen quickly, the rebalance does not have time
> to complete for each change. In this case the following exception is thrown:
> {code:java}
> [2023-08-24T16:58:51,328][ERROR][%irdt_ttqr_20000%tableManager-io-10][WatchProcessor] Error occurred when processing a watch event
> org.apache.ignite.lang.IgniteInternalException: Raft group on the node is already started [nodeId=RaftNodeId [groupId=1_part_0, peer=Peer [consistentId=irdt_ttqr_20000, idx=0]]]
>     at org.apache.ignite.internal.raft.Loza.startRaftGroupNodeInternal(Loza.java:342) ~[main/:?]
>     at org.apache.ignite.internal.raft.Loza.startRaftGroupNode(Loza.java:230) ~[main/:?]
>     at org.apache.ignite.internal.raft.Loza.startRaftGroupNode(Loza.java:203) ~[main/:?]
>     at org.apache.ignite.internal.table.distributed.TableManager.startPartitionRaftGroupNode(TableManager.java:2361) ~[main/:?]
>     at org.apache.ignite.internal.table.distributed.TableManager.lambda$handleChangePendingAssignmentEvent$98(TableManager.java:2261) ~[main/:?]
>     at org.apache.ignite.internal.util.IgniteUtils.inBusyLock(IgniteUtils.java:922) ~[main/:?]
>     at org.apache.ignite.internal.table.distributed.TableManager.lambda$handleChangePendingAssignmentEvent$99(TableManager.java:2259) ~[main/:?]
>     at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736) ~[?:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
>     at java.lang.Thread.run(Thread.java:834) ~[?:?]
> {code}
> The reproducer is based on ItRebalanceDistributedTest#testThreeQueuedRebalances.
> See the exception in the test log:
> {code:java}
> @Test
> void testThreeQueuedRebalances() throws Exception {
>     Node node = getNode(0);
>
>     createZone(node, ZONE_NAME, 1, 1);
>     createTable(node, ZONE_NAME, TABLE_NAME);
>
>     assertTrue(waitForCondition(() -> getPartitionClusterNodes(node, 0).size() == 1, AWAIT_TIMEOUT_MILLIS));
>
>     alterZone(node, ZONE_NAME, 2);
>     alterZone(node, ZONE_NAME, 3);
>     alterZone(node, ZONE_NAME, 2);
>     alterZone(node, ZONE_NAME, 3);
>     alterZone(node, ZONE_NAME, 2);
>     alterZone(node, ZONE_NAME, 3);
>     alterZone(node, ZONE_NAME, 2);
>     alterZone(node, ZONE_NAME, 3);
>     alterZone(node, ZONE_NAME, 2);
>     alterZone(node, ZONE_NAME, 3);
>     alterZone(node, ZONE_NAME, 2);
>
>     waitPartitionAssignmentsSyncedToExpected(0, 2);
>     checkPartitionNodes(0, 2);
> }
> {code}
> We can fix it by checking whether the Raft node and the Replica have already been
> created before calling startPartitionRaftGroupNode and startReplicaWithNewListener in
> TableManager#handleChangePendingAssignmentEvent.
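> The general idea is to make the start idempotent: skip starting a Raft group node
> (and, analogously, a Replica) if it has already been started locally. Below is a
> minimal standalone sketch of such a guard, using a hypothetical IdempotentGroupStarter
> helper; it only illustrates the shape of the check and is not the actual
> TableManager/Loza API:
> {code:java}
> import java.util.Set;
> import java.util.concurrent.ConcurrentHashMap;
>
> // Hypothetical illustration of the "check before start" guard; the real fix
> // would live in TableManager#handleChangePendingAssignmentEvent.
> class IdempotentGroupStarter {
>     private final Set<String> startedGroups = ConcurrentHashMap.newKeySet();
>
>     /** Runs startAction only if the group has not been started on this node yet. */
>     void startIfAbsent(String raftNodeId, Runnable startAction) {
>         // add() returns false when the id is already present, so a repeated
>         // pending-assignment event does not start the same group twice.
>         if (startedGroups.add(raftNodeId)) {
>             startAction.run();
>         }
>     }
> }
> {code}
> The same check could also rely on the existing started-nodes bookkeeping instead of a
> separate set; the sketch only shows where the guard goes relative to the start calls.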
--
This message was sent by Atlassian Jira
(v8.20.10#820010)