[
https://issues.apache.org/jira/browse/IGNITE-20640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vladislav Pyatkov updated IGNITE-20640:
---------------------------------------
Description:
Because nodes start simultaneously, tests may trigger several rebalances at
startup. After a rebalance finishes, the list of peers for the RAFT
replication group can change. The changed list of peers should be applied to
the RAFT clients, but this does not happen.
The method InternalTableImpl#updateInternalTableRaftGroupService updates
clients only on table start and does not account for later rebalances.
As a result, an attempt to send a RAFT command fails with a timeout exception
because the leader is absent from the list of peers:
was:
This behavior causes any RAFT operation to get stuck because a leader
cannot be elected.
{noformat}
[2023-10-10T16:48:48,771][INFO ][%node1%tableManager-io-3][Loza] Start new raft node=RaftNodeId [groupId=3_part_15, peer=Peer [consistentId=node1, idx=0]] with initial configuration=PeersAndLearners [peers=Set12 [Peer [consistentId=node2, idx=0]], learners=SetN []]
{noformat}
This issue reproduces in the test
ItDataSchemaSyncTest#checkSchemasCorrectlyRestore. To make it visible in the
log, add the following assertion:
{code:title=Loza#startRaftGroupNodeInternal}
assert configuration.peers().contains(nodeId.peer()) || configuration.learners().contains(nodeId.peer())
        : "Raft node started on a peer where it should not be";
{code}
{noformat}
[2023-10-10T20:51:51,154][ERROR][%node0%tableManager-io-11][WatchProcessor] Error occurred when processing a watch event
java.lang.AssertionError: Raft node started on a peer where it should not be
    at org.apache.ignite.internal.raft.Loza.startRaftGroupNodeInternal(Loza.java:361) ~[main/:?]
    at org.apache.ignite.internal.raft.Loza.startRaftGroupNode(Loza.java:252) ~[main/:?]
    at org.apache.ignite.internal.raft.Loza.startRaftGroupNode(Loza.java:225) ~[main/:?]
    at org.apache.ignite.internal.table.distributed.TableManager.startPartitionRaftGroupNode(TableManager.java:1986) ~[main/:?]
    at org.apache.ignite.internal.table.distributed.TableManager.lambda$handleChangePendingAssignmentEvent$90(TableManager.java:1878) ~[main/:?]
    at org.apache.ignite.internal.util.IgniteUtils.inBusyLock(IgniteUtils.java:805) ~[main/:?]
    at org.apache.ignite.internal.table.distributed.TableManager.lambda$handleChangePendingAssignmentEvent$91(TableManager.java:1848) ~[main/:?]
    at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:783) [?:?]
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]
{noformat}
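The assertion above amounts to a set-membership check: the local peer must appear among either the voting peers or the learners of the configuration it is started with. The following is a minimal standalone sketch of that check, not Ignite code. The class RaftMembershipCheck and the method isMember are hypothetical names, and plain strings stand in for the Peer objects. The input values are taken from the INFO log line above, where node1 starts group 3_part_15 with peers containing only node2 and no learners.

```java
import java.util.Set;

/** Hypothetical stand-in for the membership check; this is not Ignite code. */
public class RaftMembershipCheck {
    /**
     * Returns true if the local node is listed as a voting peer or as a learner
     * in the group configuration it is being started with.
     */
    static boolean isMember(String localNode, Set<String> peers, Set<String> learners) {
        return peers.contains(localNode) || learners.contains(localNode);
    }

    public static void main(String[] args) {
        // Values from the log line above: node1 starts a raft node for
        // group 3_part_15 with peers = {node2} and an empty learner set.
        boolean member = isMember("node1", Set.of("node2"), Set.of());
        System.out.println("node1 is a member of its own group: " + member);
    }
}
```

With the values from the log, the check evaluates to false, which is the condition the added assertion reports.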
> Raft node started in a node where it should not be
> --------------------------------------------------
>
> Key: IGNITE-20640
> URL: https://issues.apache.org/jira/browse/IGNITE-20640
> Project: Ignite
> Issue Type: Bug
> Reporter: Vladislav Pyatkov
> Priority: Major