[ 
https://issues.apache.org/jira/browse/IGNITE-20640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladislav Pyatkov updated IGNITE-20640:
---------------------------------------
    Description: 
Because nodes start simultaneously, tests may trigger several rebalances at 
startup. After a rebalance finishes, the list of peers for the RAFT 
replication group may have changed. The changed list of peers should be 
applied to the RAFT clients, but this does not happen.

The method InternalTableImpl#updateInternalTableRaftGroupService updates the 
clients only on table start and does not account for later rebalances. 
As a result, sending a RAFT command fails with a timeout exception 
because the leader is absent from the list of peers:

  was:
This behavior causes any RAFT operation to get stuck because a leader 
cannot be elected.
{noformat}
[2023-10-10T16:48:48,771][INFO ][%node1%tableManager-io-3][Loza] Start new raft node=RaftNodeId [groupId=3_part_15, peer=Peer [consistentId=node1, idx=0]] with initial configuration=PeersAndLearners [peers=Set12 [Peer [consistentId=node2, idx=0]], learners=SetN []]
{noformat}
The issue reproduces in the test 
ItDataSchemaSyncTest#checkSchemasCorrectlyRestore. To make it visible in the 
log, add an assertion:

{code:title=Loza#startRaftGroupNodeInternal}
assert configuration.peers().contains(nodeId.peer())
        || configuration.learners().contains(nodeId.peer())
        : "Raft node started on a peer where it should not be";
{code}
{noformat}
[2023-10-10T20:51:51,154][ERROR][%node0%tableManager-io-11][WatchProcessor] Error occurred when processing a watch event
java.lang.AssertionError: Raft node started on a peer where it should not be
    at org.apache.ignite.internal.raft.Loza.startRaftGroupNodeInternal(Loza.java:361) ~[main/:?]
    at org.apache.ignite.internal.raft.Loza.startRaftGroupNode(Loza.java:252) ~[main/:?]
    at org.apache.ignite.internal.raft.Loza.startRaftGroupNode(Loza.java:225) ~[main/:?]
    at org.apache.ignite.internal.table.distributed.TableManager.startPartitionRaftGroupNode(TableManager.java:1986) ~[main/:?]
    at org.apache.ignite.internal.table.distributed.TableManager.lambda$handleChangePendingAssignmentEvent$90(TableManager.java:1878) ~[main/:?]
    at org.apache.ignite.internal.util.IgniteUtils.inBusyLock(IgniteUtils.java:805) ~[main/:?]
    at org.apache.ignite.internal.table.distributed.TableManager.lambda$handleChangePendingAssignmentEvent$91(TableManager.java:1848) ~[main/:?]
    at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:783) [?:?]
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]
{noformat}
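
For illustration only, the membership check that the assertion performs can be 
sketched outside of Ignite as follows. The class and method names 
(RaftMembershipCheck, isLocalPeerInConfiguration) are hypothetical, and peers 
are modelled as plain consistent-ID strings rather than Ignite's Peer objects. 
The values mirror the log line above, where node1 starts a raft node for a 
configuration that lists only node2.

```java
import java.util.Set;

public class RaftMembershipCheck {
    /**
     * Returns true when the local peer is listed among the peers or the
     * learners of the group configuration (the condition the assertion checks).
     * Hypothetical helper: peers are represented by their consistent IDs.
     */
    public static boolean isLocalPeerInConfiguration(
            Set<String> peers, Set<String> learners, String localPeer) {
        return peers.contains(localPeer) || learners.contains(localPeer);
    }

    public static void main(String[] args) {
        // Values taken from the log line: node1 starts a raft node for group
        // 3_part_15, but the initial configuration lists only node2.
        Set<String> peers = Set.of("node2");
        Set<String> learners = Set.of();

        System.out.println(isLocalPeerInConfiguration(peers, learners, "node1")); // false
        System.out.println(isLocalPeerInConfiguration(peers, learners, "node2")); // true
    }
}
```

With these inputs the check fails for node1, which corresponds to the 
AssertionError in the stack trace above.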


> Raft node started in a node where it should not be
> --------------------------------------------------
>
>                 Key: IGNITE-20640
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20640
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Vladislav Pyatkov
>            Priority: Major
>
> Because nodes start simultaneously, tests may trigger several rebalances at 
> startup. After a rebalance finishes, the list of peers for the RAFT 
> replication group may have changed. The changed list of peers should be 
> applied to the RAFT clients, but this does not happen.
> The method InternalTableImpl#updateInternalTableRaftGroupService updates the 
> clients only on table start and does not account for later rebalances. 
> As a result, sending a RAFT command fails with a timeout exception 
> because the leader is absent from the list of peers:



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
