[ 
https://issues.apache.org/jira/browse/IGNITE-20640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirill Gusakov updated IGNITE-20640:
------------------------------------
    Description: 
*Motivation*

Because nodes start simultaneously, tests may trigger several rebalances at 
startup. After a rebalance finishes, the list of peers for the RAFT 
replication group can be different. The changed peer list should be applied 
to the RAFT clients, but this does not happen.

The method InternalTableImpl#updateInternalTableRaftGroupService updates the 
clients only on table start and does not take later rebalances into account. 
As a result, when we try to send a RAFT command, we receive a timeout 
exception because the leader is absent from the client's list of peers:
{noformat}
java.util.concurrent.CompletionException: java.net.ConnectException: Peer irdt_ttqr_20000 is unavailable
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?]
        at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099) ~[?:?]
        at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) ~[?:?]
        at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:530) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:497) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.raft.RaftGroupServiceImpl.run(RaftGroupServiceImpl.java:456) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.raft.client.TopologyAwareRaftGroupService.run(TopologyAwareRaftGroupService.java:423) ~[ignite-replicator-3.0.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processReplicaSafeTimeSyncRequest(PartitionReplicaListener.java:854) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:588) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$processRequest$16(PartitionReplicaListener.java:470) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
        at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106) ~[?:?]
        at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) ~[?:?]
        at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processRequest(PartitionReplicaListener.java:470) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$invoke$12(PartitionReplicaListener.java:449) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
        at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106) ~[?:?]
        at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) ~[?:?]
        at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.invoke(PartitionReplicaListener.java:449) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.replicator.Replica.processRequest(Replica.java:139) ~[ignite-replicator-3.0.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.replicator.ReplicaManager.lambda$idleSafeTimeSync$16(ReplicaManager.java:692) ~[ignite-replicator-3.0.0-SNAPSHOT.jar:?]
        at java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4772) ~[?:?]
        at org.apache.ignite.internal.replicator.ReplicaManager.idleSafeTimeSync(ReplicaManager.java:686) ~[ignite-replicator-3.0.0-SNAPSHOT.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) [?:?]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.net.ConnectException: Peer irdt_ttqr_20000 is unavailable
        at org.apache.ignite.internal.raft.RaftGroupServiceImpl.resolvePeer(RaftGroupServiceImpl.java:768) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:529) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
        ... 23 more
{noformat}

*Definition of done*

After a rebalance, the RAFT clients must be updated with the appropriate peer 
list.
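
The expected behaviour can be illustrated with a minimal sketch. It is only a 
model of the problem, not the Ignite implementation: PeerUpdateSketch, 
RaftClientSketch, onRebalanceFinished and the node names are hypothetical and 
do not exist in the code base. The client caches the peer list it was given at 
start, fails when the leader is missing from that list, and works again once 
the list is refreshed after a rebalance.

```java
import java.net.ConnectException;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical model of the issue; none of these types are Ignite classes. */
final class PeerUpdateSketch {
    /** Stand-in for a RAFT client that caches the peer list of its group. */
    static final class RaftClientSketch {
        private final AtomicReference<List<String>> peers;

        RaftClientSketch(List<String> initialPeers) {
            this.peers = new AtomicReference<>(List.copyOf(initialPeers));
        }

        /** Hypothetical hook: invoked with the new peer list once a rebalance finishes. */
        void onRebalanceFinished(List<String> newPeers) {
            peers.set(List.copyOf(newPeers));
        }

        /** Returns true if the given node is in the cached peer list. */
        boolean knows(String node) {
            return peers.get().contains(node);
        }

        /** Sends a command to the leader; fails if the leader is not a known peer. */
        CompletableFuture<String> run(String leader, String command) {
            if (!knows(leader)) {
                return CompletableFuture.failedFuture(
                        new ConnectException("Peer " + leader + " is unavailable"));
            }
            return CompletableFuture.completedFuture(command + " applied via " + leader);
        }
    }

    public static void main(String[] args) {
        // Peer list captured at table start.
        RaftClientSketch client = new RaftClientSketch(List.of("node_0", "node_1"));

        // A rebalance moved the group to node_1 and node_2, but the client was not told.
        System.out.println(client.run("node_2", "safeTimeSync").isCompletedExceptionally());

        // Definition of done: the client receives the new peer list after the rebalance.
        client.onRebalanceFinished(List.of("node_1", "node_2"));
        System.out.println(client.run("node_2", "safeTimeSync").join());
    }
}
```

In this sketch the peer list is replaced atomically, so a command that is 
already in flight sees either the old list or the new one, never a mix of the 
two.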

  was:
*Motivation*

Due to nodes starting simultaneously, tests may have several rebalances at the 
start. After the rebalance is finished, the list of peers for the raft 
replication group can be different. The changed list of peers should apply to 
RAFT clients, but it does not happen.

The method (InternalTableImpl#updateInternalTableRaftGroupService) updates 
clients only on table start and does not consider a further rebalance. 
Currently, we try to send a raft command but receive a timeout exception 
because the leader is absent from the list of peers:
{noformat}
java.util.concurrent.CompletionException: java.net.ConnectException: Peer irdt_ttqr_20000 is unavailable
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?]
        at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099) ~[?:?]
        at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) ~[?:?]
        at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:530) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:497) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.raft.RaftGroupServiceImpl.run(RaftGroupServiceImpl.java:456) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.raft.client.TopologyAwareRaftGroupService.run(TopologyAwareRaftGroupService.java:423) ~[ignite-replicator-3.0.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processReplicaSafeTimeSyncRequest(PartitionReplicaListener.java:854) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:588) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$processRequest$16(PartitionReplicaListener.java:470) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
        at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106) ~[?:?]
        at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) ~[?:?]
        at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processRequest(PartitionReplicaListener.java:470) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$invoke$12(PartitionReplicaListener.java:449) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
        at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106) ~[?:?]
        at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) ~[?:?]
        at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.invoke(PartitionReplicaListener.java:449) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.replicator.Replica.processRequest(Replica.java:139) ~[ignite-replicator-3.0.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.replicator.ReplicaManager.lambda$idleSafeTimeSync$16(ReplicaManager.java:692) ~[ignite-replicator-3.0.0-SNAPSHOT.jar:?]
        at java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4772) ~[?:?]
        at org.apache.ignite.internal.replicator.ReplicaManager.idleSafeTimeSync(ReplicaManager.java:686) ~[ignite-replicator-3.0.0-SNAPSHOT.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) [?:?]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.net.ConnectException: Peer irdt_ttqr_20000 is unavailable
        at org.apache.ignite.internal.raft.RaftGroupServiceImpl.resolvePeer(RaftGroupServiceImpl.java:768) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:529) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
        ... 23 more
{noformat}

*Definition of done*

After rebalancing the non-interaction peer list, the list should be updated and 
RAFT commands applied.


> RAFT client does not change peers after rebalance
> -------------------------------------------------
>
>                 Key: IGNITE-20640
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20640
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Vladislav Pyatkov
>            Assignee: Kirill Gusakov
>            Priority: Blocker
>              Labels: ignite-3
>          Time Spent: 0.5h
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
