[
https://issues.apache.org/jira/browse/IGNITE-20640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kirill Gusakov updated IGNITE-20640:
------------------------------------
Description:
*Motivation*
Because nodes start simultaneously, tests may trigger several rebalances at
startup. After a rebalance finishes, the list of peers for the raft
replication group can change. The new peer list should be applied to the
RAFT clients, but this does not happen.
The method InternalTableImpl#updateInternalTableRaftGroupService updates
clients only on table start and does not account for later rebalances.
As a result, an attempt to send a raft command fails with a timeout exception
because the leader is absent from the client's list of peers (for example, a
(node1) -> (node2) rebalance where the old and new peer lists are disjoint):
{noformat}
java.util.concurrent.CompletionException: java.net.ConnectException: Peer irdt_ttqr_20000 is unavailable
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?]
    at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099) ~[?:?]
    at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) ~[?:?]
    at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:530) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:497) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.raft.RaftGroupServiceImpl.run(RaftGroupServiceImpl.java:456) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.raft.client.TopologyAwareRaftGroupService.run(TopologyAwareRaftGroupService.java:423) ~[ignite-replicator-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processReplicaSafeTimeSyncRequest(PartitionReplicaListener.java:854) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:588) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$processRequest$16(PartitionReplicaListener.java:470) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106) ~[?:?]
    at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) ~[?:?]
    at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processRequest(PartitionReplicaListener.java:470) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$invoke$12(PartitionReplicaListener.java:449) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106) ~[?:?]
    at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) ~[?:?]
    at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.invoke(PartitionReplicaListener.java:449) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.replicator.Replica.processRequest(Replica.java:139) ~[ignite-replicator-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.replicator.ReplicaManager.lambda$idleSafeTimeSync$16(ReplicaManager.java:692) ~[ignite-replicator-3.0.0-SNAPSHOT.jar:?]
    at java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4772) ~[?:?]
    at org.apache.ignite.internal.replicator.ReplicaManager.idleSafeTimeSync(ReplicaManager.java:686) ~[ignite-replicator-3.0.0-SNAPSHOT.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) [?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.net.ConnectException: Peer irdt_ttqr_20000 is unavailable
    at org.apache.ignite.internal.raft.RaftGroupServiceImpl.resolvePeer(RaftGroupServiceImpl.java:768) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:529) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
    ... 23 more
{noformat}
*Definition of done*
After rebalancing, the RAFT clients must be updated with the appropriate peer list.
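The failure mode and the expected behavior can be illustrated with a minimal sketch. Everything in it is hypothetical and is not the Ignite API: the class PeerListClientSketch, the hook onRebalanceFinished and the liveNodes parameter are names invented for illustration. It assumes only that a client resolves a target from its own peer list and that this list must be replaced once a rebalance completes.

```java
import java.net.ConnectException;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a RAFT client; NOT the Ignite implementation. */
final class PeerListClientSketch {
    /** Peers this client will try; replaced as a whole after a rebalance. */
    private volatile List<String> peers;

    PeerListClientSketch(List<String> initialPeers) {
        this.peers = List.copyOf(initialPeers);
    }

    /** Hypothetical hook: invoked when a rebalance completes with the new peer list. */
    void onRebalanceFinished(List<String> newPeers) {
        this.peers = List.copyOf(newPeers);
    }

    /**
     * Returns the first known peer that is present in the set of live nodes.
     * Fails with ConnectException when none of the known peers is live,
     * which is what a client with a stale peer list runs into.
     */
    String resolvePeer(Set<String> liveNodes) throws ConnectException {
        for (String peer : peers) {
            if (liveNodes.contains(peer)) {
                return peer;
            }
        }
        throw new ConnectException("No known peer is available: " + peers);
    }

    public static void main(String[] args) {
        // The group was rebalanced from (node1) to (node2): disjoint peer lists.
        PeerListClientSketch client = new PeerListClientSketch(List.of("node1"));
        Set<String> liveNodes = Set.of("node2");

        try {
            client.resolvePeer(liveNodes);
        } catch (ConnectException e) {
            // Stale peer list: the client still only knows about node1.
            System.out.println("Before update: " + e.getMessage());
        }

        // Applying the new peer list after the rebalance fixes the resolution.
        client.onRebalanceFinished(List.of("node2"));
        try {
            System.out.println("After update: " + client.resolvePeer(liveNodes));
        } catch (ConnectException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch the peer list is swapped as a whole through a volatile field, so a concurrent reader sees either the old list or the new one, never a partially updated list. Whether the real fix should replace the list or notify clients in some other way is left open here.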
was:
*Motivation*
Due to nodes starting simultaneously, tests may have several rebalances at the
start. After the rebalance is finished, the list of peers for the raft
replication group can be different. The changed list of peers should apply to
RAFT clients, but it does not happen.
The method (InternalTableImpl#updateInternalTableRaftGroupService) updates
clients only on table start and does not consider a further rebalance.
Currently, we try to send a raft command but receive a timeout exception
because the leader is absent from the list of peers:
{noformat}
java.util.concurrent.CompletionException: java.net.ConnectException: Peer irdt_ttqr_20000 is unavailable
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?]
    at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099) ~[?:?]
    at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) ~[?:?]
    at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:530) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:497) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.raft.RaftGroupServiceImpl.run(RaftGroupServiceImpl.java:456) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.raft.client.TopologyAwareRaftGroupService.run(TopologyAwareRaftGroupService.java:423) ~[ignite-replicator-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processReplicaSafeTimeSyncRequest(PartitionReplicaListener.java:854) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processOperationRequest(PartitionReplicaListener.java:588) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$processRequest$16(PartitionReplicaListener.java:470) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106) ~[?:?]
    at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) ~[?:?]
    at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.processRequest(PartitionReplicaListener.java:470) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$invoke$12(PartitionReplicaListener.java:449) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106) ~[?:?]
    at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) ~[?:?]
    at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.invoke(PartitionReplicaListener.java:449) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.replicator.Replica.processRequest(Replica.java:139) ~[ignite-replicator-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.replicator.ReplicaManager.lambda$idleSafeTimeSync$16(ReplicaManager.java:692) ~[ignite-replicator-3.0.0-SNAPSHOT.jar:?]
    at java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4772) ~[?:?]
    at org.apache.ignite.internal.replicator.ReplicaManager.idleSafeTimeSync(ReplicaManager.java:686) ~[ignite-replicator-3.0.0-SNAPSHOT.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) [?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.net.ConnectException: Peer irdt_ttqr_20000 is unavailable
    at org.apache.ignite.internal.raft.RaftGroupServiceImpl.resolvePeer(RaftGroupServiceImpl.java:768) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:529) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
    ... 23 more
{noformat}
*Definition of done*
After rebalancing, the RAFT clients must be updated with the appropriate peer list.
> RAFT client does not change peers after rebalance
> -------------------------------------------------
>
> Key: IGNITE-20640
> URL: https://issues.apache.org/jira/browse/IGNITE-20640
> Project: Ignite
> Issue Type: Bug
> Reporter: Vladislav Pyatkov
> Assignee: Kirill Gusakov
> Priority: Blocker
> Labels: ignite-3
> Time Spent: 0.5h
> Remaining Estimate: 0h