[
https://issues.apache.org/jira/browse/IGNITE-20772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784903#comment-17784903
]
Roman Puchkovskiy commented on IGNITE-20772:
--------------------------------------------
When a node is restarted, it tries to connect to the SWIM seed members. We only
have one of them (node0), and the restarted node is node2. For some reason,
node2 fails to connect to the seed, as can be seen from the following log
line:
[2023-10-27T08:24:24,742][WARN ][itrst_tsim_2-client-1][MembershipProtocol]
[default:itrst_tsim_2:[email protected]:3346] Exception on
initial Sync, cause: java.util.concurrent.CompletionException:
org.apache.ignite.internal.network.handshake.HandshakeException: Channel has
been closed before handshake has finished or handshake has failed
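
To illustrate how a single failed initial sync could surface as the
TimeoutException in the quoted stack trace, here is a minimal sketch. It is not
Ignite or ScaleCube code: SeedSyncSketch, joinViaSeeds, startNode and the
connect function are hypothetical names, and the sketch assumes (the log does
not confirm this) that nothing retries the seed before the start timeout
expires.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

// Hypothetical sketch, not the actual Ignite/ScaleCube implementation.
final class SeedSyncSketch {
    /**
     * Tries every seed once. The returned future completes with the first seed
     * that answers; if every attempt fails, it never completes (assumption:
     * no retry happens in this simplified model).
     */
    static CompletableFuture<String> joinViaSeeds(
            List<String> seeds, Function<String, CompletableFuture<Void>> connect) {
        CompletableFuture<String> joined = new CompletableFuture<>();
        for (String seed : seeds) {
            connect.apply(seed).whenComplete((ok, err) -> {
                if (err == null) {
                    joined.complete(seed);
                } else {
                    // Failure is only logged, mirroring the WARN line above.
                    System.out.println("Exception on initial Sync, cause: " + err);
                }
            });
        }
        return joined;
    }

    /** Waits for the join with a bound and wraps failures in RuntimeException. */
    static String startNode(
            List<String> seeds, Function<String, CompletableFuture<Void>> connect, long timeoutMs) {
        try {
            return joinViaSeeds(seeds, connect).get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            throw new RuntimeException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // The only seed (node0) rejects the handshake, so the join never completes.
        Function<String, CompletableFuture<Void>> failingHandshake =
                seed -> CompletableFuture.failedFuture(
                        new IllegalStateException("Channel has been closed before handshake has finished"));
        try {
            startNode(List.of("node0"), failingHandshake, 200);
        } catch (RuntimeException e) {
            System.out.println("Start failed: " + e);
        }
    }
}
```

With only one seed, one failed handshake leaves the join future incomplete, so
the bounded get() is the only place where the failure becomes visible, and it
shows up as a timeout rather than as the handshake error itself.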
> ItTableRaftSnapshotsTest#txSemanticsIsMaintained is flaky in different
> branches
> -------------------------------------------------------------------------------
>
> Key: IGNITE-20772
> URL: https://issues.apache.org/jira/browse/IGNITE-20772
> Project: Ignite
> Issue Type: Bug
> Reporter: Sergey Chugunov
> Assignee: Roman Puchkovskiy
> Priority: Major
> Labels: ignite-3
> Attachments: _Integration_Tests_Module_Runner_18692.log.zip
>
>
> The test fails intermittently in different branches; its success rate is 98%.
> The latest failure in the main branch was caused by a timeout in the test logic:
> {code:java}
> java.lang.RuntimeException: java.util.concurrent.TimeoutException
>     at org.apache.ignite.internal.Cluster.startNode(Cluster.java:336)
>     at org.apache.ignite.internal.Cluster.startNode(Cluster.java:315)
>     at org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.reanimateNode(ItTableRaftSnapshotsTest.java:453)
>     at org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.reanimateNodeAndWaitForSnapshotInstalled(ItTableRaftSnapshotsTest.java:435)
>     at org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.reanimateNode2AndWaitForSnapshotInstalled(ItTableRaftSnapshotsTest.java:425)
>     at org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.txSemanticsIsMaintainedAfterInstallingSnapshot(ItTableRaftSnapshotsTest.java:494)
>     at org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.txSemanticsIsMaintained(ItTableRaftSnapshotsTest.java:466)
>     ...
> Caused by: java.util.concurrent.TimeoutException
>     at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886)
>     at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021)
>     at org.apache.ignite.internal.Cluster.startNode(Cluster.java:330)
>     ... 93 more
> {code}
> The test involves a node restart, so multiple connection errors in the logs
> are expected:
> {code:java}
> [2023-10-27T08:24:44,662][WARN ][%itrst_tsim_2%Raft-Group-Client-4][RaftGroupServiceImpl] Recoverable error during the request type=GetLeaderRequestImpl occurred (will be retried on the randomly selected node):
> java.util.concurrent.CompletionException: java.net.ConnectException: Peer itrst_tsim_0 is unavailable
>     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?]
>     at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099) ~[?:?]
>     at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) ~[?:?]
>     at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:523) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
>     at org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$handleThrowable$40(RaftGroupServiceImpl.java:564) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
>     at java.lang.Thread.run(Thread.java:834) [?:?]
> Caused by: java.net.ConnectException: Peer itrst_tsim_0 is unavailable
>     at org.apache.ignite.internal.raft.RaftGroupServiceImpl.resolvePeer(RaftGroupServiceImpl.java:761) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
>     at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:522) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
>     ... 7 more
> {code}
> At the same time, it is not clear from the logs what prevented the node from
> starting. The suite run with the failed test is available
> [here|https://ci.ignite.apache.org/viewLog.html?buildId=7590927&buildTypeId=ApacheIgnite3xGradle_Test_IntegrationTests_ModuleRunner&tab=buildLog],
> and the logs are attached.