[ 
https://issues.apache.org/jira/browse/IGNITE-20772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784903#comment-17784903
 ] 

Roman Puchkovskiy commented on IGNITE-20772:
--------------------------------------------

When a node is restarted, it tries to connect to the SWIM seed members. We 
have only one seed (node0), and the restarted node is node2. For some reason, 
node2 fails to connect to the seed, as the following log line shows:

[2023-10-27T08:24:24,742][WARN ][itrst_tsim_2-client-1][MembershipProtocol] [default:itrst_tsim_2:[email protected]:3346] Exception on initial Sync, cause: java.util.concurrent.CompletionException: org.apache.ignite.internal.network.handshake.HandshakeException: Channel has been closed before handshake has finished or handshake has failed
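
If the failed initial sync is not followed by a successful attempt, the join 
never completes, which would be consistent with the TimeoutException thrown 
from Cluster.startNode in the description below. The following is only a 
minimal sketch of that chain under this assumption. SeedSyncSketch, 
HandshakeFailed, initialSync and startNode are hypothetical names made up for 
illustration; this is not Ignite or SWIM implementation code.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SeedSyncSketch {
    /** Hypothetical stand-in for a handshake failure (not the Ignite class). */
    static class HandshakeFailed extends RuntimeException {
        HandshakeFailed(String msg) {
            super(msg);
        }
    }

    /** Makes one sync attempt per seed; completes 'joined' only if an attempt succeeds. */
    static void initialSync(List<String> seeds, List<String> reachable, CompletableFuture<Void> joined) {
        for (String seed : seeds) {
            CompletableFuture<Void> attempt = reachable.contains(seed)
                    ? CompletableFuture.completedFuture(null)
                    : CompletableFuture.failedFuture(new CompletionException(
                            new HandshakeFailed("Channel closed before handshake finished")));

            attempt.whenComplete((ok, err) -> {
                if (err == null) {
                    joined.complete(null);
                } else {
                    // Assumption: the failure is only logged, so 'joined' stays incomplete.
                    System.out.println("Exception on initial Sync, cause: " + err);
                }
            });
        }
    }

    /** Waits for the join with a timeout, similar in spirit to a test helper that starts a node. */
    public static String startNode(List<String> seeds, List<String> reachable, long timeoutMs) {
        CompletableFuture<Void> joined = new CompletableFuture<>();
        initialSync(seeds, reachable, joined);
        try {
            joined.get(timeoutMs, TimeUnit.MILLISECONDS);
            return "started";
        } catch (TimeoutException e) {
            return "timeout";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } catch (ExecutionException e) {
            return "failed";
        }
    }

    public static void main(String[] args) {
        // Only seed is unreachable: the join future never completes.
        System.out.println(startNode(List.of("node0"), List.of(), 100));
        // Seed is reachable: the join completes immediately.
        System.out.println(startNode(List.of("node0"), List.of("node0"), 100));
    }
}
```

With a single seed, one failed attempt is enough to leave the join future 
incomplete in this sketch, so the caller only ever sees the timeout and not 
the underlying handshake failure.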

> ItTableRaftSnapshotsTest#txSemanticsIsMaintained is flaky in different 
> branches
> -------------------------------------------------------------------------------
>
>                 Key: IGNITE-20772
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20772
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Sergey Chugunov
>            Assignee: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3
>         Attachments: _Integration_Tests_Module_Runner_18692.log.zip
>
>
> The test fails from time to time in different branches; the success rate is 98%.
> The latest failure in the main branch was caused by a timeout in the test logic:
> {code:java}
> java.lang.RuntimeException: java.util.concurrent.TimeoutException
>       at org.apache.ignite.internal.Cluster.startNode(Cluster.java:336)
>       at org.apache.ignite.internal.Cluster.startNode(Cluster.java:315)
>       at org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.reanimateNode(ItTableRaftSnapshotsTest.java:453)
>       at org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.reanimateNodeAndWaitForSnapshotInstalled(ItTableRaftSnapshotsTest.java:435)
>       at org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.reanimateNode2AndWaitForSnapshotInstalled(ItTableRaftSnapshotsTest.java:425)
>       at org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.txSemanticsIsMaintainedAfterInstallingSnapshot(ItTableRaftSnapshotsTest.java:494)
>       at org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.txSemanticsIsMaintained(ItTableRaftSnapshotsTest.java:466)
> ...
> Caused by: java.util.concurrent.TimeoutException
>       at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886)
>       at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021)
>       at org.apache.ignite.internal.Cluster.startNode(Cluster.java:330)
>       ... 93 more {code}
> The test involves a node restart, so multiple connection errors in the logs 
> are expected:
> {code:java}
> [2023-10-27T08:24:44,662][WARN ][%itrst_tsim_2%Raft-Group-Client-4][RaftGroupServiceImpl] Recoverable error during the request type=GetLeaderRequestImpl occurred (will be retried on the randomly selected node):
> java.util.concurrent.CompletionException: java.net.ConnectException: Peer itrst_tsim_0 is unavailable
>       at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?]
>       at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099) ~[?:?]
>       at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) ~[?:?]
>       at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:523) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
>       at org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$handleThrowable$40(RaftGroupServiceImpl.java:564) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
>       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
>       at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
>       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
>       at java.lang.Thread.run(Thread.java:834) [?:?]
> Caused by: java.net.ConnectException: Peer itrst_tsim_0 is unavailable
>       at org.apache.ignite.internal.raft.RaftGroupServiceImpl.resolvePeer(RaftGroupServiceImpl.java:761) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
>       at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:522) ~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
>       ... 7 more {code}
> At the same time, it is not clear from the logs what prevented the node from starting.
> The suite run with the failed test is available 
> [here|https://ci.ignite.apache.org/viewLog.html?buildId=7590927&buildTypeId=ApacheIgnite3xGradle_Test_IntegrationTests_ModuleRunner&tab=buildLog], 
> and the logs are attached.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
