[jira] [Resolved] (IGNITE-20390) ItTableRaftSnapshotsTest.snapshotInstallationRepeatsOnTimeout became flaky

Roman Puchkovskiy (Jira) Fri, 10 Nov 2023 03:35:08 -0800


     [ 
https://issues.apache.org/jira/browse/IGNITE-20390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Roman Puchkovskiy resolved IGNITE-20390.
----------------------------------------
    Resolution: Duplicate

This has the same root cause as IGNITE-20772: when a node is restarted, an 
exception sometimes occurs during the handshake, which seems to prevent this 
node from connecting to other nodes:

[2023-09-08T00:10:18,847][WARN ][itrst_sirot_2-client-1][MembershipProtocol] 
[default:itrst_sirot_2:[email protected]:3346] Exception on 
initial Sync, cause: java.util.concurrent.CompletionException: 
org.apache.ignite.internal.network.handshake.HandshakeException: Channel has 
been closed before handshake has finished or handshake has failed

> ItTableRaftSnapshotsTest.snapshotInstallationRepeatsOnTimeout became flaky 
> ---------------------------------------------------------------------------
>
>                 Key: IGNITE-20390
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20390
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mirza Aliev
>            Priority: Major
>              Labels: ignite-3
>         Attachments: _Integration_Tests_Module_Runner_16745.log.zip
>
>
> {{ItTableRaftSnapshotsTest.snapshotInstallationRepeatsOnTimeout}} became 
> flaky on the main branch: 
> https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests?branch=%3Cdefault%3E&mode=builds&expandBuildProblemsSection=true&hideProblemsFromDependencies=false&expandBuildTestsSection=true&hideTestsFromDependencies=false#7490330
> The logs show that the node we try to restart cannot join the cluster. This 
> happens because it cannot resolve the peer of the leader when it tries to 
> start the CMG raft group service: {{getByConsistentId}} returns null.
> {code:java}
> ClusterNode node = 
> cluster.topologyService().getByConsistentId(peer.consistentId());
> {code}
> The leader's log also shows that the restarted node is removed from the 
> topology:
> {noformat}
> Node left [member=ClusterNodeImpl [id=64e75771-1f64-47d7-866f-14087ab182fa, 
> name=itrst_sirot_2, address=127.0.0.1:3346, nodeMetadata=null], 
> eventType=REMOVED]
> {noformat}
> When the test passes, we see logs like this:
> {noformat}
> Node left (noop as it has already reappeared) [member=ClusterNodeImpl 
> [id=52a3656a-5bc6-4b20-8796-b92971ae84a4, name=itrst_sirot_2, 
> address=127.0.0.1:3346, nodeMetadata=null], eventType=REMOVED]
> {noformat}
> The difference is that the successful run contains {{(noop as it has already 
> reappeared)}}.
>  
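
To tell the two kinds of runs apart when scanning an attached log, the distinction described above can be sketched as a small classifier. This is only an illustration, not part of Ignite: the class name NodeLeftClassifier and its methods are hypothetical, and it assumes each log entry sits on a single line (the lines quoted above are wrapped by the mail client).

```java
import java.util.List;

/** Hypothetical helper: classifies "Node left" log lines as described in the issue. */
public class NodeLeftClassifier {
    /** Result of inspecting a single log line. */
    public enum Kind { NOT_NODE_LEFT, REMOVED, NOOP_REAPPEARED }

    // Marker text copied from the successful-run log quoted in the issue.
    private static final String NOOP_MARKER = "(noop as it has already reappeared)";

    /** Distinguishes a real removal from the no-op case seen in passing runs. */
    public static Kind classify(String line) {
        if (!line.contains("Node left")) {
            return Kind.NOT_NODE_LEFT;
        }
        return line.contains(NOOP_MARKER) ? Kind.NOOP_REAPPEARED : Kind.REMOVED;
    }

    /** True if any line reports that the named node left and was really removed. */
    public static boolean wasRemoved(List<String> lines, String nodeName) {
        return lines.stream()
                .filter(l -> l.contains("name=" + nodeName + ","))
                .anyMatch(l -> classify(l) == Kind.REMOVED);
    }

    public static void main(String[] args) {
        String failing = "Node left [member=ClusterNodeImpl "
                + "[id=64e75771-1f64-47d7-866f-14087ab182fa, name=itrst_sirot_2, "
                + "address=127.0.0.1:3346, nodeMetadata=null], eventType=REMOVED]";
        String passing = "Node left (noop as it has already reappeared) "
                + "[member=ClusterNodeImpl [id=52a3656a-5bc6-4b20-8796-b92971ae84a4, "
                + "name=itrst_sirot_2, address=127.0.0.1:3346, nodeMetadata=null], "
                + "eventType=REMOVED]";

        System.out.println(classify(failing)); // REMOVED
        System.out.println(classify(passing)); // NOOP_REAPPEARED
    }
}
```

A log in which wasRemoved returns true for itrst_sirot_2 would match the failing pattern from the description. It does not by itself confirm the handshake problem mentioned in the resolution.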



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
