[
https://issues.apache.org/jira/browse/IGNITE-20390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Roman Puchkovskiy resolved IGNITE-20390.
----------------------------------------
Resolution: Duplicate
This has the same root cause as IGNITE-20772: when a node is restarted, an
exception sometimes occurs during the handshake, which seems to prevent this
node from connecting to other nodes:
[2023-09-08T00:10:18,847][WARN ][itrst_sirot_2-client-1][MembershipProtocol]
[default:itrst_sirot_2:[email protected]:3346] Exception on
initial Sync, cause: java.util.concurrent.CompletionException:
org.apache.ignite.internal.network.handshake.HandshakeException: Channel has
been closed before handshake has finished or handshake has failed
> ItTableRaftSnapshotsTest.snapshotInstallationRepeatsOnTimeout became flaky
> ---------------------------------------------------------------------------
>
> Key: IGNITE-20390
> URL: https://issues.apache.org/jira/browse/IGNITE-20390
> Project: Ignite
> Issue Type: Bug
> Reporter: Mirza Aliev
> Priority: Major
> Labels: ignite-3
> Attachments: _Integration_Tests_Module_Runner_16745.log.zip
>
>
> {{ItTableRaftSnapshotsTest.snapshotInstallationRepeatsOnTimeout}} became
> flaky on the main branch:
> https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests?branch=%3Cdefault%3E&mode=builds&expandBuildProblemsSection=true&hideProblemsFromDependencies=false&expandBuildTestsSection=true&hideTestsFromDependencies=false#7490330
> I see from the logs that the node we try to restart cannot join the cluster.
> This happens because it cannot resolve the leader's peer when it tries to
> start the CMG raft group service: {{getByConsistentId}} returns null.
> {code:java}
> ClusterNode node =
> cluster.topologyService().getByConsistentId(peer.consistentId());
> {code}
> Also, I see from the leader's log that this restarted node is removed from
> the topology:
> {noformat}
> Node left [member=ClusterNodeImpl [id=64e75771-1f64-47d7-866f-14087ab182fa,
> name=itrst_sirot_2, address=127.0.0.1:3346, nodeMetadata=null],
> eventType=REMOVED]
> {noformat}
> When the test passes, we see logs like this:
> {noformat}
> Node left (noop as it has already reappeared) [member=ClusterNodeImpl
> [id=52a3656a-5bc6-4b20-8796-b92971ae84a4, name=itrst_sirot_2,
> address=127.0.0.1:3346, nodeMetadata=null], eventType=REMOVED]
> {noformat}
> The difference is that a successful run contains {{(noop as it has already
> reappeared)}}.
>
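
The difference between the two runs can be illustrated with a minimal sketch. This is not Ignite's implementation: the class TopologySketch, the record Member, and the methods onAdded and onRemoved are hypothetical names invented for illustration, and getByConsistentId only mirrors the name of the call quoted above. The sketch assumes the topology is keyed by node name and that a restarted node comes back under the same name with a new id.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical topology tracker; an illustration only, not Ignite code. */
public class TopologySketch {
    /** Hypothetical cluster member: a launch-specific id plus a stable name. */
    public record Member(String id, String name) {}

    // Assumption: one entry per node name, holding the latest known incarnation.
    private final Map<String, Member> membersByName = new ConcurrentHashMap<>();

    /** A node with this name has joined (or rejoined after a restart). */
    public void onAdded(Member member) {
        membersByName.put(member.name(), member);
    }

    /** Handles a REMOVED event and returns the log line describing the outcome. */
    public String onRemoved(Member member) {
        // Remove the entry only if it still maps to this exact incarnation.
        boolean removed = membersByName.remove(member.name(), member);
        if (removed) {
            return "Node left [member=" + member + "]";
        }
        // A newer incarnation already replaced the entry, so nothing is removed.
        return "Node left (noop as it has already reappeared) [member=" + member + "]";
    }

    /** Returns the member with the given name, or null if it is not in the topology. */
    public Member getByConsistentId(String consistentId) {
        return membersByName.get(consistentId);
    }
}
```

Under these assumptions, if the new incarnation is added before the old one is reported as removed, the removal is a no-op and the lookup by name still succeeds. If the new incarnation never joins (for example, because its handshake fails), the removal empties the entry and the lookup returns null, which matches the symptom described above.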
--
This message was sent by Atlassian Jira
(v8.20.10#820010)