Roman Puchkovskiy created IGNITE-18428:
------------------------------------------
Summary: After a RAFT snapshot install timed out, subsequent
installs consistently failed
Key: IGNITE-18428
URL: https://issues.apache.org/jira/browse/IGNITE-18428
Project: Ignite
Issue Type: Bug
Reporter: Roman Puchkovskiy
Fix For: 3.0.0-beta2
If a RAFT snapshot installation takes longer than the corresponding timeout (10
seconds in this case), a retry is attempted. If the retry finds an ongoing
snapshot copier, it tries to cancel it so that the installation starts over on
the next retry.
In one test run, the initial attempt to install a snapshot failed, and every
subsequent attempt only tried to cancel the installation; none of them actually
started another copier, so the retries looped indefinitely.
Normally, {{onSnapshotLoadDone()}} is invoked even if the snapshot load has
failed, in order to clean everything up and make the next install attempt
possible. This cleanup includes nullifying the contents of
{{downloadingSnapshot}} in {{SnapshotExecutorImpl}}. But this time, according to
the log, {{onSnapshotLoadDone()}} was never invoked, so the old snapshot
remained 'downloading' forever.
This could have something to do with the fact that {{IncomingSnapshotCopier}}
does not set its status to error (with {{setError()}}) on cancellation, as
{{LocalSnapshotCopier}} does.
Also, there could be some race.
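
To illustrate the suspected failure mode, here is a minimal single-threaded
sketch. It is a simplified model, not the actual Ignite or JRaft code: the
{{Executor}} and {{Copier}} classes, the {{Outcome}} enum, and the
{{reportsDoneOnCancel}} flag are hypothetical names introduced for
illustration. Only {{downloadingSnapshot}} and {{onSnapshotLoadDone()}} mirror
names from the description above. The assumption being modelled is that a
cancelled copier either triggers the cleanup callback or silently does not.

```java
/** Simplified model of the retry behaviour; NOT the real SnapshotExecutorImpl. */
final class SnapshotInstallSketch {
    enum Outcome { STARTED_NEW_COPIER, CANCELLED_ONGOING_COPIER }

    /** Hypothetical stand-in for a snapshot copier. */
    static final class Copier {
        private final boolean reportsDoneOnCancel;
        private final Runnable onLoadDone;

        Copier(boolean reportsDoneOnCancel, Runnable onLoadDone) {
            this.reportsDoneOnCancel = reportsDoneOnCancel;
            this.onLoadDone = onLoadDone;
        }

        /** Cancels the copy; the cleanup callback fires only if the copier reports completion. */
        void cancel() {
            if (reportsDoneOnCancel) {
                onLoadDone.run();
            }
        }
    }

    /** Hypothetical stand-in for the snapshot executor. */
    static final class Executor {
        private final boolean copierReportsDoneOnCancel;
        private Copier downloadingSnapshot; // null means no installation is in progress

        Executor(boolean copierReportsDoneOnCancel) {
            this.copierReportsDoneOnCancel = copierReportsDoneOnCancel;
        }

        /** One install attempt: cancel an ongoing copier if present, otherwise start a new one. */
        Outcome installSnapshot() {
            if (downloadingSnapshot != null) {
                downloadingSnapshot.cancel();
                return Outcome.CANCELLED_ONGOING_COPIER;
            }
            downloadingSnapshot = new Copier(copierReportsDoneOnCancel, this::onSnapshotLoadDone);
            return Outcome.STARTED_NEW_COPIER;
        }

        /** Cleanup that makes the next install attempt possible. */
        void onSnapshotLoadDone() {
            downloadingSnapshot = null;
        }
    }

    public static void main(String[] args) {
        for (boolean reports : new boolean[] {true, false}) {
            Executor executor = new Executor(reports);
            System.out.println("copier reports completion on cancel: " + reports);
            for (int attempt = 1; attempt <= 4; attempt++) {
                System.out.println("  attempt " + attempt + ": " + executor.installSnapshot());
            }
        }
    }
}
```

In this model, when the copier reports completion on cancel, the attempts
alternate between cancelling and starting a new copier. When it does not, every
attempt after the first only cancels, which matches the loop seen in the log.
The sketch ignores concurrency, so it says nothing about a possible race.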