Roman Puchkovskiy created IGNITE-18428:
------------------------------------------

             Summary: After a RAFT snapshot install timed out, subsequent 
installs consistently failed
                 Key: IGNITE-18428
                 URL: https://issues.apache.org/jira/browse/IGNITE-18428
             Project: Ignite
          Issue Type: Bug
            Reporter: Roman Puchkovskiy
             Fix For: 3.0.0-beta2


If a RAFT snapshot installation takes longer than the corresponding timeout (10
seconds in this case), a retry is attempted. If the retry finds an ongoing
snapshot copier, it tries to cancel it, so that the next retry can start the
installation over.

In one run of a test, the initial attempt to install a snapshot failed, but
every subsequent attempt only tried to cancel the ongoing installation and none
of them actually started another copier, so the retries looped forever.

Normally, {{onSnapshotLoadDone()}} is invoked even if the snapshot load has
failed, to clean everything up and make the next install attempt possible. This
cleanup includes nullifying the contents of {{downloadingSnapshot}} in
{{SnapshotExecutorImpl}}. But this time, according to the log,
{{onSnapshotLoadDone()}} was never invoked, so the old snapshot remained
'downloading' forever.
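
The stuck state can be illustrated with a toy model. This is only a sketch, not the actual Ignite/JRaft code: {{ToyCopier}}, {{ToyExecutor}}, {{tryInstall()}} and the {{cleanupRunsOnCancel}} flag are hypothetical names, and the model assumes that every copier hangs until it is cancelled. Only the {{downloadingSnapshot}} slot and the idea of a 'load done' callback clearing it come from the description above.

```java
import java.util.concurrent.atomic.AtomicReference;

public class SnapshotInstallSketch {
    /** Toy stand-in for a snapshot copier (hypothetical, not the real copier classes). */
    static final class ToyCopier {
        private final Runnable onLoadDone;
        private final boolean cleanupRunsOnCancel;

        ToyCopier(Runnable onLoadDone, boolean cleanupRunsOnCancel) {
            this.onLoadDone = onLoadDone;
            this.cleanupRunsOnCancel = cleanupRunsOnCancel;
        }

        /** Cancels the copy; the 'load done' callback runs only if the flag is set. */
        void cancel() {
            if (cleanupRunsOnCancel) {
                onLoadDone.run();
            }
        }
    }

    /** Toy stand-in for the executor that owns the downloadingSnapshot slot. */
    static final class ToyExecutor {
        private final AtomicReference<ToyCopier> downloadingSnapshot = new AtomicReference<>();
        private final boolean cleanupRunsOnCancel;
        int copiersStarted;

        ToyExecutor(boolean cleanupRunsOnCancel) {
            this.cleanupRunsOnCancel = cleanupRunsOnCancel;
        }

        /** One install attempt: cancel an ongoing copier, or start a new one if the slot is free. */
        boolean tryInstall() {
            ToyCopier ongoing = downloadingSnapshot.get();
            if (ongoing != null) {
                // An installation is still registered: cancel it and let the next retry start over.
                ongoing.cancel();
                return false;
            }
            // The cleanup callback is what frees the slot for the next attempt.
            ToyCopier copier = new ToyCopier(() -> downloadingSnapshot.set(null), cleanupRunsOnCancel);
            downloadingSnapshot.set(copier);
            copiersStarted++;
            return true;
        }
    }

    /** Runs the given number of install attempts and reports how many copiers were started. */
    static int copiersStartedAfter(int attempts, boolean cleanupRunsOnCancel) {
        ToyExecutor executor = new ToyExecutor(cleanupRunsOnCancel);
        for (int i = 0; i < attempts; i++) {
            executor.tryInstall();
        }
        return executor.copiersStarted;
    }

    public static void main(String[] args) {
        System.out.println(copiersStartedAfter(5, true));  // prints 3: cancel frees the slot, retries restart
        System.out.println(copiersStartedAfter(5, false)); // prints 1: slot never freed, retries only cancel
    }
}
```

In this model, when the cleanup callback does not run on cancellation, only the first attempt ever starts a copier, which matches the behavior seen in the log.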

This could have something to do with the fact that {{IncomingSnapshotCopier}}
does not set its status to error (with {{setError()}}) on cancellation, as
{{LocalSnapshotCopier}} does.

There could also be a race condition.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
