[
https://issues.apache.org/jira/browse/IGNITE-18428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Roman Puchkovskiy updated IGNITE-18428:
---------------------------------------
Attachment: test.log.txt
> After a RAFT snapshot install timed out, subsequent installs consistently
> failed
> --------------------------------------------------------------------------------
>
> Key: IGNITE-18428
> URL: https://issues.apache.org/jira/browse/IGNITE-18428
> Project: Ignite
> Issue Type: Bug
> Reporter: Roman Puchkovskiy
> Priority: Major
> Labels: ignite-3
> Fix For: 3.0.0-beta2
>
> Attachments: test.log.txt
>
>
> If a RAFT snapshot installation takes longer than the corresponding timeout
> (10 seconds in this case), a retry is attempted. If the retry finds an
> ongoing snapshot copier, it tries to cancel it so that the next retry can
> start the installation over.
> In one run of a test, the initial attempt to install a snapshot failed, but
> all subsequent attempts only tried to cancel the installation and none of
> them actually started another copier, so the retries looped indefinitely.
> Normally, {{onSnapshotLoadDone()}} is invoked even if the snapshot load has
> failed, in order to clean everything up and make the next install attempt
> possible. This cleanup includes nullifying the contents of
> {{downloadingSnapshot}} in {{SnapshotExecutorImpl}}. But this time, according
> to the log, {{onSnapshotLoadDone()}} was never invoked, so the old snapshot
> remained 'downloading' forever.
> This could have something to do with the fact that {{IncomingSnapshotCopier}}
> does not set its status to error (with {{setError()}}) on cancellation,
> as {{LocalSnapshotCopier}} does.
> There could also be a race.
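
The suspected failure mode can be illustrated with a minimal, single-threaded sketch. This is a hypothetical model, not the actual Ignite or JRaft code: the class SnapshotInstallSketch, its nested Copier, and the methods tryInstall() and simulate() are invented for illustration, and only the names downloadingSnapshot and onSnapshotLoadDone() are borrowed from the description above. It assumes that the completion callback fires on cancellation only when the copier reports an error, which is the unconfirmed hypothesis stated in the issue.

```java
/** Hypothetical, simplified model of the retry logic; NOT the real SnapshotExecutorImpl. */
final class SnapshotInstallSketch {
    /** Stand-in for a snapshot copier. */
    static final class Copier {
        private final Runnable onDone;
        private final boolean reportsErrorOnCancel;

        Copier(Runnable onDone, boolean reportsErrorOnCancel) {
            this.onDone = onDone;
            this.reportsErrorOnCancel = reportsErrorOnCancel;
        }

        /** Assumption: the completion callback runs only if cancellation is reported as an error. */
        void cancel() {
            if (reportsErrorOnCancel) {
                onDone.run();
            }
        }
    }

    private final boolean copierReportsErrorOnCancel;
    private Copier downloadingSnapshot; // non-null while an install is considered ongoing
    private int copiersStarted;

    SnapshotInstallSketch(boolean copierReportsErrorOnCancel) {
        this.copierReportsErrorOnCancel = copierReportsErrorOnCancel;
    }

    /** One install attempt (each call models a retry after a timeout); true if a copier was started. */
    boolean tryInstall() {
        if (downloadingSnapshot != null) {
            // An install is still registered: cancel it and let the next retry start over.
            downloadingSnapshot.cancel();
            return false;
        }
        downloadingSnapshot = new Copier(this::onSnapshotLoadDone, copierReportsErrorOnCancel);
        copiersStarted++;
        return true;
    }

    /** Cleanup that makes the next install attempt possible. */
    void onSnapshotLoadDone() {
        downloadingSnapshot = null;
    }

    /** Runs the given number of attempts and returns how many copiers were actually started. */
    static int simulate(boolean reportsErrorOnCancel, int attempts) {
        SnapshotInstallSketch executor = new SnapshotInstallSketch(reportsErrorOnCancel);
        for (int i = 0; i < attempts; i++) {
            executor.tryInstall();
        }
        return executor.copiersStarted;
    }

    public static void main(String[] args) {
        System.out.println("copiers started without error on cancel: " + simulate(false, 5));
        System.out.println("copiers started with error on cancel: " + simulate(true, 5));
    }
}
```

In this model, when cancellation does not trigger the callback, downloadingSnapshot is never cleared and every retry after the first only cancels. When it does, retries alternate between cancelling and starting a fresh copier. A real fix would also have to account for the possible race mentioned above, which this sketch does not model.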
--
This message was sent by Atlassian Jira
(v8.20.10#820010)