[jira] [Updated] (IGNITE-18428) After a RAFT snapshot install timed out, subsequent installs consistently failed

Roman Puchkovskiy (Jira) Mon, 19 Dec 2022 02:11:08 -0800


     [ 
https://issues.apache.org/jira/browse/IGNITE-18428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Roman Puchkovskiy updated IGNITE-18428:
---------------------------------------
    Attachment: test.log.txt

> After a RAFT snapshot install timed out, subsequent installs consistently 
> failed
> --------------------------------------------------------------------------------
>
>                 Key: IGNITE-18428
>                 URL: https://issues.apache.org/jira/browse/IGNITE-18428
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
>         Attachments: test.log.txt
>
>
> If a RAFT snapshot installation takes longer than the corresponding timeout
> (10 seconds in this case), a retry is attempted. If the retry finds an
> ongoing snapshot copier, it tries to cancel it, so that the installation
> starts over on the next retry.
> In one test run, the initial attempt to install a snapshot failed, but every
> subsequent attempt only tried to cancel the installation and none of them
> actually started another copier, which resulted in an infinite loop.
> Normally, {{onSnapshotLoadDone()}} is invoked even if the snapshot load has
> failed, in order to clean everything up and make the next install attempt
> possible. This cleanup includes nullifying the contents of
> {{downloadingSnapshot}} in {{SnapshotExecutorImpl}}. But this time, according
> to the log, {{onSnapshotLoadDone()}} was never invoked, so the old snapshot
> remained 'downloading' forever.
> This could have something to do with the fact that {{IncomingSnapshotCopier}}
> does not set its status to error (with {{setError()}}) on cancellation, as
> {{LocalSnapshotCopier}} does.
> There could also be a race.
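
The suspected failure mode can be illustrated with a toy model. This is only a sketch under assumptions, not the actual Ignite or JRaft code: the classes Copier and Executor, the notifiesOnCancel flag, and the simulate() helper are hypothetical names introduced for illustration. It assumes that an install attempt which finds a copier in downloadingSnapshot cancels it and gives up, and that only onSnapshotLoadDone() clears the field.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Toy model of the retry loop described above; not the real Ignite/JRaft code. */
public class StuckInstallSketch {
    /** Hypothetical stand-in for a snapshot copier. */
    static class Copier {
        private final Runnable onDone;
        private final boolean notifiesOnCancel;
        private boolean cancelled;

        Copier(Runnable onDone, boolean notifiesOnCancel) {
            this.onDone = onDone;
            this.notifiesOnCancel = notifiesOnCancel;
        }

        /** Cancels the copy; reports completion only if configured to do so. */
        void cancel() {
            if (cancelled) {
                return;
            }
            cancelled = true;
            if (notifiesOnCancel) {
                onDone.run();
            }
        }
    }

    /** Hypothetical stand-in for the executor tracking the in-flight download. */
    static class Executor {
        private final AtomicReference<Copier> downloadingSnapshot = new AtomicReference<>();
        private final boolean copierNotifiesOnCancel;
        int copiersStarted;

        Executor(boolean copierNotifiesOnCancel) {
            this.copierNotifiesOnCancel = copierNotifiesOnCancel;
        }

        /** One install attempt; returns true if a new copier was started. */
        boolean installSnapshot() {
            Copier current = downloadingSnapshot.get();
            if (current != null) {
                // An earlier download is still registered: cancel it and let the caller retry.
                current.cancel();
                return false;
            }
            downloadingSnapshot.set(new Copier(this::onSnapshotLoadDone, copierNotifiesOnCancel));
            copiersStarted++;
            return true;
        }

        /** Cleanup hook: makes the next install attempt possible. */
        void onSnapshotLoadDone() {
            downloadingSnapshot.set(null);
        }
    }

    /** Runs one initial attempt plus the given number of retries; returns copiers started. */
    static int simulate(boolean notifiesOnCancel, int retries) {
        Executor executor = new Executor(notifiesOnCancel);
        executor.installSnapshot();
        for (int i = 0; i < retries; i++) {
            executor.installSnapshot();
        }
        return executor.copiersStarted;
    }

    public static void main(String[] args) {
        System.out.println(simulate(true, 4));  // prints 3: cancel and restart alternate
        System.out.println(simulate(false, 4)); // prints 1: every retry only cancels
    }
}
```

In this model, when cancellation never reaches the cleanup hook, no retry ever starts a second copier, which matches the behaviour seen in the log. Whether the real code fails for this reason or because of a race remains to be confirmed.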



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
