[jira] [Updated] (FLINK-20465) Fail globally when not resuming from the latest checkpoint in regional failover

Till Rohrmann (Jira) Thu, 03 Dec 2020 01:10:38 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-20465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Till Rohrmann updated FLINK-20465:
----------------------------------
    Description: 
As a follow up for FLINK-20290 we should assert that we resume from the latest 
checkpoint when doing a regional failover in the {{SourceCoordinators}} in 
order to avoid losing input splits (see FLINK-20427). If the assumption does 
not hold, then we should fail the job globally so that we reset the master 
state to a consistent view of the state. Such a behaviour can act as a safety 
net in case that Flink ever tries to recover from not the latest available 
checkpoint.

One idea how to solve it is to remember the latest completed checkpoint id 
somewhere along the way to the 
{{SplitAssignmentTracker.getAndRemoveUncheckpointedAssignment}} and failing 
when the restored checkpoint id is smaller.

cc [~sewen], [~jqin]

  was:
As a follow up for FLINK-20290 we should assert that we resume from the latest 
checkpoint when doing a regional failover in the {{SourceCoordinators}} in 
order to avoid losing input splits (see FLINK-20427). If the assumption does 
not hold, then we should fail the job globally so that we reset the master 
state to a consistent view of the state. Such a behaviour can act as a safety 
net in case that Flink ever tries to recover from not the latest available 
checkpoint.

cc [~sewen], [~jqin]


> Fail globally when not resuming from the latest checkpoint in regional 
> failover
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-20465
>                 URL: https://issues.apache.org/jira/browse/FLINK-20465
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.12.0
>            Reporter: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.13.0, 1.12.1
>
>
> As a follow up for FLINK-20290 we should assert that we resume from the 
> latest checkpoint when doing a regional failover in the 
> {{SourceCoordinators}} in order to avoid losing input splits (see 
> FLINK-20427). If the assumption does not hold, then we should fail the job 
> globally so that we reset the master state to a consistent view of the state. 
> Such a behaviour can act as a safety net in case that Flink ever tries to 
> recover from not the latest available checkpoint.
> One idea how to solve it is to remember the latest completed checkpoint id 
> somewhere along the way to the 
> {{SplitAssignmentTracker.getAndRemoveUncheckpointedAssignment}} and failing 
> when the restored checkpoint id is smaller.
> cc [~sewen], [~jqin]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (FLINK-20465) Fail globally when not resuming from the latest checkpoint in regional failover

Reply via email to