[jira] [Updated] (FLINK-23553) Trigger global failover for synchronous savepoints

Dawid Wysakowicz (Jira) Tue, 03 Aug 2021 07:01:17 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-23553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Dawid Wysakowicz updated FLINK-23553:
-------------------------------------
    Description: 
We should trigger a global job failover if a 
{{stop-with-savepoint --drain}} fails.

The situation is clear-cut in drain mode. If the savepoint fails, we cannot 
continue, because we have already flushed all data and prepared the state for 
finishing. We cannot simply resume processing records.

It is more debatable without drain mode, where we could theoretically continue 
processing records. However, unifying the behavior of the two modes is also a 
good approach.

This task is about triggering the failover on the CheckpointCoordinator. We 
should make sure that once a synchronous checkpoint has been triggered, no 
newer checkpoints are scheduled.
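
The scheduling rule above can be sketched as a small gate. This is only a sketch under assumptions: {{SyncSavepointGate}} and all of its methods are hypothetical names invented for illustration, not Flink's actual CheckpointCoordinator API. Periodic checkpoints are refused as soon as a synchronous savepoint has been triggered.

```java
import java.util.OptionalLong;

/** Hypothetical sketch; this is NOT Flink's CheckpointCoordinator. */
final class SyncSavepointGate {
    private long nextCheckpointId = 1;
    private boolean synchronousSavepointPending = false;

    /** Starts a periodic checkpoint unless a synchronous savepoint is pending. */
    synchronized OptionalLong tryTriggerPeriodicCheckpoint() {
        if (synchronousSavepointPending) {
            // A synchronous savepoint is in flight: schedule nothing newer.
            return OptionalLong.empty();
        }
        return OptionalLong.of(nextCheckpointId++);
    }

    /** Starts a synchronous savepoint and blocks all later checkpoints. */
    synchronized long triggerSynchronousSavepoint() {
        if (synchronousSavepointPending) {
            throw new IllegalStateException("A synchronous savepoint is already pending");
        }
        synchronousSavepointPending = true;
        return nextCheckpointId++;
    }

    synchronized boolean isSynchronousSavepointPending() {
        return synchronousSavepointPending;
    }
}
```

The gate is deliberately never reopened in this sketch: the job either finishes with the savepoint or goes through a failover.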

If a synchronous savepoint fails for whatever reason we should trigger a global 
failover for the job.
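
A minimal sketch of this failure handling follows. It assumes a hypothetical {{SavepointFailureHandler}} that receives a global-failover callback; both the class and the callback are illustrative names, not taken from the Flink code base. A failed synchronous savepoint always escalates to a global failover, while failures of regular checkpoints are left to whatever policy already applies (not modeled here).

```java
import java.util.Objects;
import java.util.function.Consumer;

/** Hypothetical sketch; the class and callback are not Flink APIs. */
final class SavepointFailureHandler {
    // Callback that restarts the whole job; assumed to be supplied by the caller.
    private final Consumer<Throwable> globalFailover;

    SavepointFailureHandler(Consumer<Throwable> globalFailover) {
        this.globalFailover = Objects.requireNonNull(globalFailover);
    }

    /**
     * Handles a failed checkpoint or savepoint.
     *
     * @return true if a global failover was requested
     */
    boolean onCheckpointFailure(boolean synchronousSavepoint, Throwable cause) {
        if (!synchronousSavepoint) {
            // Regular checkpoints keep their existing failure policy (not modeled).
            return false;
        }
        // The same reaction applies with and without drain, which unifies both modes.
        globalFailover.accept(
                new IllegalStateException("Synchronous savepoint failed", cause));
        return true;
    }
}
```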

We might add safety guards (checkState calls on the Task for situations we 
missed) in a follow-up ticket.
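
One possible shape of such a guard is sketched below. The ticket does not say which situations the guards should cover, so the chosen invariant (no record is processed after the drain has started) is an assumption, and {{DrainedTaskGuard}} with its local {{checkState}} helper is a hypothetical stand-in rather than code from Flink's Task.

```java
/** Hypothetical guard; not part of Flink's Task class. */
final class DrainedTaskGuard {
    private boolean drained = false;
    private long count = 0;

    // Local helper in the style of the usual precondition idiom.
    private static void checkState(boolean condition, String message) {
        if (!condition) {
            throw new IllegalStateException(message);
        }
    }

    /** Marks that all data has been flushed for a stop-with-savepoint with drain. */
    void markDrained() {
        drained = true;
    }

    /** Counts a record; fails fast if one arrives after the drain. */
    void processRecord(String record) {
        checkState(!drained, "Record received after drain: " + record);
        count++;
    }

    long processedRecords() {
        return count;
    }
}
```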

  was:
We should trigger a global job failover if a 
{{stop-with-savepoint --drain}} fails.

The situation is clear-cut in drain mode. If the savepoint fails, we cannot 
continue, because we have already flushed all data and prepared the state for 
finishing. We cannot simply resume processing records.

It is more debatable without drain mode, where we could theoretically continue 
processing records. However, unifying the behavior of the two modes is also a 
good approach.

We can issue a global failover on the {{CheckpointCoordinator}}


> Trigger global failover for synchronous savepoints
> --------------------------------------------------
>
>                 Key: FLINK-23553
>                 URL: https://issues.apache.org/jira/browse/FLINK-23553
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.3, 1.13.1, 1.12.4
>            Reporter: Dawid Wysakowicz
>            Priority: Major
>             Fix For: 1.14.0



