[jira] [Updated] (FLINK-15731) Stop while Checkpoint is In-Progress Triggers Job Failover

Konstantin Knauf (Jira) Wed, 22 Jan 2020 08:32:53 -0800


     [ https://issues.apache.org/jira/browse/FLINK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Konstantin Knauf updated FLINK-15731:
-------------------------------------
    Description: 
Currently, when a job is {{stopped}}, in-progress checkpoints are aborted and 
a synchronous savepoint is started afterwards.

Since the number of tolerable checkpoint failures is 0 by default (see 
{{org.apache.flink.streaming.api.environment.CheckpointConfig#getTolerableCheckpointFailureNumber}}),
 this triggers a restart of the job if there are any ongoing checkpoints. 

As a consequence, if there is an ongoing checkpoint (or savepoint), the stop 
call triggers a failover of the job instead of stopping it. 
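
The interaction described above can be sketched with a toy model. This is 
only an illustration of the counting logic as described in this issue, not 
Flink's actual implementation: the class {{StopFailoverModel}} and its methods 
are hypothetical names, and the model assumes that every checkpoint aborted by 
the stop call is counted as one failure and that a failover happens as soon as 
the count exceeds the tolerable number.

```java
/** Toy model of the behaviour described in this issue; not Flink code. */
class StopFailoverModel {

    /** Hypothetical stand-in for the check against the tolerable failure number. */
    static boolean exceedsTolerance(int failedCheckpoints, int tolerableFailures) {
        return failedCheckpoints > tolerableFailures;
    }

    /**
     * Simulates a stop call: every in-progress checkpoint is aborted and
     * (by assumption) counted as a checkpoint failure.
     *
     * @return true if the job would fail over instead of stopping
     */
    static boolean stopTriggersFailover(int inProgressCheckpoints, int tolerableFailures) {
        int failures = 0;
        for (int i = 0; i < inProgressCheckpoints; i++) {
            failures++; // the aborted checkpoint is counted as a failure
            if (exceedsTolerance(failures, tolerableFailures)) {
                return true; // failover happens before the synchronous savepoint runs
            }
        }
        return false; // nothing exceeded the limit, the stop can proceed
    }

    public static void main(String[] args) {
        // No checkpoint in progress, default tolerance of 0: the stop proceeds.
        System.out.println(stopTriggersFailover(0, 0)); // false
        // One checkpoint in progress, default tolerance of 0: failover.
        System.out.println(stopTriggersFailover(1, 0)); // true
        // One checkpoint in progress, tolerance of 1: the stop proceeds.
        System.out.println(stopTriggersFailover(1, 1)); // false
    }
}
```

Under these assumptions, a single in-progress checkpoint is enough to turn a 
stop into a failover when the tolerance is left at 0.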

Possible options I see are: 

a) change default of tolerable checkpoint failures to at least the max number 
of concurrent checkpoints
b) do not count checkpoint failures due to the stop action when checking 
against tolerable checkpoint failures
c) do not abort pending checkpoints when stopping a job, but queue the 
synchronous savepoint after all current in-progress checkpoints
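
As a rough illustration of how options a) and b) would change the outcome, 
here is a minimal sketch. It is a toy model under stated assumptions, not a 
patch against Flink: {{StopOptionsModel}}, the {{AbortReason}} enum and its 
values, and the {{ignoreStopAborts}} flag are all hypothetical names. Option 
c) is not modelled, since with c) no checkpoint would be aborted in the first 
place.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model comparing options a) and b); all names here are hypothetical. */
class StopOptionsModel {

    /** Hypothetical reasons for a checkpoint ending unsuccessfully. */
    enum AbortReason { STOP_REQUESTED, DECLINED, EXPIRED }

    /**
     * Counts unsuccessful checkpoints against the tolerable number.
     *
     * @param ignoreStopAborts if true, aborts caused by the stop action are not counted (option b)
     * @return true if the job would fail over
     */
    static boolean failsOver(List<AbortReason> aborts, int tolerableFailures, boolean ignoreStopAborts) {
        int failures = 0;
        for (AbortReason reason : aborts) {
            if (ignoreStopAborts && reason == AbortReason.STOP_REQUESTED) {
                continue; // option b): do not count aborts caused by the stop action
            }
            failures++;
            if (failures > tolerableFailures) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A stop call arriving while two checkpoints are in progress.
        List<AbortReason> stopWithTwoInFlight =
                Arrays.asList(AbortReason.STOP_REQUESTED, AbortReason.STOP_REQUESTED);

        // Behaviour reported in this issue: tolerance 0, every abort counted.
        System.out.println(failsOver(stopWithTwoInFlight, 0, false)); // true

        // Option a): tolerance raised to the max number of concurrent checkpoints (2 here).
        System.out.println(failsOver(stopWithTwoInFlight, 2, false)); // false

        // Option b): tolerance stays 0, but stop-induced aborts are not counted.
        System.out.println(failsOver(stopWithTwoInFlight, 0, true)); // false

        // Option b) still fails over on a failure unrelated to the stop.
        System.out.println(failsOver(Arrays.asList(AbortReason.DECLINED), 0, true)); // true
    }
}
```

In this model, option a) also tolerates failures that are unrelated to the 
stop, whereas option b) keeps the strict default for those.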



  was:
Currently, when a job is {{stopped}}, in-progress checkpoints are aborted and 
a synchronous savepoint is started afterwards.

Since the number of tolerable checkpoint failures is 0 by default (see 
{{org.apache.flink.streaming.api.environment.CheckpointConfig#getTolerableCheckpointFailureNumber}}),
 this triggers a restart of the job if there are any ongoing checkpoints. 

As a consequence, if there is an ongoing checkpoint (or savepoint), the stop 
call triggers a failover of the job instead of stopping it. 

Possible Options would be: 

a) change default of tolerable checkpoint failures to at least the max number 
of concurrent checkpoints
b) do not count checkpoint failures due to the stop action when checking 
against tolerable checkpoint failures
c) do not abort pending checkpoints when stopping a job, but queue the 
synchronous savepoint after all current in-progress checkpoints




> Stop while Checkpoint is In-Progress Triggers Job Failover
> ----------------------------------------------------------
>
>                 Key: FLINK-15731
>                 URL: https://issues.apache.org/jira/browse/FLINK-15731
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.9.1, 1.10.0
>            Reporter: Konstantin Knauf
>            Priority: Critical
>
> Currently, when a job is {{stopped}}, in-progress checkpoints are aborted and 
> a synchronous savepoint is started afterwards.
> Since the number of tolerable checkpoint failures is 0 by default (see 
> {{org.apache.flink.streaming.api.environment.CheckpointConfig#getTolerableCheckpointFailureNumber}}),
>  this triggers a restart of the job if there are any ongoing checkpoints. 
> As a consequence, if there is an ongoing checkpoint (or savepoint), the stop 
> call triggers a failover of the job instead of stopping it. 
> Possible options I see are: 
> a) change default of tolerable checkpoint failures to at least the max number 
> of concurrent checkpoints
> b) do not count checkpoint failures due to the stop action when checking 
> against tolerable checkpoint failures
> c) do not abort pending checkpoints when stopping a job, but queue the 
> synchronous savepoint after all current in-progress checkpoints



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
