[jira] [Updated] (FLINK-22088) CheckpointCoordinator might not be able to abort triggering checkpoint if failover happens during triggering

Piotr Nowojski (Jira) Mon, 01 Nov 2021 03:59:06 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-22088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Piotr Nowojski updated FLINK-22088:
-----------------------------------
    Priority: Not a Priority  (was: Minor)

> CheckpointCoordinator might not be able to abort triggering checkpoint if 
> failover happens during triggering
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-22088
>                 URL: https://issues.apache.org/jira/browse/FLINK-22088
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.12.2, 1.13.0
>            Reporter: Yun Gao
>            Assignee: Yun Gao
>            Priority: Not a Priority
>              Labels: auto-unassigned, stale-assigned
>
> Currently, when a job fails over, it tries to cancel all pending 
> checkpoints via CheckpointCoordinatorDeActivator#jobStatusChanges -> 
> stopCheckpointScheduler, which also sets periodicScheduling to false. 
> If a checkpoint has just started triggering at this time, it might have 
> acquired all the executions to trigger before the failover and then go on 
> to trigger them. Ideally it should be aborted in createPendingCheckpoint -> 
> preCheckGlobalState. However, since the check and the creation of the 
> pending checkpoint are in two different lock scopes, there might be cases 
> where CheckpointCoordinator#stopCheckpointScheduler happens between the two 
> lock scopes. 
> We could tighten this check. However, since the executions would eventually 
> fail to trigger the checkpoint, it should not affect the correctness of the 
> job. Besides, even if we optimize it, there might still be cases where 
> triggering on the executions fails due to a concurrent failover. 
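
The race described above is a check-then-act gap between two lock scopes. The following is a minimal sketch of that pattern only, not Flink's actual code: the class `Coordinator`, the flag `schedulingEnabled`, the `recheckInSecondScope` switch and the `betweenScopes` hook are all hypothetical names introduced for illustration. The hook stands in for a failover that calls `stopCheckpointScheduler` after the first lock scope has been released and before the second one is entered.

```java
import java.util.ArrayList;
import java.util.List;

public class CheckpointRaceSketch {

    /** Hypothetical, simplified stand-in. This is NOT Flink's CheckpointCoordinator. */
    public static final class Coordinator {
        private final Object lock = new Object();
        private final List<Long> pending = new ArrayList<>();
        private final boolean recheckInSecondScope; // assumption: models a tightened check
        private boolean schedulingEnabled = true;   // assumption: models periodicScheduling

        public Coordinator(boolean recheckInSecondScope) {
            this.recheckInSecondScope = recheckInSecondScope;
        }

        /** Models the failover path: cancel pending checkpoints and stop scheduling. */
        public void stopCheckpointScheduler() {
            synchronized (lock) {
                schedulingEnabled = false;
                pending.clear();
            }
        }

        /**
         * Models triggering with the state check and the creation of the pending
         * checkpoint in two separate lock scopes. The hook runs between them so
         * the interleaving can be reproduced deterministically.
         */
        public boolean triggerCheckpoint(long id, Runnable betweenScopes) {
            synchronized (lock) {              // first scope: state check only
                if (!schedulingEnabled) {
                    return false;
                }
            }
            betweenScopes.run();               // a concurrent failover may land here
            synchronized (lock) {              // second scope: create pending checkpoint
                if (recheckInSecondScope && !schedulingEnabled) {
                    return false;              // tightened variant aborts here
                }
                pending.add(id);
                return true;
            }
        }

        public int pendingCount() {
            synchronized (lock) {
                return pending.size();
            }
        }
    }

    public static void main(String[] args) {
        Coordinator racy = new Coordinator(false);
        boolean racyCreated = racy.triggerCheckpoint(1L, racy::stopCheckpointScheduler);
        System.out.println("without recheck: created=" + racyCreated
                + ", pending=" + racy.pendingCount());

        Coordinator tightened = new Coordinator(true);
        boolean tightCreated = tightened.triggerCheckpoint(1L, tightened::stopCheckpointScheduler);
        System.out.println("with recheck: created=" + tightCreated
                + ", pending=" + tightened.pendingCount());
    }
}
```

In this sketch the variant without the recheck leaves one pending checkpoint behind after the scheduler was stopped, while the variant that repeats the check inside the second scope aborts. As the description notes, even the tightened check cannot cover a failover that happens after the pending checkpoint has been created, so the trigger on the executions may still fail.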



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
