[jira] [Updated] (FLINK-22088) CheckpointCoordinator might not be able to abort triggering checkpoint if failover happens during triggering

Flink Jira Bot (Jira) Tue, 08 Jun 2021 15:43:11 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-22088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Flink Jira Bot updated FLINK-22088:
-----------------------------------
    Labels: stale-assigned  (was: )

I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help 
the community manage its development. I see this issue is assigned but has not 
received an update in 14 days, so it has been labeled "stale-assigned".
If you are still working on the issue, please remove the label and add a 
comment updating the community on your progress.  If this issue is waiting on 
feedback, please consider this a reminder to the committer/reviewer. Flink is a 
very active project, and so we appreciate your patience.
If you are no longer working on the issue, please unassign yourself so someone 
else may work on it. If the "stale-assigned" label is not removed in 7 days, the 
issue will be automatically unassigned.


> CheckpointCoordinator might not be able to abort triggering checkpoint if 
> failover happens during triggering
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-22088
>                 URL: https://issues.apache.org/jira/browse/FLINK-22088
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.12.2, 1.13.0
>            Reporter: Yun Gao
>            Assignee: Yun Gao
>            Priority: Major
>              Labels: stale-assigned
>             Fix For: 1.14.0
>
>
> Currently, when a job fails over, 
> CheckpointCoordinatorDeActivator#jobStatusChanges calls 
> stopCheckpointScheduler, which tries to cancel all pending checkpoints 
> and also sets periodicScheduling to false. 
> If a checkpoint has just started triggering at that time, it might have 
> acquired all the executions to trigger before the failover and then go on to 
> trigger them. Ideally it should be aborted in createPendingCheckpoint -> 
> preCheckGlobalState. However, since the check and the creation of the pending 
> checkpoint happen in two different lock scopes, there might be cases where 
> CheckpointCoordinator#stopCheckpointScheduler runs between the two lock 
> scopes. 
> We may optimize this check. However, since the executions would eventually 
> fail to trigger the checkpoint, it should not affect the correctness of the 
> job. Besides, even if we optimize it, there might still be cases in which 
> triggering on the executions fails due to a concurrent failover. 
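
The window described above can be illustrated with a minimal sketch. This is
not Flink's actual code: the class, the method names triggerRacy and
triggerRechecked, and the betweenScopes hook are all hypothetical and exist
only to make the interleaving deterministic. It assumes the check and the
registration of the pending checkpoint each take the same lock separately.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, heavily simplified model of the race; not Flink's implementation.
class CheckpointRaceSketch {

    public static class Coordinator {
        private final Object lock = new Object();
        private boolean periodicScheduling = true;
        private final List<Long> pendingCheckpoints = new ArrayList<>();

        // Plays the role of stopCheckpointScheduler: stop scheduling, drop pending checkpoints.
        public void stopScheduler() {
            synchronized (lock) {
                periodicScheduling = false;
                pendingCheckpoints.clear();
            }
        }

        // Check in one lock scope, register in another. The hook runs in the
        // unprotected window between the two scopes (assumption for illustration).
        public boolean triggerRacy(long checkpointId, Runnable betweenScopes) {
            synchronized (lock) {
                if (!periodicScheduling) {
                    return false;
                }
            }
            betweenScopes.run();
            synchronized (lock) {
                pendingCheckpoints.add(checkpointId); // registered although scheduling has stopped
                return true;
            }
        }

        // Same flow, but the state is checked again inside the second scope.
        public boolean triggerRechecked(long checkpointId, Runnable betweenScopes) {
            synchronized (lock) {
                if (!periodicScheduling) {
                    return false;
                }
            }
            betweenScopes.run();
            synchronized (lock) {
                if (!periodicScheduling) {
                    return false; // failover happened in the window: abort
                }
                pendingCheckpoints.add(checkpointId);
                return true;
            }
        }

        public int pendingCount() {
            synchronized (lock) {
                return pendingCheckpoints.size();
            }
        }
    }
}
```

Passing coordinator::stopScheduler as the hook simulates a failover landing
exactly between the two scopes. In this model the racy variant leaves one
pending checkpoint behind, while the rechecked variant aborts. As the
description notes, closing this window would not cover a failover that
arrives after the pending checkpoint has been created.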



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
