[
https://issues.apache.org/jira/browse/FLINK-22088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Piotr Nowojski updated FLINK-22088:
-----------------------------------
Priority: Not a Priority (was: Minor)
> CheckpointCoordinator might not be able to abort triggering checkpoint if
> failover happens during triggering
> ------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-22088
> URL: https://issues.apache.org/jira/browse/FLINK-22088
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.12.2, 1.13.0
> Reporter: Yun Gao
> Assignee: Yun Gao
> Priority: Not a Priority
> Labels: auto-unassigned, stale-assigned
>
> Currently, when a job fails over,
> CheckpointCoordinatorDeActivator#jobStatusChanges calls
> stopCheckpointScheduler, which cancels all pending checkpoints and sets
> periodicScheduling to false.
> If a checkpoint has just started triggering at that moment, it may have
> acquired all the executions to trigger before the failover and then proceed
> with triggering. Ideally it should be aborted in createPendingCheckpoint ->
> preCheckGlobalState. However, the check and the creation of the pending
> checkpoint happen in two different lock scopes, so
> CheckpointCoordinator#stopCheckpointScheduler may run between the two
> scopes.
> We could tighten this check. However, since the executions would eventually
> fail to trigger the checkpoint, this should not affect the correctness of the
> job. Besides, even with a tighter check, triggering on the executions could
> still fail because of a concurrent failover.
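
The gap between the two lock scopes described above can be illustrated with a
minimal Java sketch. This is not Flink's actual CheckpointCoordinator code:
the class CoordinatorSketch, its fields, and its methods are hypothetical
stand-ins, and the betweenScopes hook exists only to make the interleaving
deterministic for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the coordinator; not Flink's real implementation. */
class CoordinatorSketch {
    private final Object lock = new Object();
    private final List<Long> pendingCheckpoints = new ArrayList<>();
    private boolean periodicScheduling = true;
    private long nextId = 1;

    /** Models the stop on failover: cancel pending checkpoints, disable scheduling. */
    void stopScheduler() {
        synchronized (lock) {
            periodicScheduling = false;
            pendingCheckpoints.clear();
        }
    }

    /**
     * Check and registration live in two separate lock scopes.
     * betweenScopes simulates whatever runs in the gap (assumed hook).
     */
    boolean triggerWithTwoScopes(Runnable betweenScopes) {
        synchronized (lock) {
            if (!periodicScheduling) {
                return false; // check passes only if scheduling is still active
            }
        }
        betweenScopes.run(); // a concurrent stop may land here
        synchronized (lock) {
            pendingCheckpoints.add(nextId++); // registered despite the stop
            return true;
        }
    }

    /** Check and registration in one lock scope, so a stop cannot interleave. */
    boolean triggerWithOneScope() {
        synchronized (lock) {
            if (!periodicScheduling) {
                return false;
            }
            pendingCheckpoints.add(nextId++);
            return true;
        }
    }

    int pendingCount() {
        synchronized (lock) {
            return pendingCheckpoints.size();
        }
    }
}
```

Calling triggerWithTwoScopes(sketch::stopScheduler) leaves one pending
checkpoint registered after the stop, while triggerWithOneScope() called after
stopScheduler() registers nothing. As the description notes, even a single
scope would not prevent the later trigger on the executions from failing
because of a concurrent failover.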