[
https://issues.apache.org/jira/browse/FLINK-22088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340083#comment-17340083
]
Piotr Nowojski commented on FLINK-22088:
----------------------------------------
{quote}
In this case, the checkpoint would finally fail due to expired, and with the
default failure tolerance number it should cause one failover.
{quote}
[~gaoyunhaii], could you elaborate on this? I'm not sure I understand it. If the
job is already failing over, why should this expiration cause another failover?
Do you mean that this checkpoint expiration for execution attempt N would
cause a failover of execution attempt N+1?
> CheckpointCoordinator might not be able to abort triggering checkpoint if
> failover happens during triggering
> ------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-22088
> URL: https://issues.apache.org/jira/browse/FLINK-22088
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.12.2, 1.13.0
> Reporter: Yun Gao
> Priority: Minor
>
> Currently, when a job fails over, it tries to cancel all pending
> checkpoints via CheckpointCoordinatorDeActivator#jobStatusChanges ->
> stopCheckpointScheduler, which cancels all pending checkpoints
> and also sets periodicScheduling to false.
> If a checkpoint has just started triggering at this time, it might have
> acquired all the executions to trigger before the failover and then start
> triggering. Ideally it should be aborted in createPendingCheckpoint ->
> preCheckGlobalState. However, since the check and the creation of the
> pending checkpoint happen in two different lock scopes, there might be
> cases where CheckpointCoordinator#stopCheckpointScheduler runs between the
> two lock scopes.
> We may optimize this check. However, since the executions would eventually
> fail to trigger the checkpoint, it should not affect the correctness of the
> job. Besides, even if we optimize it, there might still be cases where
> triggering on an execution fails due to a concurrent failover.
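
A minimal sketch of the interleaving described in the issue, for illustration only. It is not Flink's actual code: the class `Coordinator`, the flag `schedulerStopped`, and the methods `preCheck`, `createPending` and `stopScheduler` are hypothetical stand-ins for the pre-check, the pending-checkpoint creation and stopCheckpointScheduler. The race is replayed deterministically on one thread by running the stop step between the two lock scopes.

```java
import java.util.ArrayList;
import java.util.List;

public class TriggerRaceSketch {

    // Hypothetical, simplified coordinator; names are assumptions, not Flink APIs.
    static class Coordinator {
        private final Object lock = new Object();
        private boolean schedulerStopped = false;
        private final List<Long> pending = new ArrayList<>();

        // First lock scope: the global pre-check.
        boolean preCheck() {
            synchronized (lock) {
                return !schedulerStopped;
            }
        }

        // Second lock scope: registers the pending checkpoint.
        // With recheck == false it trusts the result of the earlier pre-check.
        boolean createPending(long id, boolean recheck) {
            synchronized (lock) {
                if (recheck && schedulerStopped) {
                    return false; // abort: the scheduler was stopped in between
                }
                pending.add(id);
                return true;
            }
        }

        // Failover path: cancels everything pending and stops scheduling.
        void stopScheduler() {
            synchronized (lock) {
                schedulerStopped = true;
                pending.clear();
            }
        }

        int pendingCount() {
            synchronized (lock) {
                return pending.size();
            }
        }
    }

    // Replays the interleaving: pre-check, then failover, then creation.
    // Returns how many pending checkpoints survive the failover.
    static int leakedAfterFailover(boolean recheck) {
        Coordinator c = new Coordinator();
        if (c.preCheck()) {
            c.stopScheduler(); // the failover lands between the two lock scopes
            c.createPending(1L, recheck);
        }
        return c.pendingCount();
    }

    public static void main(String[] args) {
        System.out.println(leakedAfterFailover(false)); // prints 1
        System.out.println(leakedAfterFailover(true));  // prints 0
    }
}
```

Repeating the check inside the second lock scope closes this particular window. As the description notes, triggering on the executions could still fail afterwards because of a concurrent failover.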