[jira] [Commented] (FLINK-22088) CheckpointCoordinator might not be able to abort triggering checkpoint if failover happens during triggering

Yun Gao (Jira) Thu, 20 May 2021 01:34:26 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-22088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348149#comment-17348149
 ]


Yun Gao commented on FLINK-22088:
---------------------------------

Hi [~pnowojski], very sorry for missing the notification and replying late. I 
think the fix is not too difficult: we need to ensure that the global state 
check is in the same lock area as the recording of the pending checkpoint. Then 
either the checkpoint is not recorded after the job has failed, or the 
checkpoint is recorded before the job fails and the job failure can abort it. 
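
As a rough illustration of the invariant above, here is a minimal, self-contained 
sketch. It is a toy model, not Flink's actual CheckpointCoordinator: 
ToyCoordinator, tryRecord and failJobAndAbortPending are hypothetical names made 
up for this example. It only shows that when the check and the recording share 
one lock, a failover either sees the checkpoint and aborts it, or prevents it 
from being recorded at all.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only; every name here is hypothetical and not part of Flink.
class ToyCoordinator {
    private final Object lock = new Object();
    private final Set<Long> pendingCheckpoints = new HashSet<>();
    private boolean jobFailed = false;

    /** Checks the global state and records the checkpoint under the same lock. */
    boolean tryRecord(long checkpointId) {
        synchronized (lock) {
            if (jobFailed) {
                // Outcome 1: the job already failed, so nothing is recorded.
                return false;
            }
            pendingCheckpoints.add(checkpointId);
            return true;
        }
    }

    /** Failover path: marks the job as failed and aborts what was recorded so far. */
    int failJobAndAbortPending() {
        synchronized (lock) {
            jobFailed = true;
            // Outcome 2: anything recorded before the failure is visible here.
            int aborted = pendingCheckpoints.size();
            pendingCheckpoints.clear();
            return aborted;
        }
    }

    int pendingCount() {
        synchronized (lock) {
            return pendingCheckpoints.size();
        }
    }
}
```

Because both methods take the same lock, there is no interleaving in which a 
checkpoint is recorded after the failure and left behind unaborted.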

Since the call to _statTracker.report()_ must complete before we exit the lock 
area that adds the pending checkpoint to _pendingCheckpoints_ (otherwise, if 
the pending checkpoint is aborted after we leave the lock, it would try to 
report a failed status, and if we have not tracked the pending checkpoint yet, 
there would be errors), there are two options:
 * Move the whole process into the lock area:
{code:java}
synchronized (lock) {
    try {
        check global state;
    } catch (Exception e) {
        throw e;
    }
    pendingCheckpoint = new PendingCheckpoint();
    trackPendingCheckpoints(pendingCheckpoint);
    ....
}
{code}

 * Move the check into the lock area, after the pending checkpoint is created 
and tracked.
{code:java}
pendingCheckpoint = new PendingCheckpoint();
trackPendingCheckpoints(pendingCheckpoint);

synchronized (lock) {
    try {
        check global state;
    } catch (Exception e) {
        // The checkpoint is already tracked, so aborting it can report the failure safely.
        pendingCheckpoint.abort(...);
        throw e;
    }
    ....
}
{code}

I lean toward the second option, since it does not increase the time spent in 
the lock area. 
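
To make the ordering in the second option more concrete, here is a small 
self-contained sketch. It is only a toy model under stated assumptions, not the 
real Flink code: ToyOptionTwo, trigger, failJob and the three sets are 
hypothetical stand-ins for the coordinator, the stats tracker and 
_pendingCheckpoints_. The point it tries to show is that tracking happens 
before the lock is taken, so an abort inside the lock can always report the 
failure.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the second option; all names are hypothetical, not Flink's API.
class ToyOptionTwo {
    private final Object lock = new Object();
    // Stands in for the stats tracker; written outside the lock, so it is concurrent.
    private final Set<Long> trackedByStats = ConcurrentHashMap.newKeySet();
    private final Set<Long> pendingCheckpoints = new HashSet<>();
    private final Set<Long> reportedFailed = new HashSet<>();
    private boolean jobFailed = false;

    boolean trigger(long checkpointId) {
        // Outside the lock: create the checkpoint and register it with the tracker.
        trackedByStats.add(checkpointId);
        synchronized (lock) {
            if (jobFailed) {
                // The global state check failed: abort the already tracked checkpoint.
                abort(checkpointId);
                return false;
            }
            pendingCheckpoints.add(checkpointId);
            return true;
        }
    }

    /** Failover path: aborts every checkpoint recorded so far. */
    void failJob() {
        synchronized (lock) {
            jobFailed = true;
            // Copy first, because abort() removes entries from pendingCheckpoints.
            for (Long id : new ArrayList<>(pendingCheckpoints)) {
                abort(id);
            }
        }
    }

    // Callers must hold the lock.
    private void abort(long checkpointId) {
        if (!trackedByStats.contains(checkpointId)) {
            // This is the error case the ordering is meant to rule out.
            throw new IllegalStateException("aborted before being tracked");
        }
        pendingCheckpoints.remove(checkpointId);
        reportedFailed.add(checkpointId);
    }

    boolean isPending(long checkpointId) {
        synchronized (lock) {
            return pendingCheckpoints.contains(checkpointId);
        }
    }

    boolean isReportedFailed(long checkpointId) {
        synchronized (lock) {
            return reportedFailed.contains(checkpointId);
        }
    }
}
```

In this sketch only the state check and the insertion into the pending set run 
under the lock; creation and tracking stay outside it, which is the reason for 
preferring this variant.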

> CheckpointCoordinator might not be able to abort triggering checkpoint if 
> failover happens during triggering
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-22088
>                 URL: https://issues.apache.org/jira/browse/FLINK-22088
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.12.2, 1.13.0
>            Reporter: Yun Gao
>            Priority: Major
>             Fix For: 1.14.0
>
>
> Currently, when a job fails over, 
> CheckpointCoordinatorDeActivator#jobStatusChanges -> stopCheckpointScheduler 
> tries to cancel all the pending checkpoints and also sets periodicScheduling 
> to false. 
> If a checkpoint has just started triggering at this time, it might acquire 
> all the executions to trigger before the failover and then start triggering. 
> Ideally it should be aborted in createPendingCheckpoint -> 
> preCheckGlobalState. However, since the check and the creation of the pending 
> checkpoint are in two different lock scopes, there might be cases where 
> CheckpointCoordinator#stopCheckpointScheduler runs between the two lock 
> scopes. 
> We may optimize this check; however, since the executions would eventually 
> fail to trigger the checkpoint, it should not affect the correctness of the 
> job. Besides, even if we optimize it, there might still be cases where 
> triggering on the executions fails due to a concurrent failover. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
