[ 
https://issues.apache.org/jira/browse/FLINK-31077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17688858#comment-17688858
 ] 

Zhu Zhu edited comment on FLINK-31077 at 2/20/23 9:52 AM:
----------------------------------------------------------

Thanks for reporting this issue! [~JunRuiLi]
-I think it is indeed a problem. Considering the case of stop-with-savepoint, 
it's possible that the final savepoint is lost if the savepoint is considered 
to be done and the job gets terminated, before it is recorded to HA.-
Do you want to fix it?

Correction: The problem does not affect savepoints, which do not rely on 
CompletedCheckpointStore. So the actual problem is that the status query for a 
[manually triggered 
checkpoint|https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jobs-jobid-checkpoints]
 returns "COMPLETED" while the web UI shows "FAILED", which may 
confuse users.


was (Author: zhuzh):
Thanks for reporting this issue! [~JunRuiLi]
I think it is indeed a problem. Considering the case of stop-with-savepoint, 
it's possible that the final savepoint is lost if the savepoint is considered 
to be done and the job gets terminated, before it is recorded to HA.
Do you want to fix it?

> Trigger checkpoint failed but it were shown as COMPLETED by rest API
> --------------------------------------------------------------------
>
>                 Key: FLINK-31077
>                 URL: https://issues.apache.org/jira/browse/FLINK-31077
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.17.0, 1.15.3, 1.16.1
>            Reporter: Junrui Li
>            Assignee: Junrui Li
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.17.0, 1.15.4, 1.16.2
>
>
> Currently, we can trigger a checkpoint and poll its status via the REST API 
> until it is finished (see FLINK-27101). However, even if the status returned 
> by the REST API is COMPLETED, it does not mean that the checkpoint has really 
> completed. If an exception occurs after the pendingCheckpoint is marked 
> completed 
> ([here|https://github.com/apache/flink/blob/bf0ad52cbcb052961c54c94c7013f5ac0110ef8a/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1309]),
>  the checkpoint is not written to the HA service and we cannot fail over from 
> this checkpoint.
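
The ordering problem described above can be illustrated with a small toy model. This is only a sketch, not Flink code: `ToyCheckpointStore`, `complete_early`, `complete_late`, and `run` are hypothetical names, the HA failure is simulated with a flag, and the "late" ordering is one possible way to avoid the mismatch rather than the fix actually merged for this issue.

```python
class HaStoreError(Exception):
    """Raised when the simulated HA store cannot persist a checkpoint."""


class ToyCheckpointStore:
    """Hypothetical stand-in for a CompletedCheckpointStore backed by HA."""

    def __init__(self, fail_on_add=False):
        self.fail_on_add = fail_on_add
        self.checkpoints = []

    def add(self, checkpoint_id):
        if self.fail_on_add:
            raise HaStoreError(f"could not persist checkpoint {checkpoint_id}")
        self.checkpoints.append(checkpoint_id)


def complete_early(checkpoint_id, store, status):
    """Problematic ordering: report COMPLETED before the HA write."""
    status[checkpoint_id] = "COMPLETED"  # what a status query would return
    store.add(checkpoint_id)             # may raise after the status is set


def complete_late(checkpoint_id, store, status):
    """Alternative ordering: report COMPLETED only after the HA write."""
    try:
        store.add(checkpoint_id)
    except HaStoreError:
        status[checkpoint_id] = "FAILED"
        raise
    status[checkpoint_id] = "COMPLETED"


def run(complete, checkpoint_id=1):
    """Run one completion attempt against a store whose writes always fail."""
    store = ToyCheckpointStore(fail_on_add=True)
    status = {}
    try:
        complete(checkpoint_id, store, status)
    except HaStoreError:
        pass  # the job would handle the failure elsewhere
    return status.get(checkpoint_id), store.checkpoints


print(run(complete_early))  # ('COMPLETED', [])
print(run(complete_late))   # ('FAILED', [])
```

In the first case the reported status is COMPLETED although nothing was persisted, which corresponds to the mismatch reported here; in the second case the status is only set once the outcome of the write is known.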



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
