[jira] [Comment Edited] (FLINK-31077) Trigger checkpoint failed but it were shown as COMPLETED by rest API

Zhu Zhu (Jira) Mon, 20 Feb 2023 01:54:11 -0800


    [ https://issues.apache.org/jira/browse/FLINK-31077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17688858#comment-17688858 ]


Zhu Zhu edited comment on FLINK-31077 at 2/20/23 9:53 AM:
----------------------------------------------------------

Thanks for reporting this issue! [~JunRuiLi]
-I think it is indeed a problem. Considering the case of stop-with-savepoint, 
it's possible that the final savepoint is lost if the savepoint is considered 
to be done and the job gets terminated, before it is recorded to HA.-
Do you want to fix it?

Correction: The problem does not affect savepoints, which do not rely on 
CompletedCheckpointStore. So the actual problem is that the status query for a 
[manually triggered checkpoint|https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jobs-jobid-checkpoints] 
returns "COMPLETED" while the web UI shows "FAILED", which may confuse users. 
Therefore the problem is not that critical. I will lower its priority.
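
For illustration only, here is a minimal sketch of how a client could guard against the mismatch described above. It assumes the trigger-status response carries `status.id` and `operation.checkpointId`, and that the checkpoint statistics response lists entries under `history` with `id` and `status` fields. These field names and the helper functions are assumptions for the example, not taken from this issue, and the inputs are already-parsed JSON dictionaries rather than live REST calls.

```python
from typing import Optional


def trigger_checkpoint_id(trigger_status: dict) -> Optional[int]:
    """Return the checkpoint id if the trigger operation reports COMPLETED."""
    # Assumed response shape: {"status": {"id": ...}, "operation": {"checkpointId": ...}}
    if trigger_status.get("status", {}).get("id") != "COMPLETED":
        return None
    return trigger_status.get("operation", {}).get("checkpointId")


def recorded_status(checkpoint_stats: dict, checkpoint_id: int) -> Optional[str]:
    """Look up the status recorded for a checkpoint in the statistics history."""
    # Assumed response shape: {"history": [{"id": ..., "status": ...}, ...]}
    for entry in checkpoint_stats.get("history", []):
        if entry.get("id") == checkpoint_id:
            return entry.get("status")
    return None


def is_really_completed(trigger_status: dict, checkpoint_stats: dict) -> bool:
    """Treat a checkpoint as completed only if both views agree."""
    checkpoint_id = trigger_checkpoint_id(trigger_status)
    if checkpoint_id is None:
        return False
    return recorded_status(checkpoint_stats, checkpoint_id) == "COMPLETED"
```

With such a check, a trigger result of "COMPLETED" whose statistics entry says "FAILED" would be treated as not completed.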


> Trigger checkpoint failed but it were shown as COMPLETED by rest API
> --------------------------------------------------------------------
>
>                 Key: FLINK-31077
>                 URL: https://issues.apache.org/jira/browse/FLINK-31077
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.17.0, 1.15.3, 1.16.1
>            Reporter: Junrui Li
>            Assignee: Junrui Li
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.17.0, 1.15.4, 1.16.2
>
>
> Currently, we can trigger a checkpoint through the REST API and poll its 
> status until it finishes, as introduced in FLINK-27101. However, even if the 
> status returned by the REST API is COMPLETED, the checkpoint may not really 
> be completed. If an exception occurs after the pendingCheckpoint is marked as 
> completed 
> ([here|https://github.com/apache/flink/blob/bf0ad52cbcb052961c54c94c7013f5ac0110ef8a/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1309]), 
> the checkpoint is not written to the HA service and we cannot fail over from 
> this checkpoint.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
