[jira] [Commented] (FLINK-19778) Failed job reinitiated with wrong checkpoint after a ZK reconnection

Paul Lin (Jira) Fri, 23 Oct 2020 00:38:30 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-19778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219521#comment-17219521
 ]


Paul Lin commented on FLINK-19778:
----------------------------------

I think this happened because the job transitioned to the FAILED state, which 
is a globally terminal state, so the checkpoint entries were removed from 
ZooKeeper.

Unfortunately, the job was canceled afterwards and the application path in 
ZooKeeper was cleaned up, so we can't get any more information from ZooKeeper.
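
To make the suspected sequence concrete, here is a toy model in Python. It is 
not Flink's actual code: ToyCheckpointStore, pick_restore_point and the paths 
are hypothetical names, and the rule "discard checkpoint entries once the job 
reaches a globally terminal state" is my understanding of the checkpoint 
store's shutdown behaviour, not something verified against the 1.11.0 source.

```python
from enum import Enum, auto


class JobStatus(Enum):
    """Small subset of job states, enough for this illustration."""
    RUNNING = auto()
    SUSPENDED = auto()   # locally terminal: another JobManager may recover the job
    FAILED = auto()      # globally terminal
    CANCELED = auto()    # globally terminal
    FINISHED = auto()    # globally terminal

    @property
    def is_globally_terminal(self):
        return self in _GLOBALLY_TERMINAL


_GLOBALLY_TERMINAL = {JobStatus.FAILED, JobStatus.CANCELED, JobStatus.FINISHED}


class ToyCheckpointStore:
    """Hypothetical stand-in for the checkpoint entries kept in ZooKeeper."""

    def __init__(self):
        self.entries = {}  # checkpoint id -> state handle path

    def add(self, checkpoint_id, handle):
        self.entries[checkpoint_id] = handle

    def shutdown(self, status):
        # Assumption: entries are only discarded for globally terminal states.
        if status.is_globally_terminal:
            self.entries.clear()

    def latest(self):
        if not self.entries:
            return None
        return self.entries[max(self.entries)]


def pick_restore_point(store, savepoint_path):
    """Prefer the latest checkpoint; fall back to the configured savepoint."""
    latest = store.latest()
    return latest if latest is not None else savepoint_path


if __name__ == "__main__":
    store = ToyCheckpointStore()
    store.add(41, "hdfs:///checkpoints/chk-41")  # made-up paths
    store.add(42, "hdfs:///checkpoints/chk-42")

    store.shutdown(JobStatus.FAILED)  # job reached FAILED, entries are dropped
    print(pick_restore_point(store, "hdfs:///savepoints/sp-1"))
```

In this model, a job that only reaches SUSPENDED keeps its entries and would 
restore from chk-42, while a job that reaches FAILED and is then reinitiated 
finds the store empty and falls back to the older savepoint, which matches the 
rewind we observed.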

> Failed job reinitiated with wrong checkpoint after a ZK reconnection
> --------------------------------------------------------------------
>
>                 Key: FLINK-19778
>                 URL: https://issues.apache.org/jira/browse/FLINK-19778
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.0
>            Reporter: Paul Lin
>            Priority: Critical
>         Attachments: jm_log
>
>
> We have a Flink 1.11.0 job running on YARN that reached the FAILED state 
> because its JobManager lost leadership during a ZK full GC. After the ZK 
> connection recovered, the job was somehow reinitiated with no checkpoints 
> found in ZK, so an earlier savepoint was used to restore the job, which 
> rewound it unexpectedly.
>   
>  For details please see the jobmanager logs in the attachment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
