[
https://issues.apache.org/jira/browse/FLINK-21251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277946#comment-17277946
]
Paul Lin commented on FLINK-21251:
----------------------------------
It turns out that the user had disabled externalized checkpoints (doing so is
forbidden in our company), so the jobmanager did not retain the checkpoint
metadata when the job terminated.
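For context, in the Flink 1.7.x line externalized checkpoints are enabled
programmatically via CheckpointConfig; with them disabled, checkpoint data is
cleaned up when the job reaches a terminal state, which would explain the
missing _metadata. A minimal sketch of enabling retention (the checkpoint
interval here is illustrative, not taken from this issue):

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RetainedCheckpointsExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 60 seconds (illustrative interval).
        env.enableCheckpointing(60_000);

        // Retain the externalized checkpoint (including its _metadata file)
        // even when the job is cancelled or fails, instead of deleting it
        // during jobmanager shutdown.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // ... build and execute the job topology here ...
    }
}
```

With RETAIN_ON_CANCELLATION (or RETAIN_ON_FAILURE semantics on job failure),
the checkpoint directory survives job termination and can be used to restore.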
> Last valid checkpoint metadata lost after job exits restart loop
> ----------------------------------------------------------------
>
> Key: FLINK-21251
> URL: https://issues.apache.org/jira/browse/FLINK-21251
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.7.2
> Reporter: Paul Lin
> Priority: Critical
> Attachments: ch-4585 content.png, checkpoint dir.png, jm_logs
>
>
> We have a Flink job on a relatively old version, 1.7.1, that failed with no
> valid checkpoint to restore from. The job was first affected by a Kafka
> network instability and fell into a restart loop under a policy of 3 restarts
> in 5 minutes. After the restarts were exhausted, the job transitioned into
> the terminal state FAILED and exited. The problem is that the last valid
> checkpoint, chk-4585, which had been restored multiple times during the
> restarts, was corrupted (no _metadata file) after the job exited.
> I've checked the checkpoint dir on HDFS and found that chk-4585, which
> finished at 12:16, was modified at 12:23 while the jobmanager was shutting
> down, with many error logs saying that the deletion of pending checkpoints
> had somehow failed. So I suspect that the checkpoint metadata was
> unexpectedly deleted by the jobmanager.
> The jobmanager logs are attached.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)