[
https://issues.apache.org/jira/browse/FLINK-21251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277946#comment-17277946
]
Paul Lin commented on FLINK-21251:
----------------------------------
It turns out that the user had disabled externalized checkpoints (doing so is
forbidden in our company), so the jobmanager did not retain the checkpoint
metadata when the job terminated.
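For context, in the Flink 1.7.x line externalized checkpoints are enabled
programmatically via CheckpointConfig; with them disabled, checkpoint data is
cleaned up when the job reaches a terminal state, which would explain the
missing _metadata. A minimal sketch of enabling retention (the checkpoint
interval here is illustrative, not taken from this issue):

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RetainedCheckpointsExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 60 seconds (illustrative interval).
        env.enableCheckpointing(60_000);

        // Retain the externalized checkpoint (including its _metadata file)
        // even when the job is cancelled or fails, instead of deleting it
        // during jobmanager shutdown.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // ... build and execute the job topology here ...
    }
}
```

With RETAIN_ON_CANCELLATION (or RETAIN_ON_FAILURE semantics on job failure),
the checkpoint directory survives job termination and can be used to restore.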
> Last valid checkpoint metadata lost after job exits restart loop
> ----------------------------------------------------------------
>
> Key: FLINK-21251
> URL: https://issues.apache.org/jira/browse/FLINK-21251
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.7.2
> Reporter: Paul Lin
> Priority: Critical
> Attachments: ch-4585 content.png, checkpoint dir.png, jm_logs
>
>
> We have a Flink job on a relatively old version, 1.7.1, that failed with no
> valid checkpoint to restore from. The job was first affected by a Kafka
> network instability and fell into a restart loop under a policy of 3 restarts
> in 5 minutes. After the restarts were exhausted, the job transitioned into
> the terminal state FAILED and exited. The problem is that the last valid
> checkpoint, chk-4585, which had been restored multiple times during the
> restarts, was corrupted (no _metadata file) after the job exited.
> I've checked the checkpoint dir on HDFS and found that chk-4585, which
> finished at 12:16, was modified at 12:23 while the jobmanager was shutting
> down, with many error logs saying that the deletion of pending checkpoints
> had somehow failed. So I suspect that the checkpoint metadata was
> unexpectedly deleted by the jobmanager.
> The jobmanager logs are attached.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)