[
https://issues.apache.org/jira/browse/FLINK-28431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563540#comment-17563540
]
aresyhzhang commented on FLINK-28431:
-------------------------------------
I have not tried the latest Flink 1.15, but the problem also appears with Flink 1.14.2.
> CompletedCheckPoints stored on ZooKeeper is not up-to-date, when JobManager
> is restarted it fails to recover the job due to "checkpoint FileNotFound
> exception"
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-28431
> URL: https://issues.apache.org/jira/browse/FLINK-28431
> Project: Flink
> Issue Type: Bug
> Affects Versions: 1.13.2
> Environment: flink:1.13.2
> java:1.8
> Reporter: aresyhzhang
> Priority: Major
> Labels: checkpoint, ha, native-kubernetes
> Attachments: error.log
>
>
> We run many Flink 1.13.2 clusters in native Kubernetes session mode; some
> clusters run normally for 180 days, others for only 30 days.
> The following uses one failing cluster,
> flink-k8s-session-opd-public-1132, as an example.
> Problem description:
> When the JobManager restarts, job recovery fails with
> File does not exist:
> /home/flink/recovery/flink/flink-k8s-session-opd-public-1132/completedCheckpoint86fce98d7e4a
> As a result the whole Flink cluster cannot start. Because the other jobs in
> the session cluster are also blocked from starting, the impact is very
> serious.
> Some auxiliary information:
> 1. Flink cluster id: flink-k8s-session-opd-public-1132
> 2. high-availability.storageDir of the cluster configuration:
> hdfs://neophdfsv2flink/home/flink/recovery/
> 3. Failing job id: 18193cde2c359f492f76c8ce4cd20271
> 4. There was a similar issue before, FLINK-8770, but it was closed without
> being resolved.
> 5. The complete JobManager log is uploaded as an attachment.
> My investigation so far:
> 1. Inspect the ZooKeeper nodes for job id
> 18193cde2c359f492f76c8ce4cd20271:
> [zk: localhost:2181(CONNECTED) 17] ls
> /flink/flink/flink-k8s-session-opd-public-1132/checkpoints/18193cde2c359f492f76c8ce4cd20271
> [0000000000000025852, 0000000000000025851]
> [zk: localhost:2181(CONNECTED) 14] get
> /flink/flink/flink-k8s-session-opd-public-1132/checkpoints/18193cde2c359f492f76c8ce4cd20271/0000000000000025852
> (binary data: a Java-serialized
> org.apache.flink.runtime.state.RetrievableStreamStateHandle wrapping an
> org.apache.flink.runtime.state.filesystem.FileStateHandle whose filePath is
> hdfs://neophdfsv2flink/home/flink/recovery/flink/flink-k8s-session-opd-public-1132/completedCheckpoint86fce98d7e4a)
> cZxid = 0x1070932e2
> ctime = Wed Jul 06 02:28:51 UTC 2022
> mZxid = 0x1070932e2
> mtime = Wed Jul 06 02:28:51 UTC 2022
> pZxid = 0x30001c957
> cversion = 222
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x0
> dataLength = 545
> numChildren = 0
> I am confident the ZooKeeper ensemble itself is healthy: more than 10 Flink
> clusters use the same ZooKeeper, and only this cluster has the problem; the
> others are fine.
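> For reference, the hdfs:// path embedded in each checkpoint znode can also be
> extracted programmatically instead of decoding the raw zkCli dump by eye. The
> following is only a sketch and not part of our setup: it uses Apache Curator
> with the connection string and znode path from this cluster, and it assumes
> the path is stored as a plain java-serialized String inside the state handle
> bytes (which matches the dump above).
>
> // Sketch: list the checkpoint znodes of one job and print the HDFS path each
> // one references. Assumes Curator is on the classpath; the connection string
> // and znode path are the values from this ticket and must be adjusted.
> import java.nio.charset.StandardCharsets;
> import java.util.List;
>
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.CuratorFrameworkFactory;
> import org.apache.curator.retry.ExponentialBackoffRetry;
>
> public class CheckpointZnodeInspector {
>
>     public static void main(String[] args) throws Exception {
>         String checkpointsPath =
>                 "/flink/flink/flink-k8s-session-opd-public-1132/checkpoints/"
>                         + "18193cde2c359f492f76c8ce4cd20271";
>
>         CuratorFramework client = CuratorFrameworkFactory.newClient(
>                 "localhost:2181", new ExponentialBackoffRetry(1000, 3));
>         client.start();
>         try {
>             List<String> children = client.getChildren().forPath(checkpointsPath);
>             for (String child : children) {
>                 byte[] data = client.getData().forPath(checkpointsPath + "/" + child);
>                 System.out.println(child + " -> " + extractHdfsPath(data));
>             }
>         } finally {
>             client.close();
>         }
>     }
>
>     /**
>      * Pulls the embedded "hdfs://..." string out of the serialized state handle.
>      * Assumes the path is stored as a java-serialized String, i.e. the two bytes
>      * before the text hold its UTF length (true for the dump in this ticket).
>      */
>     private static String extractHdfsPath(byte[] data) {
>         String raw = new String(data, StandardCharsets.ISO_8859_1);
>         int start = raw.indexOf("hdfs://");
>         if (start < 2) {
>             return "<no hdfs path found>";
>         }
>         int length = ((data[start - 2] & 0xff) << 8) | (data[start - 1] & 0xff);
>         return raw.substring(start, Math.min(start + length, raw.length()));
>     }
> }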
> 2. Check the HDFS audit log for the path referenced by ZooKeeper:
> ./hdfs-audit.log.1:2022-07-06 10:28:51,752 INFO FSNamesystem.audit:
> allowed=true [email protected] (auth:KERBEROS) ip=/10.91.136.213
> cmd= create
> src=/home/flink/recovery/flink/flink-k8s-session-opd-public-1132/completedCheckpoint86fce98d7e4a
> dst=null perm=flinkuser:flinkuser:rw-r--r-- proto=rpc
> ./hdfs-audit.log.1:2022-07-06 10:29:26,588 INFO FSNamesystem.audit:
> allowed=true [email protected] (auth:KERBEROS) ip=/10.91.136.213
> cmd= delete
> src=/home/flink/recovery/flink/flink-k8s-session-opd-public-1132/completedCheckpoint86fce98d7e4a
> dst=null perm=null proto=rpc
> I do not understand why Flink created this file and then deleted it without
> updating the metadata in ZooKeeper; on restart the JobManager cannot find the
> referenced file and keeps restarting.
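> To make the mismatch concrete, the path that ZooKeeper still references can be
> checked against HDFS directly. Again only a sketch, assuming the Hadoop client
> configuration and Kerberos login are already available in the environment that
> runs it; the URI is the one from this ticket.
>
> // Sketch: check whether the completedCheckpoint file referenced from ZooKeeper
> // still exists on HDFS. If this prints "exists: false" while the znode above
> // still points at the path, JobManager recovery will fail with the
> // FileNotFoundException seen in the attached log.
> import java.net.URI;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class CompletedCheckpointExistsCheck {
>
>     public static void main(String[] args) throws Exception {
>         String checkpointUri =
>                 "hdfs://neophdfsv2flink/home/flink/recovery/flink/"
>                         + "flink-k8s-session-opd-public-1132/completedCheckpoint86fce98d7e4a";
>
>         Configuration conf = new Configuration();
>         try (FileSystem fs = FileSystem.get(URI.create(checkpointUri), conf)) {
>             Path path = new Path(checkpointUri);
>             System.out.println(path + " exists: " + fs.exists(path));
>         }
>     }
> }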
--
This message was sent by Atlassian Jira
(v8.20.10#820010)