[
https://issues.apache.org/jira/browse/FLINK-28431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563540#comment-17563540
]
aresyhzhang commented on FLINK-28431:
-------------------------------------
I have not tried the latest Flink 1.15, but the problem also appears with Flink 1.14.2.
> CompletedCheckPoints stored on ZooKeeper is not up-to-date, when JobManager
> is restarted it fails to recover the job due to "checkpoint FileNotFound
> exception"
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-28431
> URL: https://issues.apache.org/jira/browse/FLINK-28431
> Project: Flink
> Issue Type: Bug
> Affects Versions: 1.13.2
> Environment: flink:1.13.2
> java:1.8
> Reporter: aresyhzhang
> Priority: Major
> Labels: checkpoint, ha, native-kubernetes
> Attachments: error.log
>
>
> We run many Flink 1.13.2 clusters in native Kubernetes session mode; some
> clusters run normally for 180 days, others for only 30 days.
> The following uses one failing cluster,
> flink-k8s-session-opd-public-1132, as an example.
> Problem description:
> When the JobManager restarts, job recovery fails with
> File does not exist:
> /home/flink/recovery/flink/flink-k8s-session-opd-public-1132/completedCheckpoint86fce98d7e4a
> As a result the whole Flink cluster cannot start. Because the other jobs in
> the session cluster are also blocked from starting, the impact is very
> serious.
> Some auxiliary information:
> 1. Flink cluster id: flink-k8s-session-opd-public-1132
> 2. high-availability.storageDir of the cluster configuration:
> hdfs://neophdfsv2flink/home/flink/recovery/
> 3. Failing job id: 18193cde2c359f492f76c8ce4cd20271
> 4. There was a similar issue before, FLINK-8770, but it was closed without
> being resolved.
> 5. The complete JobManager log is uploaded as an attachment.
> My investigation so far:
> 1. Inspect the ZooKeeper nodes for job id
> 18193cde2c359f492f76c8ce4cd20271:
> [zk: localhost:2181(CONNECTED) 17] ls
> /flink/flink/flink-k8s-session-opd-public-1132/checkpoints/18193cde2c359f492f76c8ce4cd20271
> [0000000000000025852, 0000000000000025851]
> [zk: localhost:2181(CONNECTED) 14] get
> /flink/flink/flink-k8s-session-opd-public-1132/checkpoints/18193cde2c359f492f76c8ce4cd20271/0000000000000025852
> (binary data: a Java-serialized
> org.apache.flink.runtime.state.RetrievableStreamStateHandle wrapping an
> org.apache.flink.runtime.state.filesystem.FileStateHandle whose filePath is
> hdfs://neophdfsv2flink/home/flink/recovery/flink/flink-k8s-session-opd-public-1132/completedCheckpoint86fce98d7e4a)
> cZxid = 0x1070932e2
> ctime = Wed Jul 06 02:28:51 UTC 2022
> mZxid = 0x1070932e2
> mtime = Wed Jul 06 02:28:51 UTC 2022
> pZxid = 0x30001c957
> cversion = 222
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x0
> dataLength = 545
> numChildren = 0
> I am confident the ZooKeeper ensemble itself is healthy: more than 10 Flink
> clusters use the same ZooKeeper, and only this cluster has the problem; the
> others are fine.
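> For reference, the hdfs:// path embedded in each checkpoint znode can also be
> extracted programmatically instead of decoding the raw zkCli dump by eye. The
> following is only a sketch and not part of our setup: it uses Apache Curator
> with the connection string and znode path from this cluster, and it assumes
> the path is stored as a plain java-serialized String inside the state handle
> bytes (which matches the dump above).
>
> // Sketch: list the checkpoint znodes of one job and print the HDFS path each
> // one references. Assumes Curator is on the classpath; the connection string
> // and znode path are the values from this ticket and must be adjusted.
> import java.nio.charset.StandardCharsets;
> import java.util.List;
>
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.CuratorFrameworkFactory;
> import org.apache.curator.retry.ExponentialBackoffRetry;
>
> public class CheckpointZnodeInspector {
>
>     public static void main(String[] args) throws Exception {
>         String checkpointsPath =
>                 "/flink/flink/flink-k8s-session-opd-public-1132/checkpoints/"
>                         + "18193cde2c359f492f76c8ce4cd20271";
>
>         CuratorFramework client = CuratorFrameworkFactory.newClient(
>                 "localhost:2181", new ExponentialBackoffRetry(1000, 3));
>         client.start();
>         try {
>             List<String> children = client.getChildren().forPath(checkpointsPath);
>             for (String child : children) {
>                 byte[] data = client.getData().forPath(checkpointsPath + "/" + child);
>                 System.out.println(child + " -> " + extractHdfsPath(data));
>             }
>         } finally {
>             client.close();
>         }
>     }
>
>     /**
>      * Pulls the embedded "hdfs://..." string out of the serialized state handle.
>      * Assumes the path is stored as a java-serialized String, i.e. the two bytes
>      * before the text hold its UTF length (true for the dump in this ticket).
>      */
>     private static String extractHdfsPath(byte[] data) {
>         String raw = new String(data, StandardCharsets.ISO_8859_1);
>         int start = raw.indexOf("hdfs://");
>         if (start < 2) {
>             return "<no hdfs path found>";
>         }
>         int length = ((data[start - 2] & 0xff) << 8) | (data[start - 1] & 0xff);
>         return raw.substring(start, Math.min(start + length, raw.length()));
>     }
> }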
> 2. Check the HDFS audit log for the path referenced by ZooKeeper:
> ./hdfs-audit.log.1:2022-07-06 10:28:51,752 INFO FSNamesystem.audit:
> allowed=true [email protected] (auth:KERBEROS) ip=/10.91.136.213
> cmd= create
> src=/home/flink/recovery/flink/flink-k8s-session-opd-public-1132/completedCheckpoint86fce98d7e4a
> dst=null perm=flinkuser:flinkuser:rw-r--r-- proto=rpc
> ./hdfs-audit.log.1:2022-07-06 10:29:26,588 INFO FSNamesystem.audit:
> allowed=true [email protected] (auth:KERBEROS) ip=/10.91.136.213
> cmd= delete
> src=/home/flink/recovery/flink/flink-k8s-session-opd-public-1132/completedCheckpoint86fce98d7e4a
> dst=null perm=null proto=rpc
> I do not understand why Flink created this file and then deleted it without
> updating the metadata in ZooKeeper; on restart the JobManager cannot find the
> referenced file and keeps restarting.
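> To make the mismatch concrete, the path that ZooKeeper still references can be
> checked against HDFS directly. Again only a sketch, assuming the Hadoop client
> configuration and Kerberos login are already available in the environment that
> runs it; the URI is the one from this ticket.
>
> // Sketch: check whether the completedCheckpoint file referenced from ZooKeeper
> // still exists on HDFS. If this prints "exists: false" while the znode above
> // still points at the path, JobManager recovery will fail with the
> // FileNotFoundException seen in the attached log.
> import java.net.URI;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class CompletedCheckpointExistsCheck {
>
>     public static void main(String[] args) throws Exception {
>         String checkpointUri =
>                 "hdfs://neophdfsv2flink/home/flink/recovery/flink/"
>                         + "flink-k8s-session-opd-public-1132/completedCheckpoint86fce98d7e4a";
>
>         Configuration conf = new Configuration();
>         try (FileSystem fs = FileSystem.get(URI.create(checkpointUri), conf)) {
>             Path path = new Path(checkpointUri);
>             System.out.println(path + " exists: " + fs.exists(path));
>         }
>     }
> }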
--
This message was sent by Atlassian Jira
(v8.20.10#820010)