aresyhzhang created FLINK-28431:
-----------------------------------

             Summary: CompletedCheckpoints stored in ZooKeeper are not 
up-to-date; when the JobManager is restarted, it fails to recover the job due 
to a checkpoint FileNotFoundException
                 Key: FLINK-28431
                 URL: https://issues.apache.org/jira/browse/FLINK-28431
             Project: Flink
          Issue Type: Bug
    Affects Versions: 1.13.2
         Environment: flink:1.13.2
java:1.8
            Reporter: aresyhzhang
         Attachments: error.log

We run many Flink clusters in native Kubernetes session mode on Flink 1.13.2. 
Some clusters run normally for 180 days, while others only run for 30 days.
The following takes one abnormal cluster, flink-k8s-session-opd-public-1132, 
as an example.

Problem Description:
When the JobManager restarts, recovery fails with:
File does not exist: 
/home/flink/recovery/flink/flink-k8s-session-opd-public-1132/completedCheckpoint86fce98d7e4a
As a result, the entire Flink cluster cannot start. Because the other jobs in 
the session cluster are also unable to start, the impact is very serious.
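
For illustration, here is a minimal sketch (the class name is made up; the 
path is the one from the error above) of the kind of open() that recovery 
ultimately performs on the referenced HDFS file. Because the file has been 
deleted, it fails with a FileNotFoundException:

import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

public class OpenStaleCheckpointFile {
    public static void main(String[] args) throws Exception {
        // The HDFS file that the ZooKeeper pointer still references.
        Path path = new Path(
                "hdfs://neophdfsv2flink/home/flink/recovery/flink/"
                        + "flink-k8s-session-opd-public-1132/completedCheckpoint86fce98d7e4a");
        // Recovery has to read this file; since it was deleted on HDFS,
        // the open fails with FileNotFoundException.
        FileSystem.get(path.toUri()).open(path).close();
    }
}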

Some auxiliary information:
1. Flink cluster id: flink-k8s-session-opd-public-1132
2. high-availability.storageDir of the cluster configuration: 
hdfs://neophdfsv2flink/home/flink/recovery/ (see the sketch after this list)
3. Affected job id: 18193cde2c359f492f76c8ce4cd20271
4. There was a similar issue before, FLINK-8770, but it was closed without 
being resolved.
5. The complete jobmanager log has been uploaded as an attachment.
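
For reference, a minimal sketch of the HA-related settings involved, 
expressed through Flink's Configuration API. The storage dir and cluster id 
are the values listed above; the ZooKeeper quorum address and the class name 
are placeholders:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.HighAvailabilityOptions;

public class HaSettingsSketch {
    // Builds the HA-related configuration used by this session cluster.
    static Configuration haConfig() {
        Configuration conf = new Configuration();
        conf.set(HighAvailabilityOptions.HA_MODE, "zookeeper");
        conf.set(HighAvailabilityOptions.HA_STORAGE_PATH,
                "hdfs://neophdfsv2flink/home/flink/recovery/");
        conf.set(HighAvailabilityOptions.HA_CLUSTER_ID,
                "flink-k8s-session-opd-public-1132");
        // Placeholder quorum address; not taken from this report.
        conf.set(HighAvailabilityOptions.HA_ZOOKEEPER_QUORUM, "zk-host:2181");
        return conf;
    }
}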

My investigation so far:

1. Check the ZooKeeper nodes corresponding to job id 
18193cde2c359f492f76c8ce4cd20271:

[zk: localhost:2181(CONNECTED) 17] ls 
/flink/flink/flink-k8s-session-opd-public-1132/checkpoints/18193cde2c359f492f76c8ce4cd20271
[0000000000000025852, 0000000000000025851]


[zk: localhost:2181(CONNECTED) 14] get 
/flink/flink/flink-k8s-session-opd-public-1132/checkpoints/18193cde2c359f492f76c8ce4cd20271/0000000000000025852

??sr;org.apache.flink.runtime.state.RetrievableStreamStateHandle?U?+LwrappedStreamStateHandlet2Lorg/apache/flink/runtime/state/StreamStateHandle;xpsr9org.apache.flink.runtime.state.filesystem.FileStateHandle?u?b?J
 
stateSizefilePathtLorg/apache/flink/core/fs/Path;xp??srorg.apache.flink.core.fs.PathLuritLjava/net/URI;xpsr
java.net.URI?x.C?I?LstringtLjava/lang/String;xptrhdfs://neophdfsv2flink/home/flink/recovery/flink/flink-k8s-session-opd-public-1132/completedCheckpoint86fce98d7e4ax
cZxid = 0x1070932e2
ctime = Wed Jul 06 02:28:51 UTC 2022
mZxid = 0x1070932e2
mtime = Wed Jul 06 02:28:51 UTC 2022
pZxid = 0x30001c957
cversion = 222
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 545
numChildren = 0

I am confident that the ZooKeeper ensemble itself is healthy, because 10+ 
Flink clusters use the same ZooKeeper and only this cluster has the problem; 
all other clusters are normal.
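
Instead of reading the serialized bytes by eye, the pointer node can also be 
deserialized programmatically. A minimal sketch, assuming the Flink runtime 
and Curator are on the classpath (the node stores a Java-serialized 
RetrievableStreamStateHandle wrapping a FileStateHandle, whose HDFS path is 
visible in the raw dump above); the class name is made up:

import java.io.ByteArrayInputStream;
import java.io.ObjectInputStream;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.flink.runtime.state.RetrievableStateHandle;

public class InspectCheckpointPointer {
    public static void main(String[] args) throws Exception {
        // The pointer node inspected with zkCli above.
        String node = "/flink/flink/flink-k8s-session-opd-public-1132/checkpoints/"
                + "18193cde2c359f492f76c8ce4cd20271/0000000000000025852";
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            byte[] data = client.getData().forPath(node);
            // The node data is a Java-serialized RetrievableStreamStateHandle that
            // wraps a FileStateHandle pointing at the completedCheckpoint file on HDFS.
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
                RetrievableStateHandle<?> handle = (RetrievableStateHandle<?>) in.readObject();
                System.out.println("Deserialized pointer: " + handle);
            }
        } finally {
            client.close();
        }
    }
}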

2. Check the HDFS audit log for the file in question:
./hdfs-audit.log.1:2022-07-06 10:28:51,752 INFO FSNamesystem.audit: 
allowed=true [email protected] (auth:KERBEROS) ip=/10.91.136.213 cmd= 
create 
src=/home/flink/recovery/flink/flink-k8s-session-opd-public-1132/completedCheckpoint86fce98d7e4a
 dst=null perm=flinkuser:flinkuser:rw-r--r-- proto=rpc
./hdfs-audit.log.1:2022-07-06 10:29:26,588 INFO FSNamesystem.audit: 
allowed=true [email protected] (auth:KERBEROS) ip=/10.91.136.213 cmd= 
delete 
src=/home/flink/recovery/flink/flink-k8s-session-opd-public-1132/completedCheckpoint86fce98d7e4a
 dst=null perm=null proto=rpc

I do not know why Flink created the file and then deleted it without updating 
the metadata in ZooKeeper. As a result, the restarted JobManager cannot find 
the referenced file and keeps restarting.
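
Combining the two sketches above, the mismatch can be confirmed for every 
retained checkpoint of the job: deserialize each pointer and try to retrieve 
the state it references, which is what the JobManager does on recovery. A 
minimal sketch, assuming the Flink runtime, Hadoop HDFS client, and Curator 
are on the classpath; the class name is made up:

import java.io.ByteArrayInputStream;
import java.io.FileNotFoundException;
import java.io.ObjectInputStream;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.flink.runtime.state.RetrievableStateHandle;

public class FindStaleCheckpointPointers {
    public static void main(String[] args) throws Exception {
        String jobPath = "/flink/flink/flink-k8s-session-opd-public-1132/checkpoints/"
                + "18193cde2c359f492f76c8ce4cd20271";
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            for (String child : client.getChildren().forPath(jobPath)) {
                byte[] data = client.getData().forPath(jobPath + "/" + child);
                try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
                    RetrievableStateHandle<?> handle = (RetrievableStateHandle<?>) in.readObject();
                    try {
                        // Reads the referenced completedCheckpoint file from HDFS,
                        // just like the JobManager does on recovery.
                        handle.retrieveState();
                        System.out.println(child + ": referenced file is readable");
                    } catch (FileNotFoundException e) {
                        System.out.println(child + ": STALE pointer, file deleted: " + e.getMessage());
                    }
                }
            }
        } finally {
            client.close();
        }
    }
}

Any pointer reported as stale here is exactly the kind of out-of-date 
ZooKeeper entry that makes the restarted JobManager fail.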



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
