[
https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126769#comment-17126769
]
Congxian Qiu(klion26) commented on FLINK-18091:
-----------------------------------------------
Tested on a real cluster; the results are as expected:
* A relocated savepoint restores successfully.
* A relocated checkpoint fails to restore with a FileNotFoundException.
The full logs are attached below:
{{_Usernames, IPs, ports and other sensitive information have been masked._}}
1. Log for savepoint
{code:java}
[~/flink-1.11-SNAPSHOT]$ ./bin/flink savepoint 9bcc2546a841b36a39c46fbe13a2b631 hdfs:///user/xxxxxx/congxianqiu/savepoint -yid application_1591259429117_0007
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/work/congxianqiu/flink-1.11-SNAPSHOT/lib/log4j-slf4j-impl-2.12.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/xxxxxx/hadoop/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2020-06-05 20:27:43,039 WARN org.apache.flink.yarn.configuration.YarnLogConfigUtil [] - The configuration directory ('/data/work/congxianqiu/flink-1.11-SNAPSHOT/conf') already contains a LOG4J config file.If you want to use logback, then please delete or rename the log configuration file.
2020-06-05 20:27:43,422 INFO org.apache.flink.yarn.YarnClusterDescriptor [] - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2020-06-05 20:27:43,513 INFO org.apache.flink.yarn.YarnClusterDescriptor [] - Found Web Interface 10-215-128-84:35572 of application 'application_1591259429117_0007'.
Triggering savepoint for job 9bcc2546a841b36a39c46fbe13a2b631.
Waiting for response...
Savepoint completed. Path: hdfs://ip:port/user/xxxxxx/congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33
You can resume your program from this savepoint with the run command.
[~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint
Found 1 items
drwxr-xr-x - xxxxxx supergroup 0 2020-06-05 20:27 congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33
[~/congxianqiu/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33
Found 2 items
-rw-r--r-- 3 xxxxxx supergroup 74170 2020-06-05 20:27 congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33/6508ac9e-0d2a-4583-96ad-1d67fb5b1c8a
-rw-r--r-- 3 xxxxxx supergroup 1205 2020-06-05 20:27 congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33/_metadata
[~/flink-1.11-SNAPSHOT]$ hadoop fs -mv congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33 congxianqiu/savepoint/newsavepointpath
[~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint
Found 1 items
drwxr-xr-x - xxxxxx supergroup 0 2020-06-05 20:27 congxianqiu/savepoint/newsavepointpath
[~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint/newsavepointpath
Found 2 items
-rw-r--r-- 3 xxxxxx supergroup 74170 2020-06-05 20:27 congxianqiu/savepoint/newsavepointpath/6508ac9e-0d2a-4583-96ad-1d67fb5b1c8a
-rw-r--r-- 3 xxxxxx supergroup 1205 2020-06-05 20:27 congxianqiu/savepoint/newsavepointpath/_metadata
[~/flink-1.11-SNAPSHOT]$ ./bin/flink run -s hdfs:///user/xxxxxx/congxianqiu/newsavepointpath/_metadata -m yarn-cluster -c com.klion26.data.FlinkDemo /data/work/congxianqiu/flink-1.11-SNAPSHOT/ft_local/Flink-Demo-1.0-SNAPSHOT.jar
SLF4J: Class path contains multiple SLF4J bindings.
>>>>>> jobmanager.log
2020-06-05 21:11:10,053 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Starting job b2fbfa6527391f035e8eebd791c2f64e from savepoint hdfs:///user/xxxxxx/congxianqiu/savepoint/newsavepointpath/_metadata ()
2020-06-05 21:11:10,198 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Reset the checkpoint ID of job b2fbfa6527391f035e8eebd791c2f64e to 3.
2020-06-05 21:11:10,198 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Restoring job b2fbfa6527391f035e8eebd791c2f64e from latest valid checkpoint: Checkpoint 2 @ 0 for b2fbfa6527391f035e8eebd791c2f64e.
2020-06-05 21:11:10,206 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - No master state to restore
.......
2020-06-05 21:11:16,117 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint triggering task Source: Custom Source (1/1) of job b2fbfa6527391f035e8eebd791c2f64e is not in state RUNNING but SCHEDULED instead. Aborting checkpoint.
2020-06-05 21:11:19,456 INFO org.apache.flink.yarn.YarnResourceManager [] - Registering TaskManager with ResourceID container_e18_1591259429117_0019_01_000002 (akka.tcp://flink@10-215-128-83:56603/user/rpc/taskmanager_0) at ResourceManager
2020-06-05 21:11:19,566 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source (1/1) (bc206dbf0e964487e5ba8c4355cb691e) switched from SCHEDULED to DEPLOYING.
2020-06-05 21:11:19,566 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Deploying Source: Custom Source (1/1) (attempt #0) to container_e18_1591259429117_0019_01_000002 @ 10-215-128-83 (dataPort=45167)
2020-06-05 21:11:19,572 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Map -> Sink: Unnamed (1/1) (7ae861cf453455d722d5d4ece0c10d1a) switched from SCHEDULED to DEPLOYING.
2020-06-05 21:11:19,573 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Deploying Map -> Sink: Unnamed (1/1) (attempt #0) to container_e18_1591259429117_0019_01_000002 @ 10-215-128-83 (dataPort=45167)
2020-06-05 21:11:20,467 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Map -> Sink: Unnamed (1/1) (7ae861cf453455d722d5d4ece0c10d1a) switched from DEPLOYING to RUNNING.
2020-06-05 21:11:20,468 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source (1/1) (bc206dbf0e964487e5ba8c4355cb691e) switched from DEPLOYING to RUNNING.
2020-06-05 21:12:16,199 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 3 @ 1591362736116 for job b2fbfa6527391f035e8eebd791c2f64e.
2020-06-05 21:12:16,854 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 3 for job b2fbfa6527391f035e8eebd791c2f64e (106237 bytes in 736 ms).
2020-06-05 21:13:16,172 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 4 @ 1591362796116 for job b2fbfa6527391f035e8eebd791c2f64e.
2020-06-05 21:13:16,680 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 4 for job b2fbfa6527391f035e8eebd791c2f64e (32823 bytes in 542 ms).
{code}
2. Log for checkpoint
{code:java}
[~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/checkpoint/b2fbfa6527391f035e8eebd791c2f64e/
Found 3 items
drwxr-xr-x - xxxxxx supergroup 0 2020-06-05 21:15 congxianqiu/checkpoint/b2fbfa6527391f035e8eebd791c2f64e/chk-6
drwxr-xr-x - xxxxxx supergroup 0 2020-06-05 21:15 congxianqiu/checkpoint/b2fbfa6527391f035e8eebd791c2f64e/shared
drwxr-xr-x - xxxxxx supergroup 0 2020-06-05 21:11 congxianqiu/checkpoint/b2fbfa6527391f035e8eebd791c2f64e/taskowned
[~/flink-1.11-SNAPSHOT]$ hadoop fs -mv congxianqiu/checkpoint/b2fbfa6527391f035e8eebd791c2f64e congxianqiu/checkpoint/movecheckpoint
[~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/checkpoint/movecheckpoint
Found 3 items
drwxr-xr-x - xxxxxx supergroup 0 2020-06-05 21:23 congxianqiu/checkpoint/movecheckpoint/chk-6
drwxr-xr-x - xxxxxx supergroup 0 2020-06-05 21:23 congxianqiu/checkpoint/movecheckpoint/shared
drwxr-xr-x - xxxxxx supergroup 0 2020-06-05 21:23 congxianqiu/checkpoint/movecheckpoint/taskowned
jobmanager.log
Caused by: org.apache.hadoop.ipc.RemoteException: File does not exist: /user/xxxxxx/congxianqiu/checkpoint/b2fbfa6527391f035e8eebd791c2f64e/shared/56704ae2-d04c-4073-aa6b-843a40e15bbe
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1836)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1808)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1723)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:588)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:366)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2217)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2213)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2213)
at org.apache.hadoop.ipc.Client.call(Client.java:1476) ~[hadoop-common-2.7.4.jar:?]
at org.apache.hadoop.ipc.Client.call(Client.java:1413) ~[hadoop-common-2.7.4.jar:?]
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) ~[hadoop-common-2.7.4.jar:?]
at com.sun.proxy.$Proxy35.getBlockLocations(Unknown Source) ~[?:?]
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:255) ~[hadoop-hdfs-2.7.4.jar:?]
{code}
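The stack trace shows the restore still asking for a file under the old {{b2fbfa6527391f035e8eebd791c2f64e/shared}} directory after the move. That suggests the checkpoint metadata refers to its shared state by the original absolute path, whereas the relocated savepoint's data file was found next to its {{_metadata}} file. The following is a minimal local-filesystem sketch of that difference, not Flink's actual metadata format: the {{_metadata}} files here are plain text holding a single path, plain {{mv}} stands in for {{hadoop fs -mv}}, and {{resolve_ref}} is a hypothetical helper.

```shell
#!/bin/sh
# Toy model only: a "_metadata" file that holds a single path to a state file.
# Real Flink metadata is a binary format; resolve_ref is a hypothetical helper.

# Print the state file path that a metadata file points to.
# Relative references are resolved against the metadata file's directory.
resolve_ref() {
    meta="$1"
    ref=$(cat "$meta")
    case "$ref" in
        /*) printf '%s\n' "$ref" ;;                          # absolute: used as written
        *)  printf '%s/%s\n' "$(dirname "$meta")" "$ref" ;;  # relative: follows the directory
    esac
}

work=$(mktemp -d)

# "Savepoint": the metadata refers to its data file by a relative path.
mkdir -p "$work/savepoint-old"
echo "state" > "$work/savepoint-old/data-file"
echo "data-file" > "$work/savepoint-old/_metadata"

# "Incremental checkpoint": the metadata refers to a shared file by absolute path.
mkdir -p "$work/job-old/shared" "$work/job-old/chk-6"
echo "state" > "$work/job-old/shared/sst-file"
echo "$work/job-old/shared/sst-file" > "$work/job-old/chk-6/_metadata"

# Relocate both directories.
mv "$work/savepoint-old" "$work/savepoint-new"
mv "$work/job-old" "$work/job-new"

sp_ref=$(resolve_ref "$work/savepoint-new/_metadata")
cp_ref=$(resolve_ref "$work/job-new/chk-6/_metadata")

[ -f "$sp_ref" ] && echo "savepoint: state file found"
[ -f "$cp_ref" ] || echo "checkpoint: state file missing at old path"
```

In this toy setup the relative reference keeps working after the move, while the absolute one still points into the directory that no longer exists, which mirrors the "File does not exist" error in the log.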
> Test Relocatable Savepoints
> ---------------------------
>
> Key: FLINK-18091
> URL: https://issues.apache.org/jira/browse/FLINK-18091
> Project: Flink
> Issue Type: Sub-task
> Components: Tests
> Reporter: Stephan Ewen
> Assignee: Congxian Qiu(klion26)
> Priority: Major
> Labels: release-testing
> Fix For: 1.11.0
>
>
> The test should do the following:
> * take a savepoint. The job needs enough state that there
> is more than just the "_metadata" file
> * copy it to another directory
> * start the job from that savepoint by addressing the metadata file and by
> addressing the savepoint directory
> We should also test that an incremental checkpoint that gets moved fails with
> a reasonable exception.
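The steps in the issue description can be sketched as a small script. This is only an outline under assumptions: every ID, path and jar name is a placeholder, the {{run}} and {{relocation_test}} helpers are hypothetical, cluster-specific flags such as {{-yid}} or {{-m yarn-cluster}} are left out, and in a real run the savepoint path would be taken from the output of {{flink savepoint}} rather than passed in up front. By default the commands are only printed.

```shell
#!/bin/sh
# Sketch of the relocation test; every ID and path here is a placeholder.
# Commands are only printed unless DRY_RUN=0 is set in the environment.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

relocation_test() {
    job_id="$1"      # ID of the running job
    sp_target="$2"   # directory passed to "flink savepoint"
    sp_path="$3"     # savepoint path printed by "flink savepoint"
    new_path="$4"    # location the savepoint is copied to
    job_jar="$5"     # jar used to resubmit the job

    run ./bin/flink savepoint "$job_id" "$sp_target"
    run hadoop fs -cp "$sp_path" "$new_path"
    # Restore once via the metadata file and once via the savepoint directory.
    run ./bin/flink run -s "$new_path/_metadata" "$job_jar"
    run ./bin/flink run -s "$new_path" "$job_jar"
}

relocation_test "<job-id>" "hdfs:///tmp/sp" "hdfs:///tmp/sp/savepoint-xxxx" \
    "hdfs:///tmp/sp-moved" "job.jar"
```

Keeping the two restore variants as separate commands makes it easy to see which addressing mode fails if only one of them does.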