[jira] [Commented] (FLINK-18091) Test Relocatable Savepoints

Congxian Qiu(klion26) (Jira) Fri, 05 Jun 2020 06:39:12 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126769#comment-17126769
 ]


Congxian Qiu(klion26) commented on FLINK-18091:
-----------------------------------------------

Tested on a real cluster; the results are as expected.
 * A relocated savepoint restores successfully.
 * A relocated checkpoint fails to restore with a FileNotFoundException.

The full log is attached below:

{{_Username, IP, port, and other sensitive information have been masked._}}

1. Log for savepoint

 
{code:java}
[~/flink-1.11-SNAPSHOT]$ ./bin/flink savepoint 9bcc2546a841b36a39c46fbe13a2b631 
hdfs:///user/xxxxxx/congxianqiu/savepoint -yid application_1591259429117_0007
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/data/work/congxianqiu/flink-1.11-SNAPSHOT/lib/log4j-slf4j-impl-2.12.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/data/xxxxxx/hadoop/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2020-06-05 20:27:43,039 WARN  
org.apache.flink.yarn.configuration.YarnLogConfigUtil        [] - The 
configuration directory ('/data/work/congxianqiu/flink-1.11-SNAPSHOT/conf') 
already contains a LOG4J config file.If you want to use logback, then please 
delete or rename the log configuration file.
2020-06-05 20:27:43,422 INFO  org.apache.flink.yarn.YarnClusterDescriptor       
           [] - No path for the flink jar passed. Using the location of class 
org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2020-06-05 20:27:43,513 INFO  org.apache.flink.yarn.YarnClusterDescriptor       
           [] - Found Web Interface 10-215-128-84:35572 of application 
'application_1591259429117_0007'.
Triggering savepoint for job 9bcc2546a841b36a39c46fbe13a2b631.
Waiting for response...
Savepoint completed. Path: 
hdfs://ip:port/user/xxxxxx/congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33
You can resume your program from this savepoint with the run command.

[~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint
Found 1 items
drwxr-xr-x   - xxxxxx supergroup          0 2020-06-05 20:27 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33
[~/congxianqiu/flink-1.11-SNAPSHOT]$ hadoop fs -ls 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33
Found 2 items
-rw-r--r--   3 xxxxxx supergroup      74170 2020-06-05 20:27 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33/6508ac9e-0d2a-4583-96ad-1d67fb5b1c8a
-rw-r--r--   3 xxxxxx supergroup       1205 2020-06-05 20:27 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33/_metadata



[~/flink-1.11-SNAPSHOT]$ hadoop fs -mv 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33 
congxianqiu/savepoint/newsavepointpath
[ ~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint
Found 1 items
drwxr-xr-x   - xxxxxx supergroup          0 2020-06-05 20:27 
congxianqiu/savepoint/newsavepointpath
[~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint/newsavepointpath
Found 2 items
-rw-r--r--   3 xxxxxx supergroup      74170 2020-06-05 20:27 
congxianqiu/savepoint/newsavepointpath/6508ac9e-0d2a-4583-96ad-1d67fb5b1c8a
-rw-r--r--   3 xxxxxx supergroup       1205 2020-06-05 20:27 
congxianqiu/savepoint/newsavepointpath/_metadata


[~/flink-1.11-SNAPSHOT]$ ./bin/flink run -s 
hdfs:///user/xxxxxx/congxianqiu/savepoint/newsavepointpath/_metadata -m yarn-cluster -c 
com.klion26.data.FlinkDemo 
/data/work/congxianqiu/flink-1.11-SNAPSHOT/ft_local/Flink-Demo-1.0-SNAPSHOT.jar
SLF4J: Class path contains multiple SLF4J bindings.


>>>>>> jobmanager.log
2020-06-05 21:11:10,053 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Starting job 
b2fbfa6527391f035e8eebd791c2f64e from savepoint 
hdfs:///user/xxxxxx/congxianqiu/savepoint/newsavepointpath/_metadata ()
2020-06-05 21:11:10,198 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Reset the 
checkpoint ID of job b2fbfa6527391f035e8eebd791c2f64e to 3.
2020-06-05 21:11:10,198 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Restoring job 
b2fbfa6527391f035e8eebd791c2f64e from latest valid checkpoint: Checkpoint 2 @ 0 
for b2fbfa6527391f035e8eebd791c2f64e.
2020-06-05 21:11:10,206 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - No master 
state to restore
.......
2020-06-05 21:11:16,117 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Checkpoint 
triggering task Source: Custom Source (1/1) of job 
b2fbfa6527391f035e8eebd791c2f64e is not in state RUNNING but SCHEDULED instead. 
Aborting checkpoint.
2020-06-05 21:11:19,456 INFO  org.apache.flink.yarn.YarnResourceManager         
           [] - Registering TaskManager with ResourceID 
container_e18_1591259429117_0019_01_000002 
(akka.tcp://flink@10-215-128-83:56603/user/rpc/taskmanager_0) at ResourceManager
2020-06-05 21:11:19,566 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: 
Custom Source (1/1) (bc206dbf0e964487e5ba8c4355cb691e) switched from SCHEDULED 
to DEPLOYING.
2020-06-05 21:11:19,566 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Deploying 
Source: Custom Source (1/1) (attempt #0) to 
container_e18_1591259429117_0019_01_000002 @ 10-215-128-83 (dataPort=45167)
2020-06-05 21:11:19,572 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Map -> Sink: 
Unnamed (1/1) (7ae861cf453455d722d5d4ece0c10d1a) switched from SCHEDULED to 
DEPLOYING.
2020-06-05 21:11:19,573 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Deploying Map 
-> Sink: Unnamed (1/1) (attempt #0) to 
container_e18_1591259429117_0019_01_000002 @ 10-215-128-83 (dataPort=45167)
2020-06-05 21:11:20,467 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Map -> Sink: 
Unnamed (1/1) (7ae861cf453455d722d5d4ece0c10d1a) switched from DEPLOYING to 
RUNNING.
2020-06-05 21:11:20,468 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: 
Custom Source (1/1) (bc206dbf0e964487e5ba8c4355cb691e) switched from DEPLOYING 
to RUNNING.
2020-06-05 21:12:16,199 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
checkpoint 3 @ 1591362736116 for job b2fbfa6527391f035e8eebd791c2f64e.
2020-06-05 21:12:16,854 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed 
checkpoint 3 for job b2fbfa6527391f035e8eebd791c2f64e (106237 bytes in 736 ms).
2020-06-05 21:13:16,172 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
checkpoint 4 @ 1591362796116 for job b2fbfa6527391f035e8eebd791c2f64e.
2020-06-05 21:13:16,680 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed 
checkpoint 4 for job b2fbfa6527391f035e8eebd791c2f64e (32823 bytes in 542 ms).
{code}
 

 

2. Log for checkpoint

 
{code:java}
[~/flink-1.11-SNAPSHOT]$ hadoop fs -ls 
congxianqiu/checkpoint/b2fbfa6527391f035e8eebd791c2f64e/      
Found 3 items
drwxr-xr-x   - xxxxxx supergroup          0 2020-06-05 21:15 
congxianqiu/checkpoint/b2fbfa6527391f035e8eebd791c2f64e/chk-6
drwxr-xr-x   - xxxxxx supergroup          0 2020-06-05 21:15 
congxianqiu/checkpoint/b2fbfa6527391f035e8eebd791c2f64e/shared
drwxr-xr-x   - xxxxxx supergroup          0 2020-06-05 21:11 
congxianqiu/checkpoint/b2fbfa6527391f035e8eebd791c2f64e/taskowned


[ ~/flink-1.11-SNAPSHOT]$ hadoop fs -mv 
congxianqiu/checkpoint/b2fbfa6527391f035e8eebd791c2f64e 
congxianqiu/checkpoint/movecheckpoint
[~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/checkpoint/movecheckpoint
Found 3 items
drwxr-xr-x   - xxxxxx supergroup          0 2020-06-05 21:23 
congxianqiu/checkpoint/movecheckpoint/chk-6
drwxr-xr-x   - xxxxxx supergroup          0 2020-06-05 21:23 
congxianqiu/checkpoint/movecheckpoint/shared
drwxr-xr-x   - xxxxxx supergroup          0 2020-06-05 21:23 
congxianqiu/checkpoint/movecheckpoint/taskowned



>>>>>> jobmanager.log

Caused by: org.apache.hadoop.ipc.RemoteException: File does not exist: 
/user/xxxxxx/congxianqiu/checkpoint/b2fbfa6527391f035e8eebd791c2f64e/shared/56704ae2-d04c-4073-aa6b-843a40e15bbe
        at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
        at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1836)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1808)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1723)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:588)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:366)
        at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2217)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2213)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2213)


        at org.apache.hadoop.ipc.Client.call(Client.java:1476) 
~[hadoop-common-2.7.4.jar:?]
        at org.apache.hadoop.ipc.Client.call(Client.java:1413) 
~[hadoop-common-2.7.4.jar:?]
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
 ~[hadoop-common-2.7.4.jar:?]
        at com.sun.proxy.$Proxy35.getBlockLocations(Unknown Source) ~[?:?]
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:255)
 ~[hadoop-hdfs-2.7.4.jar:?]
{code}
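
The exception above still refers to the original job directory (.../checkpoint/b2fbfa6527391f035e8eebd791c2f64e/shared/...), which is consistent with the checkpoint metadata pointing at shared files by absolute path, while the moved savepoint resolved its files from the new location. The following is only a local-filesystem sketch of that difference, not Flink's actual metadata format: the directory names, the file names, and the one-line _metadata files are hypothetical, and plain mv stands in for hadoop fs -mv.

```shell
#!/usr/bin/env bash
# Local-filesystem sketch only; all names below are hypothetical.
work=$(mktemp -d)

# Case 1 (savepoint-like): _metadata stores a path relative to its own directory.
mkdir -p "$work/savepoint-old"
echo "state" > "$work/savepoint-old/state-file"
echo "state-file" > "$work/savepoint-old/_metadata"
mv "$work/savepoint-old" "$work/savepoint-new"
ref=$(cat "$work/savepoint-new/_metadata")
# Resolve the reference against the directory that now holds _metadata.
if [ -f "$work/savepoint-new/$ref" ]; then
  relative_result="found"
else
  relative_result="missing"
fi

# Case 2 (incremental-checkpoint-like): _metadata stores an absolute path
# to a file in the job's shared/ directory.
mkdir -p "$work/job-old/chk-6" "$work/job-old/shared"
echo "state" > "$work/job-old/shared/sst-file"
echo "$work/job-old/shared/sst-file" > "$work/job-old/chk-6/_metadata"
mv "$work/job-old" "$work/job-moved"
ref=$(cat "$work/job-moved/chk-6/_metadata")
# The stored path still names the old directory, which no longer exists.
if [ -f "$ref" ]; then
  absolute_result="found"
else
  absolute_result="missing"
fi

echo "relative reference after move: $relative_result"
echo "absolute reference after move: $absolute_result"
rm -rf "$work"
```

In this sketch the relative reference is still found after the move, while the absolute one is reported missing, mirroring the two results listed at the top of this comment.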
 

 

> Test Relocatable Savepoints
> ---------------------------
>
>                 Key: FLINK-18091
>                 URL: https://issues.apache.org/jira/browse/FLINK-18091
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Tests
>            Reporter: Stephan Ewen
>            Assignee: Congxian Qiu(klion26)
>            Priority: Major
>              Labels: release-testing
>             Fix For: 1.11.0
>
>
> The test should do the following:
>  * take a savepoint. needs to make sure the job has enough state that there 
> is more than just the "_metadata" file
>  * copy it to another directory
>  * start the job from that savepoint by addressing the metadata file and by 
> addressing the savepoint directory
> We should also test that an incremental checkpoint that gets moved fails with 
> a reasonable exception.
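
The steps in the issue description could be scripted roughly as follows. This is a dry-run sketch that only prints the commands and does not contact a cluster; the function name print_relocation_test, the hdfs:///tmp/... paths, job.jar, and the <jobId> and <savepoint-name> placeholders are assumptions for illustration and would need to be replaced with real values (plus any YARN options such as -yid or -m yarn-cluster used in the log above).

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the commands for the relocation test instead of running them.
# All arguments are placeholders chosen for illustration.
print_relocation_test() {
  local job_id="$1" savepoint_dir="$2" relocated_dir="$3" job_jar="$4"
  # 1. Take a savepoint of the running job.
  echo "./bin/flink savepoint ${job_id} ${savepoint_dir}"
  # 2. Copy the completed savepoint directory to another location.
  echo "hadoop fs -cp ${savepoint_dir}/<savepoint-name> ${relocated_dir}"
  # 3a. Restore by addressing the metadata file.
  echo "./bin/flink run -s ${relocated_dir}/_metadata ${job_jar}"
  # 3b. Restore by addressing the savepoint directory.
  echo "./bin/flink run -s ${relocated_dir} ${job_jar}"
}

relocation_plan=$(print_relocation_test "<jobId>" "hdfs:///tmp/savepoints" "hdfs:///tmp/relocated" "job.jar")
printf '%s\n' "$relocation_plan"
```

Printing the commands first makes it easy to check that both restore variants (metadata file and savepoint directory) are covered before running anything against a real job.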



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
