[ 
https://issues.apache.org/jira/browse/FLINK-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884822#comment-16884822
 ] 

Zhenqiu Huang commented on FLINK-13132:
---------------------------------------

[~maguowei]
I think your questions are mainly about how to guarantee that no data is lost
when moving a job between clusters.

1) We only move a job between clusters in the same region (network latency
< 2 ms). Most of the time we therefore don't change the data source, so a job
can always read from the last committed offset (if there is no checkpoint).
Jobs with state are another story. Our storage team built a blob management
system on top of internal HDFS, S3 and GCS. It provides data placement
policies and a data replication service for us. For example, we define a job
with its state stored in a blob folder, and the blob is configured to be
replicated to GCS. When we want to restart a stateful job from a local cluster
on a cloud cluster with the latest checkpoint, the checkpoint has already been
copied to the cloud within the same namespace.

2) It is a good suggestion. I have been looking into how to utilize the
existing HA settings for storing the job graph. I think I can store the job
graph in ZooKeeper in HA mode on the first launch, so that when the
application master fails and recovers, the job graph can be reused. Let's
discuss the details in the upcoming PR.
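
As a rough illustration of point 2, the "generate once, reuse on recovery"
idea could look like the sketch below. This is not Flink code: GraphStore,
InMemoryGraphStore and loadOrGenerate are hypothetical names, and the
in-memory map only stands in for a store backed by ZooKeeper in HA mode.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class JobGraphReuseSketch {

    /** Hypothetical stand-in for an HA-backed store; not a Flink API. */
    interface GraphStore {
        Optional<byte[]> get(String jobId);
        void put(String jobId, byte[] serializedGraph);
    }

    /** In-memory placeholder; a real store would persist outside the process. */
    static final class InMemoryGraphStore implements GraphStore {
        private final Map<String, byte[]> graphs = new ConcurrentHashMap<>();

        @Override
        public Optional<byte[]> get(String jobId) {
            return Optional.ofNullable(graphs.get(jobId));
        }

        @Override
        public void put(String jobId, byte[] serializedGraph) {
            graphs.put(jobId, serializedGraph);
        }
    }

    /**
     * Returns the stored graph if one exists; otherwise runs the generator
     * (the expensive "user main method" step), stores the result, and returns it.
     */
    static byte[] loadOrGenerate(GraphStore store, String jobId, Supplier<byte[]> generator) {
        Optional<byte[]> existing = store.get(jobId);
        if (existing.isPresent()) {
            return existing.get();
        }
        byte[] generated = generator.get();
        store.put(jobId, generated);
        return generated;
    }

    public static void main(String[] args) {
        GraphStore store = new InMemoryGraphStore();
        int[] calls = {0};
        Supplier<byte[]> generator = () -> {
            calls[0]++;
            return "serialized-job-graph".getBytes(StandardCharsets.UTF_8);
        };

        // First launch: the graph is generated and stored.
        loadOrGenerate(store, "job-1", generator);
        // Simulated recovery: the stored graph is reused, the generator is not run again.
        loadOrGenerate(store, "job-1", generator);

        System.out.println("generator invocations: " + calls[0]);
    }
}
```

The point of the sketch is only that recovery takes the read path and never
re-runs the user's main method; how the graph is serialized and where it
lives would be decided in the PR.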

> Allow ClusterEntrypoints use user main method to generate job graph
> -------------------------------------------------------------------
>
>                 Key: FLINK-13132
>                 URL: https://issues.apache.org/jira/browse/FLINK-13132
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / YARN
>    Affects Versions: 1.8.0, 1.8.1
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Minor
>
> We are building a service that can transparently deploy a job to different
> cluster management systems, such as Yarn and another internal system. It is
> very costly to download the jar and generate the JobGraph on the client side.
> Thus, I want to propose an improvement that makes the Yarn entrypoints
> configurable to use either FileJobGraphRetriever or
> ClassPathJobGraphRetriever. It is actually a long-standing TODO in
> AbstractYarnClusterDescriptor at line 834.
> https://github.com/apache/flink/blob/21468e0050dc5f97de5cfe39885e0d3fd648e399/flink-yarn/src/main/java/org/apache/flink/yarn/AbstractYarnClusterDescriptor.java#L834



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
