[
https://issues.apache.org/jira/browse/FLINK-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15946785#comment-15946785
]
Haohui Mai commented on FLINK-5668:
-----------------------------------
Sorry for the delayed response.
Our main requirement is to enable Flink to support mission-critical, real-time
applications. Our colleagues want to build such applications on top of Flink,
and they are concerned that no jobs can be started while HDFS is down -- today
there is no workaround that lets their applications keep their SLAs while HDFS
is under maintenance.
As you pointed out, there are multiple issues (e.g., checkpoints) with keeping
a Flink job running in the above scenario. To get started, we would like to be
able to start jobs while HDFS is down and address the other issues in later jiras.
As a result this essentially reduces to one requirement -- Flink needs to have
an option to bootstrap jobs without persisting data on {{default.FS}}.
I think https://github.com/apache/flink/pull/2796/files will work as long as
(1) Flink persists everything to that path, and (2) the path can specify a file
system other than {{default.FS}}. [~bill.liu8904], can you elaborate on why it
won't work for you?
Below are some inlined answers.
{quote}
All the paths are programmatically generated and there are no configuration
parameters for passing custom paths (correct me if I'm wrong).
Are you planning to basically fork Flink and create a custom YARN client /
Application Master implementation that allows using custom paths?
{quote}
It is sufficient to just specify the root of the path -- I believe something
like {{yarn.deploy.fs}} or https://github.com/apache/flink/pull/2796/files will
work.
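To make the idea concrete, here is a minimal, hypothetical sketch of the selection logic behind an option like {{yarn.deploy.fs}}: fall back to {{default.FS}} only when the option is unset. The option name comes from this discussion and is not an existing Flink configuration key, and the class below is illustrative, not Flink code.

```java
import java.net.URI;

// Hypothetical sketch: a "yarn.deploy.fs" option (proposed in this
// thread, not an existing Flink key) that overrides the file system
// used for staging deployment files.
public class StagingFsSelector {

    /** Returns the URI of the file system to stage deployment files on. */
    public static URI stagingFsUri(String deployFsOption, String defaultFs) {
        if (deployFsOption != null && !deployFsOption.isEmpty()) {
            // e.g. "s3a://staging-bucket/flink" or "viewfs://cluster/tmp"
            return URI.create(deployFsOption);
        }
        // Fall back to the cluster default (fs.defaultFS in Hadoop terms).
        return URI.create(defaultFs);
    }
}
```

With this shape, jobs submitted while HDFS is under maintenance could point the deploy option at any other file system, while existing setups keep their current behavior.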
{quote}
I think we didn't have your use case in mind when implementing the code. We
assumed that one file system will be used for distributing all required files.
Also, this approach works nicely with all the Hadoop vendors' versions.
{quote}
We originally shared the same line of thought -- that HDFS HA should be
sufficient. The problem is that mission-critical, real-time applications have a
much stricter SLA than HDFS, so they need to survive HDFS downtime.
{quote}
The general theme is: Some persistent store is needed currently, at least for
high-availability modes. Decoupling Yarn from a persistent store pushes the
responsibility to another layer.
{quote}
Totally agree. Whether in HA mode or not, having a distributed file system
underneath simplifies things a lot. Passing state through configuration /
environment variables is just one solution, and not necessarily the best one. I
think we are good to go as long as Flink is able to bootstrap jobs from places
other than {{default.FS}}.
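For reference, the environment-variable route could look something like the following minimal sketch (not Flink code; all names are illustrative): the client encodes the configuration map into a single env-safe string that YARN passes to the TaskManager container, which decodes it on startup. A value containing a newline would need escaping that this sketch omits.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: encode a configuration map into one Base64
// string of "key=value" lines, suitable for a container environment
// variable, and decode it on the TaskManager side.
public class EnvConfigCodec {

    public static String encode(Map<String, String> conf) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        return Base64.getEncoder()
                     .encodeToString(sb.toString().getBytes(StandardCharsets.UTF_8));
    }

    public static Map<String, String> decode(String encoded) {
        Map<String, String> conf = new HashMap<>();
        String text = new String(Base64.getDecoder().decode(encoded),
                                 StandardCharsets.UTF_8);
        for (String line : text.split("\n")) {
            if (line.isEmpty()) {
                continue;
            }
            int sep = line.indexOf('=');
            conf.put(line.substring(0, sep), line.substring(sep + 1));
        }
        return conf;
    }
}
```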
Thoughts?
> passing taskmanager configuration through taskManagerEnv instead of file
> ------------------------------------------------------------------------
>
> Key: FLINK-5668
> URL: https://issues.apache.org/jira/browse/FLINK-5668
> Project: Flink
> Issue Type: Improvement
> Components: YARN
> Reporter: Bill Liu
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> When creating a Flink cluster on YARN, the JobManager depends on HDFS to
> share taskmanager-conf.yaml with the TaskManagers.
> It would be better to serve taskmanager-conf.yaml from the JobManager web
> server instead of HDFS, which would reduce the HDFS dependency at job startup.
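As a rough illustration of the web-server idea in the issue description, the sketch below uses only JDK classes (the endpoint path and class names are assumptions, not Flink's actual web server API): one side serves the yaml over HTTP, the other fetches it instead of reading from HDFS.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical sketch: serve taskmanager-conf.yaml over HTTP so the
// TaskManager can fetch it at startup without touching HDFS.
public class ConfOverHttp {

    /** Starts an HTTP server exposing the given yaml (port 0 = ephemeral). */
    public static HttpServer serveConf(String yaml, int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/taskmanager-conf.yaml", exchange -> {
            byte[] body = yaml.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }

    /** TaskManager side: fetch the configuration from the given URL. */
    public static String fetchConf(String url) throws Exception {
        try (Scanner s = new Scanner(new URL(url).openStream(),
                                     StandardCharsets.UTF_8.name())
                             .useDelimiter("\\A")) {
            return s.hasNext() ? s.next() : "";
        }
    }
}
```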
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)