[
https://issues.apache.org/jira/browse/BEAM-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943161#comment-16943161
]
Ankur Goenka commented on BEAM-8183:
------------------------------------
I see. Thanks for explaining the use case.
I think hardcoded pipeline options are definitely a limitation as of now. We
can look at using Beam's ValueProvider to supply dynamic arguments. We can
also consider overriding the pipeline options when submitting the jar to Flink.
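As a rough sketch, parameterization through ValueProvider in the Python SDK
might look like the following (the option name and the read are illustrative,
not from this issue):
{code:python}
# Minimal sketch, assuming the Python SDK: the option value is deferred via
# ValueProvider and resolved at execution time rather than baked into the
# materialized pipeline. The option name --input_path is hypothetical.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class TemplateOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--input_path', type=str)

options = TemplateOptions()
with beam.Pipeline(options=options) as p:
    # ReadFromText accepts a ValueProvider for its file pattern.
    p | beam.io.ReadFromText(options.input_path) | beam.Map(print)
{code}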
{quote}Running the same pipeline in different environments with different
parameters is a common need. Virtually everyone has dev/staging/prod or
whatever their environments are and they want to use the same build artifact.
That normally requires some amount of parameterization.
{quote}
I don't really have a good solution for the dev/staging/prod use case. It is
not going to be solved by a jar with multiple pipelines (as each pipeline will
have a static set of pipeline options) but by a jar creating dynamic pipelines
(where the pipeline changes based on the pipeline options and environment).
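To illustrate what I mean by a dynamic pipeline, here is a hypothetical sketch
where the graph itself changes with an option, so a single materialized proto
cannot cover all environments:
{code:python}
# Hypothetical sketch: the pipeline shape depends on --env, so construction
# code must run per environment; a pre-materialized proto cannot adapt.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class EnvOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument('--env', default='dev')

options = EnvOptions()
p = beam.Pipeline(options=options)
records = p | beam.Create(['a', 'b', ''])
if options.view_as(EnvOptions).env == 'prod':
    # This transform only exists in the prod graph.
    records = records | beam.Filter(bool)
records | beam.Map(print)
p.run()
{code}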
The major issue to me seems to be that we need to execute pipeline construction
code which is environment dependent. To generate new pipelines for an
environment, we need to execute the pipeline submission code in that
environment. And this is where I see a problem. Python pipelines have to
execute user code in Python using the Python SDK to construct the pipeline.
Treating this jar as the artifact would not be ideal across environments, as
the actual SDK, libraries, etc. can differ between environments. From an
environment point of view, a Docker container capable of submitting the
pipeline should be the artifact, as it has all the dependencies bundled in it
and is capable of executing code with consistent dependencies. And if we don't
want consistent dependencies across environments, then the pipeline code itself
should be considered the artifact, as it can work with different dependencies.
For context, in Dataflow we pack multiple pipelines into a single jar for Java,
and for Python we generate a separate par for each pipeline (we do publish them
as a single mpm). Further, this does not materialize the pipeline but creates
an executable which is later used in an environment having the right SDK
installed. The submission process just runs "python test_pipeline.par
--runner=DataflowRunner --apiary=testapiary...." which goes through the
Dataflow job submission API and is submitted as a regular Dataflow job.
This is similar to the Docker model, just that instead of Docker we use a par
file and execute it using Python/Java.
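The entry point inside such an executable might look roughly like this (a
sketch only; the transforms are placeholders), with pipeline construction and
job submission both happening at invocation time:
{code:python}
# Minimal sketch of an entry point bundled into a par/jar. Running it both
# constructs the pipeline and submits it, so the right SDK must be present
# wherever this executes. The transforms are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def main(argv=None):
    # e.g. invoked as: python test_pipeline.par --runner=DataflowRunner ...
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as p:
        p | beam.Create([1, 2, 3]) | beam.Map(print)

if __name__ == '__main__':
    main()
{code}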
{quote}The other use case is bundling multiple pipelines into the same
container and select which to run at launch time.
{quote}
This will save some space at the time of deployment, specifically the job
server jar and the staged pipeline artifacts if they are shared. We don't
really introspect the staged artifacts, so we don't know what can and can't be
shared across pipelines. I think a better approach would be to just write a
separate script to merge multiple pipeline jars (each a jar with a single
pipeline) and replace the main class with one that takes the name of the
pipeline to pick the right proto, as sketched below. The script can be
infrastructure aware and can make the appropriate lib changes. Beam does not
have a notion of multiple pipelines in any sense, so it will be interesting to
see how we model this if we decide to introduce it in Beam.
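A launch-time dispatcher along these lines might look like the following (a
sketch only; the pipeline names and bodies are made up):
{code:python}
# Hypothetical dispatcher main: one artifact holds several pipelines, and the
# name passed at launch time selects which one to construct and run.
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def wordcount(beam_args):
    with beam.Pipeline(options=PipelineOptions(beam_args)) as p:
        (p | beam.Create(['a b', 'b c'])
           | beam.FlatMap(str.split)
           | beam.combiners.Count.PerElement()
           | beam.Map(print))

def echo(beam_args):
    with beam.Pipeline(options=PipelineOptions(beam_args)) as p:
        p | beam.Create(['hello']) | beam.Map(print)

PIPELINES = {'wordcount': wordcount, 'echo': echo}

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--pipeline', choices=sorted(PIPELINES), required=True)
    args, beam_args = parser.parse_known_args()
    # All remaining flags are forwarded to the selected pipeline's options.
    PIPELINES[args.pipeline](beam_args)

if __name__ == '__main__':
    main()
{code}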
Note: As the pipelines are materialized, they will still not work across
environments.
Please let me know if you have any ideas for solving this.
> Optionally bundle multiple pipelines into a single Flink jar
> ------------------------------------------------------------
>
> Key: BEAM-8183
> URL: https://issues.apache.org/jira/browse/BEAM-8183
> Project: Beam
> Issue Type: New Feature
> Components: runner-flink
> Reporter: Kyle Weaver
> Assignee: Kyle Weaver
> Priority: Major
> Labels: portability-flink
>
> [https://github.com/apache/beam/pull/9331#issuecomment-526734851]
> "With Flink you can bundle multiple entry points into the same jar file and
> specify which one to use with optional flags. It may be desirable to allow
> inclusion of multiple pipelines for this tool also, although that would
> require a different workflow. Absent this option, it becomes quite convoluted
> for users that need the flexibility to choose which pipeline to launch at
> submission time."