[ https://issues.apache.org/jira/browse/BEAM-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943161#comment-16943161 ]

Ankur Goenka commented on BEAM-8183:
------------------------------------

I see. Thanks for explaining the use case.

I think hardcoded pipeline options are definitely a limitation as of now. We 
can look at using Beam's ValueProvider to supply dynamic arguments. We can 
also think about overriding the pipeline options when submitting the jar to Flink.
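For illustration, a minimal sketch of the ValueProvider approach in the Python SDK (the option name and DoFn below are made up for this example, not part of any proposal):
{code:python}
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class MyJobOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Value provider arguments are resolved at run time rather than at
        # graph construction time, so one artifact can be launched with
        # different values.
        parser.add_value_provider_argument('--greeting', type=str,
                                           default='hello')


class AddGreeting(beam.DoFn):
    def __init__(self, greeting):
        self._greeting = greeting  # a ValueProvider

    def process(self, element):
        # .get() defers reading the value until the worker executes.
        yield '%s %s' % (self._greeting.get(), element)


def run(argv=None):
    options = PipelineOptions(argv)
    greeting = options.view_as(MyJobOptions).greeting
    with beam.Pipeline(options=options) as p:
        p | beam.Create(['a', 'b']) | beam.ParDo(AddGreeting(greeting))
{code}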
{quote}Running the same pipeline in different environments with different 
parameters is a common need. Virtually everyone has dev/staging/prod or 
whatever their environments are and they want to use the same build artifact. 
That normally requires some amount of parameterization.
{quote}
I don't really have a good solution for the dev/staging/prod use case. This is not 
going to be solved by a jar with multiple pipelines (as each pipeline will have a 
static set of pipeline options) but by a jar creating dynamic pipelines (as the 
pipeline changes based on the pipeline options and environment).
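To make the distinction concrete, here is a rough sketch (the --environment flag is hypothetical) of construction code whose graph shape depends on the options it is invoked with; this is exactly what a jar of pre-materialized pipeline protos cannot capture:
{code:python}
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def build_pipeline(argv=None):
    # Hypothetical --environment flag, parsed before graph construction.
    parser = argparse.ArgumentParser()
    parser.add_argument('--environment', default='dev')
    known, beam_args = parser.parse_known_args(argv)

    p = beam.Pipeline(options=PipelineOptions(beam_args))
    records = p | beam.Create(['x', 'y', None])
    if known.environment == 'prod':
        # This stage only exists in the prod graph, so the materialized
        # proto differs per environment.
        records = records | beam.Filter(lambda e: e is not None)
    records | beam.Map(print)
    return p
{code}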

The major issue to me seems to be that we need to execute pipeline construction 
code which is environment dependent. To generate new pipelines for an 
environment, we need to execute the pipeline submission code in that 
environment, and this is where I see a problem: Python pipelines have to 
execute user code in Python, using the Python SDK, to construct the pipeline.

Considering this jar as the artifact would not be ideal across environments, as 
the actual SDK/libs etc. can differ between environments. From an environment 
point of view, a Docker container capable of submitting the pipeline should be 
the artifact, as it has all the dependencies bundled in it and is capable of 
executing code with consistent dependencies. And if we don't want consistent 
dependencies across environments, then the pipeline code should be considered 
the artifact, as it can work with different dependencies.

 

For context, in Dataflow we pack multiple pipelines into a single jar for Java, 
and for Python we generate a separate par for each pipeline (we do publish them 
as a single MPM). Further, this does not materialize the pipeline but creates an 
executable which is later used in an environment that has the right SDK 
installed. The submission process just runs "python test_pipeline.par 
--runner=DataflowRunner --apiary=testapiary....", which goes through the 
Dataflow job submission API and is submitted as a regular Dataflow job.

This is similar to the Docker model, except that instead of Docker we use a par 
file and execute it using Python/Java.
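A minimal sketch of that kind of executable entry point (the flags are the ones from the command quoted above; the pipeline body is just a placeholder):
{code:python}
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    # Picks up --runner, --project, etc. from the command line, e.g.
    #   python test_pipeline.par --runner=DataflowRunner ...
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as p:
        p | beam.Create(['hello', 'world']) | beam.Map(print)


if __name__ == '__main__':
    run()
{code}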
{quote}The other use case is bundling multiple pipelines into the same 
container and select which to run at launch time.
{quote}
This will save some space at the time of deployment, specifically the job 
server jar and the staged pipeline artifacts, if they are shared. We don't 
really introspect the staged artifacts, so we don't know what can and can't be 
shared across pipelines. I think a better approach would be to just write a 
separate script that merges multiple pipeline jars (each a jar with a single 
pipeline) and replaces the main class with one that takes the name of the 
pipeline and picks the right proto. The script can be infrastructure aware and 
can make the appropriate lib changes. Beam does not have a notion of multiple 
pipelines in any sense, so it will be interesting to see how we model this if 
we decide to introduce it in Beam.
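Just to sketch what such a merged entry point could look like on the Python side (the registry and pipeline names are purely hypothetical; this is not an existing Beam feature): one bundled artifact, with the pipeline picked by name at launch time.
{code:python}
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def build_wordcount(p):
    p | beam.Create(['a b', 'b']) | beam.FlatMap(str.split) | beam.Map(print)


def build_ingest(p):
    p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2) | beam.Map(print)


# Hypothetical registry mapping pipeline names to construction functions.
PIPELINES = {'wordcount': build_wordcount, 'ingest': build_ingest}


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--pipeline', choices=sorted(PIPELINES), required=True)
    known, beam_args = parser.parse_known_args()
    with beam.Pipeline(options=PipelineOptions(beam_args)) as p:
        PIPELINES[known.pipeline](p)
{code}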

Note: As the pipelines are materialized, they will still not work across 
environments.

 

Please let me know if you have any ideas for solving this.

 

> Optionally bundle multiple pipelines into a single Flink jar
> ------------------------------------------------------------
>
>                 Key: BEAM-8183
>                 URL: https://issues.apache.org/jira/browse/BEAM-8183
>             Project: Beam
>          Issue Type: New Feature
>          Components: runner-flink
>            Reporter: Kyle Weaver
>            Assignee: Kyle Weaver
>            Priority: Major
>              Labels: portability-flink
>
> [https://github.com/apache/beam/pull/9331#issuecomment-526734851]
> "With Flink you can bundle multiple entry points into the same jar file and 
> specify which one to use with optional flags. It may be desirable to allow 
> inclusion of multiple pipelines for this tool also, although that would 
> require a different workflow. Absent this option, it becomes quite convoluted 
> for users that need the flexibility to choose which pipeline to launch at 
> submission time."



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
