[
https://issues.apache.org/jira/browse/BEAM-12792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480109#comment-17480109
]
Jens Wiren commented on BEAM-12792:
-----------------------------------
[~tvalentyn] Thanks, I'll give the mailing list a try!
[~ynait] Beam has several ways to specify dependencies of the beam worker
process (just google beam python dependecies). I have been building a whl from
my pipeline code and then passed that to beam using the --extra-package flag.
BTW, TFX does this automatically in an attempt to ensure that the runner has
the correct TFX version installed.
Currently, I'm doing something similar but there are several limitations:
# You can only run one version of your custom code at any given time on your
flink cluster. If you want to run updated code, you need to redeploy flink.
# You have to manually keep track of when your beam workers custom code is out
of date and only occupies space in your cluster etc
In the ideal (intended!) scenario, you have one auto scaling flink deploy and
you just submit your beam jobs and your dependencies per job
> Beam worker only installs --extra_package once
> ----------------------------------------------
>
> Key: BEAM-12792
> URL: https://issues.apache.org/jira/browse/BEAM-12792
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-harness
> Affects Versions: 2.27.0, 2.28.0, 2.29.0, 2.30.0, 2.31.0
> Environment: Kubernetes 1.20 on Ubuntu 18.04.
> Reporter: Jens Wiren
> Priority: P1
> Labels: FlinkRunner, beam
>
> I'm running TFX pipelines on a Flink cluster using Beam in k8s. However,
> extra python packages passed to the Flink runner (or rather beam worker
> side-car) are only installed once per deployment cycle. Example:
> # Flink is deployed and is up and running
> # A TFX pipeline starts, submits a job to Flink along with a python whl of
> custom code and beam ops.
> # The beam worker installs the package and the pipeline finishes succesfully.
> # A new TFX pipeline is build where a new beam fn is introduced, the pipline
> is started and the new whl is submitted as in step 2).
> # This time, the new package is not being installed in the beam worker
> causing the job to fail due to a reference which does not exist in the beam
> worker, since it didn't install the new package.
>
> I started using Flink from beam version 2.27 and it has been an issue all the
> time.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)