[
https://issues.apache.org/jira/browse/BEAM-11275?focusedWorklogId=650520&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-650520
]
ASF GitHub Bot logged work on BEAM-11275:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 14/Sep/21 11:44
Start Date: 14/Sep/21 11:44
Worklog Time Spent: 10m
Work Description: calvinleungyk commented on pull request #15105:
URL: https://github.com/apache/beam/pull/15105#issuecomment-919072236
For 1., I would argue that it's the user's responsibility to ensure the
pipeline is reading a consistent set of artifacts. If the user doesn't freeze
package versions in Python's `requirements.txt`, they will get inconsistent
libraries downloaded across regular Python job invocations. Even for the GCS
staging location here, the contents are still overwritable and nothing much is
done to ensure the artifacts are consistent across pipeline runs.
Custom containers could be a feasible solution and we are currently looking
into it. The downside is that users need to rebuild a container every time
they change or add dependencies or files; this is flexible, but it does not
provide the best user experience. For the `setup.py` approach, users would
need to learn how to write and structure the `setup.py` file before they can
start testing with the pipeline, which increases the overhead and introduces
friction for rapid experimentation.
For our use case, users compile a TFX pipeline (that uses Beam) on a local
machine with `extra_packages` and then send that to a remote machine in a
Kubeflow cluster. When the Kubeflow machine runs the pipeline, it only has the
pipeline but not the `extra_packages` files. As `extra_packages` only supports
local paths, the job that is launched on the remote Kubeflow machine fails. In
the Dataflow runner's case, GCS buckets are already used as staging locations,
so deferring GCS downloads to Dataflow workers doesn't seem like a big change.
Alternatively, if we don't want to defer GCS downloads to Dataflow workers,
we can use an approach similar to what was originally proposed in
https://issues.apache.org/jira/browse/BEAM-11275, and support GCS paths in
[stager.py](https://github.com/apache/beam/blob/92aebe4d8837b6c5a598acc489e14c72348acd8c/sdks/python/apache_beam/runners/portability/stager.py#L504)
so that the remote runner can download the package from a GCS path (not just
a local one) and upload it to the staging bucket.
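
To make the alternative concrete, here is a minimal, standard-library-only sketch of the idea, not the actual `stager.py` code. The names `REMOTE_SCHEMES`, `is_remote_package`, `stage_extra_package`, and the `fetch_remote` callback are all hypothetical; a real change would presumably delegate the download to Beam's own filesystem layer rather than a caller-supplied function.

```python
import os
import posixpath
import shutil
from urllib.parse import urlparse

# Hypothetical set of schemes treated as remote; not taken from stager.py.
REMOTE_SCHEMES = frozenset({"gs"})


def is_remote_package(path):
    """Return True when the path uses a scheme this sketch treats as remote."""
    return urlparse(path).scheme in REMOTE_SCHEMES


def stage_extra_package(path, staging_dir, fetch_remote):
    """Place one extra package in staging_dir and return the staged path.

    fetch_remote(src, dst) is a caller-supplied stand-in for a real GCS
    download (an assumption of this sketch).
    """
    os.makedirs(staging_dir, exist_ok=True)
    if is_remote_package(path):
        # Remote path: derive the file name from the URL path component.
        name = posixpath.basename(urlparse(path).path)
        target = os.path.join(staging_dir, name)
        fetch_remote(path, target)
    else:
        # Local path: plain file copy, as happens for local packages today.
        target = os.path.join(staging_dir, os.path.basename(path))
        shutil.copyfile(path, target)
    return target
```

With something like this, the machine that launches the job only needs read access to the GCS object, not a local copy of the package.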
Compared to rebuilding containers multiple times or having our users learn
how to write and structure a `setup.py` properly, this provides the convenience
that matches the existing user experience, so ideally we can merge this change
to support GCS paths somewhere.
Would be great if @aaltay and @ibzib could review this proposal. Thank you
very much!
Issue Time Tracking
-------------------
Worklog Id: (was: 650520)
Time Spent: 8h 50m (was: 8h 40m)
> Support GCS files for extra_requirements argument in Python Beam portable
> runners
> ---------------------------------------------------------------------------------
>
> Key: BEAM-11275
> URL: https://issues.apache.org/jira/browse/BEAM-11275
> Project: Beam
> Issue Type: Improvement
> Components: sdk-py-core
> Reporter: Gerard Casas Saez
> Assignee: Calvin Leung
> Priority: P2
> Time Spent: 8h 50m
> Remaining Estimate: 0h
>
> Currently, portable runners only support locally available files for adding
> dependencies on remote workers. This can be seen in
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/stager.py#L429
> as it uses shutil.copyfile when it detects that the file is remote and not http.
> An easy extension would be to extend _is_remote_path in Stager to detect
> whether the path matches any filesystem and, if it does, to avoid downloading
> it and let it be copied afterwards.
> Acceptance criteria:
> - `extra_package` can be a GCS path instead of requiring it to be local only.
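
The extension described in the issue can be sketched roughly as follows. This is an illustrative, standard-library-only example, not Beam's `_is_remote_path`; the scheme set `KNOWN_FILESYSTEM_SCHEMES` and the function names are assumptions, and in Beam the set of known schemes would come from the registered filesystems rather than a hard-coded constant.

```python
import re

# Hypothetical registry of filesystem schemes (assumption of this sketch).
KNOWN_FILESYSTEM_SCHEMES = frozenset({"gs", "s3", "hdfs"})

# Matches a leading "scheme://" prefix.
_SCHEME_RE = re.compile(r"^([a-zA-Z][a-zA-Z0-9+.-]*)://")


def path_scheme(path):
    """Return the lower-cased scheme of the path, or None for a plain path."""
    match = _SCHEME_RE.match(path)
    return match.group(1).lower() if match else None


def classify_package_path(path):
    """Classify a package path as 'http', 'filesystem', or 'local'.

    'http' paths are downloaded, 'filesystem' paths skip the download and
    are left to be copied afterwards, and 'local' paths are copied as today.
    """
    scheme = path_scheme(path)
    if scheme in ("http", "https"):
        return "http"
    if scheme in KNOWN_FILESYSTEM_SCHEMES:
        return "filesystem"
    return "local"
```

For example, `classify_package_path("gs://bucket/pkg.tar.gz")` falls into the `'filesystem'` branch, so the stager would skip the local download for it.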