[
https://issues.apache.org/jira/browse/BEAM-11275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325256#comment-17325256
]
Kyle Weaver commented on BEAM-11275:
------------------------------------
# There are two steps, artifact staging (the client that submits the job
uploads artifacts to the staging location) and artifact retrieval (workers
download artifacts from the staging location).
# Assuming you meant "Does the worker already download everything on the
staging_location?" - no. The worker only downloads the requested artifacts.
# "If I change artifact_types.FILE.urn to artifact_types.URL.urn in
[create_file_stage_to_artifact|https://github.com/apache/beam/blob/a86dc0609f0b1bcc0c450979363b27b2657418af/sdks/python/apache_beam/runners/portability/stager.py#L121],
does the URL gets automatically downloaded in the worker in later stages?" -
yes. Take a look at the artifact retrieval code:
[https://github.com/apache/beam/blob/e0136ffc176d157d0928e7d501bca4daca3160a8/sdks/python/apache_beam/runners/portability/artifact_service.py#L81-L85]
Note, however, that it uses urllib to download files, which as far as I know
does not support GCS.
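As a rough illustration of the urllib limitation mentioned above, here is a minimal sketch using only the standard library. The helper name try_urllib_fetch and the bucket path are hypothetical and are not taken from the Beam code base; the sketch only shows what plain urllib does when handed a gs:// URL.

```python
import urllib.error
import urllib.request


def try_urllib_fetch(url):
    """Return (ok, detail) for an attempt to open `url` with plain urllib.

    Hypothetical helper for illustration only; not part of Apache Beam.
    """
    try:
        with urllib.request.urlopen(url) as response:
            return True, response.read()
    except urllib.error.URLError as exc:
        # The default urllib opener has no handler for the gs:// scheme,
        # so the call fails before any network traffic is attempted.
        return False, str(exc.reason)


# Hypothetical GCS path, used only to exercise the unsupported scheme.
ok, detail = try_urllib_fetch("gs://example-bucket/path/to/package.tar.gz")
print(ok, detail)
```

If that holds, switching the artifact to a URL type would only help for http(s) locations unless the retrieval side is also taught to read GCS paths some other way.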
> Support GCS files for extra_requirements argument in Python Beam portable
> runners
> ---------------------------------------------------------------------------------
>
> Key: BEAM-11275
> URL: https://issues.apache.org/jira/browse/BEAM-11275
> Project: Beam
> Issue Type: Improvement
> Components: sdk-py-core
> Reporter: Gerard Casas Saez
> Assignee: Calvin Leung
> Priority: P2
>
> Currently, portable runners only support locally available files when adding
> dependencies to remote workers. This can be seen in
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/stager.py#L429
> which uses shutil.copyfile when it detects that a file is remote but not HTTP.
> An easy extension would be to extend _is_remote_path in Stager to detect
> whether the path matches any filesystem and, if it does, skip the download and
> let the file be copied afterwards.
> Acceptance criteria:
> - `extra_package` can be a GCS path instead of requiring it to be local only.
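The path check proposed in the description could be sketched roughly as follows. This is not the actual Stager code: classify_path and both scheme sets are hypothetical names introduced for illustration, and the list of filesystem schemes is an assumption rather than the set Beam actually registers.

```python
from urllib.parse import urlparse

# Assumed schemes for illustration; the real set would come from whatever
# filesystems are registered with Beam.
REMOTE_FILESYSTEM_SCHEMES = frozenset({"gs", "s3", "hdfs"})
HTTP_SCHEMES = frozenset({"http", "https"})


def classify_path(path):
    """Classify a dependency path as 'http', 'remote_filesystem' or 'local'.

    Hypothetical helper; not part of Apache Beam.
    """
    scheme = urlparse(path).scheme.lower()
    if scheme in HTTP_SCHEMES:
        # Can be fetched with a plain HTTP download.
        return "http"
    if scheme in REMOTE_FILESYSTEM_SCHEMES:
        # Candidate for skipping the local download and copying later.
        return "remote_filesystem"
    # No known scheme (or a bare file path): treat as a local file.
    return "local"


for candidate in (
    "gs://example-bucket/pkg.tar.gz",
    "https://example.com/pkg.tar.gz",
    "/tmp/pkg.tar.gz",
):
    print(candidate, "->", classify_path(candidate))
```

Keeping the classification separate from the copy step would let the stager decide per path whether to download, copy locally, or defer to a filesystem-aware copy.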