[ 
https://issues.apache.org/jira/browse/BEAM-11275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322468#comment-17322468
 ] 

Calvin Leung commented on BEAM-11275:
-------------------------------------

Hi [~ibzib], I'm trying to defer downloading both GCS and HTTP files. The 
current approach is to check whether the path starts with http://, https://, 
or gcs://. If so, the file is not downloaded or staged at submission time, and 
the worker downloads it later instead. I'm looking into downloading files on 
the worker, but I have a few questions. 
 # Is 
[sdk_container_builder|https://github.com/apache/beam/blob/6aac541d8f712a692349a642d72680144a6bb420/sdks/python/apache_beam/runners/portability/sdk_container_builder.py#L97]
 where the worker downloads the files from the staging location when a job is 
invoked, or does that happen somewhere else?
 # Does the worker already download everything in the staging location? I.e., 
for a GCS path, does it suffice to just upload the extra packages to the 
{{staging_location}} and let the worker deal with it? If so, the GCS 
permissions issue doesn't seem to be a concern, but I might be missing 
something here.
 # If I change {{artifact_types.FILE.urn}} to {{artifact_types.URL.urn}} in 
[create_file_stage_to_artifact|https://github.com/apache/beam/blob/a86dc0609f0b1bcc0c450979363b27b2657418af/sdks/python/apache_beam/runners/portability/stager.py#L121],
 does the URL get downloaded automatically by the worker in later stages? 
I'm wondering whether I need to keep track of the URLs, propagate them 
somehow, and instruct the worker to download them at initialization.
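
To make the prefix check above concrete, here is a minimal sketch, not the actual Stager code. The names {{DEFERRED_PREFIXES}}, {{is_deferred_path}} and {{partition_packages}} are hypothetical, and including gs:// alongside gcs:// is my assumption, since Cloud Storage paths are usually written with the gs:// scheme.

```python
from typing import Iterable, List, Tuple

# Assumption: schemes whose download is deferred to the worker.
# "gcs://" is the prefix mentioned in this comment; "gs://" is added
# because Cloud Storage URIs are normally written that way.
DEFERRED_PREFIXES = ("http://", "https://", "gcs://", "gs://")


def is_deferred_path(path: str) -> bool:
    """Return True if the path should be fetched later by the worker."""
    return path.lower().startswith(DEFERRED_PREFIXES)


def partition_packages(paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split paths into (stage at submission time, defer to the worker)."""
    to_stage: List[str] = []
    deferred: List[str] = []
    for path in paths:
        if is_deferred_path(path):
            deferred.append(path)
        else:
            to_stage.append(path)
    return to_stage, deferred


if __name__ == "__main__":
    local, remote = partition_packages(
        ["/tmp/a.tar.gz", "https://example.com/b.whl", "gs://bucket/c.tar.gz"]
    )
    print("stage now:", local)
    print("defer to worker:", remote)
```

The deferred list is what would still need to be propagated to the worker, which is what question 3 is about.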

Thanks!

> Support GCS files for extra_requirements argument in Python Beam portable 
> runners
> ---------------------------------------------------------------------------------
>
>                 Key: BEAM-11275
>                 URL: https://issues.apache.org/jira/browse/BEAM-11275
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core
>            Reporter: Gerard Casas Saez
>            Assignee: Calvin Leung
>            Priority: P2
>
> Currently, portable runners only support locally available files for adding 
> dependencies on remote workers. This can be seen in 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/stager.py#L429
>  as it uses shutil.copyfile when it detects that the file is remote and not HTTP.
> An easy extension would be to extend _is_remote_path in Stager to detect whether 
> the path matches any filesystem and, if it does, to avoid downloading it and let 
> it be copied afterwards. 
> Acceptance criteria:
> - `extra_package` can be a GCS path instead of requiring it to be local only.



