[ 
https://issues.apache.org/jira/browse/BEAM-11275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360335#comment-17360335
 ] 

Calvin Leung edited comment on BEAM-11275 at 6/9/21, 9:33 PM:
--------------------------------------------------------------

Hi [~ibzib], in the old version of GCS downloading support, this function 
downloaded files from GCS:
{{ def _dependency_file_copy(from_path, to_path):}}
{{     if from_path.startswith('gs://') or to_path.startswith('gs://'):}}
{{         command_args = ['gsutil', '-m', '-q', 'cp', from_path, to_path]}}
{{         logging.info('Executing command: %s', command_args)}}
{{         result = processes.call(command_args)}}

If the Dataflow worker runs as a service account that has permission to 
download files from a private GCS bucket, would `gsutil` work, since it would 
pick up its permissions from that service account?

Alternatively, would it be a good idea to use the [GCS Python API to download the 
object|https://cloud.google.com/storage/docs/downloading-objects#code-samples]?
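
For illustration, a minimal sketch of the second option might look like the following. It assumes the `google-cloud-storage` package is installed on the worker and that Application Default Credentials resolve to the worker's service account; the helper names `split_gcs_path` and `download_gcs_file` are hypothetical and are not part of Beam.

```python
def split_gcs_path(gcs_path):
    """Split 'gs://bucket/path/to/object' into (bucket, object_name)."""
    prefix = 'gs://'
    if not gcs_path.startswith(prefix):
        raise ValueError('Not a GCS path: %r' % gcs_path)
    bucket, _, object_name = gcs_path[len(prefix):].partition('/')
    if not bucket or not object_name:
        raise ValueError('GCS path needs a bucket and an object: %r' % gcs_path)
    return bucket, object_name


def download_gcs_file(from_path, to_path):
    """Download a single GCS object to a local file (hypothetical helper)."""
    # Imported lazily so that the path parsing above has no extra dependency.
    from google.cloud import storage

    bucket_name, object_name = split_gcs_path(from_path)
    # Assumption: storage.Client() uses Application Default Credentials,
    # which on the worker should normally be the attached service account.
    client = storage.Client()
    client.bucket(bucket_name).blob(object_name).download_to_filename(to_path)
```

Unlike the `gsutil` approach, this would not depend on the `gsutil` binary being present on the worker, at the cost of a dependency on the client library.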


was (Author: calvinleungyk):
Hi [~ibzib], in the old version of GCS downloading support, this function 
downloaded files from GCS:

```
def _dependency_file_copy(from_path, to_path):
    if from_path.startswith('gs://') or to_path.startswith('gs://'):
        command_args = ['gsutil', '-m', '-q', 'cp', from_path, to_path]
        logging.info('Executing command: %s', command_args)
        result = processes.call(command_args)
```

If the Dataflow worker runs as a service account that has permission to 
download files from a private GCS bucket, would `gsutil` work, since it would 
pick up its permissions from that service account?

Alternatively, would it be a good idea to use the [GCS Python API to download the 
object|https://cloud.google.com/storage/docs/downloading-objects#code-samples]?

> Support GCS files for extra_requirements argument in Python Beam portable 
> runners
> ---------------------------------------------------------------------------------
>
>                 Key: BEAM-11275
>                 URL: https://issues.apache.org/jira/browse/BEAM-11275
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core
>            Reporter: Gerard Casas Saez
>            Assignee: Calvin Leung
>            Priority: P2
>
> Currently, portable runners only support locally available files for adding 
> dependencies on remote workers. This can be seen in 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/stager.py#L429
>  as it uses shutil.copyfile when it detects that the file is remote and is not http.
> An easy extension would be to extend _is_remote_path in Stager to detect whether 
> the path matches any filesystem and, if it does, to avoid downloading and let 
> the file be copied afterwards.
> Acceptance criteria:
> - `extra_package` can be a GCS path instead of requiring it to be local only.
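
The scheme check proposed in the description could be sketched roughly as below. This is only an illustration under stated assumptions: `classify_dependency_path`, the regular expression, and the default list of filesystem schemes are hypothetical and do not reflect the actual code in stager.py.

```python
import re

# Matches a leading URI scheme such as "gs://" or "https://".
_SCHEME_PATTERN = re.compile(r'^(?P<scheme>[a-zA-Z][a-zA-Z0-9+.-]*)://')

# Schemes that would be fetched with a plain HTTP download.
_HTTP_SCHEMES = frozenset(['http', 'https'])


def classify_dependency_path(path, filesystem_schemes=('gs', 's3', 'hdfs')):
    """Return 'local', 'http', 'filesystem', or 'unknown' for a path.

    The filesystem_schemes default is an assumption for this sketch; a real
    implementation would ask the registered filesystems instead.
    """
    match = _SCHEME_PATTERN.match(path)
    if match is None:
        return 'local'
    scheme = match.group('scheme').lower()
    if scheme in _HTTP_SCHEMES:
        return 'http'
    if scheme in filesystem_schemes:
        return 'filesystem'
    return 'unknown'
```

A stager could then skip the download step for paths classified as 'filesystem' and leave them to be copied afterwards, as the description suggests.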



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
