[
https://issues.apache.org/jira/browse/BEAM-11275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360335#comment-17360335
]
Calvin Leung edited comment on BEAM-11275 at 6/9/21, 9:33 PM:
--------------------------------------------------------------
Hi [~ibzib], in the old version of GCS downloading support, this function
downloads files from GCS:
{{def _dependency_file_copy(from_path, to_path):}}
{{    if from_path.startswith('gs://') or to_path.startswith('gs://'):}}
{{        command_args = ['gsutil', '-m', '-q', 'cp', from_path, to_path]}}
{{        logging.info('Executing command: %s', command_args)}}
{{        result = processes.call(command_args)}}
If the Dataflow worker runs as a service account that has permission to
download files from a private GCS bucket, would `gsutil` work, given that it
infers its permissions from that service account?
Alternatively, would using the [GCS Python API to download the
object|https://cloud.google.com/storage/docs/downloading-objects#code-samples]
be a good idea?
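To make the second option concrete, here is a minimal sketch of what a download through the GCS Python client could look like. It assumes the `google-cloud-storage` package is installed on the worker and that the client picks up the worker's service account through Application Default Credentials. The helper names `split_gcs_path` and `download_gcs_file` are hypothetical and are not part of Beam.

```python
def split_gcs_path(gcs_path):
    """Split 'gs://bucket/path/to/object' into ('bucket', 'path/to/object')."""
    prefix = 'gs://'
    if not gcs_path.startswith(prefix):
        raise ValueError('Not a GCS path: %r' % gcs_path)
    bucket_name, _, blob_name = gcs_path[len(prefix):].partition('/')
    if not bucket_name or not blob_name:
        raise ValueError('Expected gs://<bucket>/<object>, got %r' % gcs_path)
    return bucket_name, blob_name


def download_gcs_file(from_path, to_path):
    """Download a single GCS object to a local file (hypothetical helper)."""
    # Imported lazily so that only callers that touch GCS need the dependency.
    from google.cloud import storage

    bucket_name, blob_name = split_gcs_path(from_path)
    # With no explicit credentials, the client falls back to Application
    # Default Credentials (assumed here to resolve to the worker's
    # service account).
    client = storage.Client()
    client.bucket(bucket_name).blob(blob_name).download_to_filename(to_path)
```

Compared with shelling out to `gsutil`, this would avoid depending on the CLI being present in the worker image, at the cost of an extra Python dependency.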
was (Author: calvinleungyk):
Hi [~ibzib], in the old version of GCS downloading support, this function
downloads files from GCS:
```
def _dependency_file_copy(from_path, to_path):
    if from_path.startswith('gs://') or to_path.startswith('gs://'):
        command_args = ['gsutil', '-m', '-q', 'cp', from_path, to_path]
        logging.info('Executing command: %s', command_args)
        result = processes.call(command_args)
```
If the Dataflow worker runs as a service account that has permission to
download files from a private GCS bucket, would `gsutil` work, given that it
infers its permissions from that service account?
Alternatively, would using the [GCS Python API to download the
object|https://cloud.google.com/storage/docs/downloading-objects#code-samples]
be a good idea?
> Support GCS files for extra_requirements argument in Python Beam portable
> runners
> ---------------------------------------------------------------------------------
>
> Key: BEAM-11275
> URL: https://issues.apache.org/jira/browse/BEAM-11275
> Project: Beam
> Issue Type: Improvement
> Components: sdk-py-core
> Reporter: Gerard Casas Saez
> Assignee: Calvin Leung
> Priority: P2
>
> Currently, portable runners only support locally available files when adding
> dependencies to remote workers. This can be seen in
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/stager.py#L429
> as it uses shutil.copyfile when it detects that the file is remote and is not http.
> An easy extension would be to extend _is_remote_path in Stager to detect whether
> the path matches any filesystem and, if it does, to avoid downloading and let
> it be copied afterwards.
> Acceptance criteria:
> - `extra_package` can be a GCS path instead of requiring it to be local only.
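As a rough illustration of the extension proposed in the description, the check could be sketched as a scheme test on the path. This is not the actual `Stager._is_remote_path` code: the function names and the scheme set below are hypothetical, and a real change would presumably ask Beam's filesystem registry which schemes are supported instead of hard-coding them.

```python
import re

# Hypothetical list of schemes treated as "remote filesystem" paths.
_REMOTE_FS_SCHEMES = frozenset({'gs', 's3', 'hdfs'})

# Matches a leading URL scheme such as 'gs://' or 'https://'.
_SCHEME_RE = re.compile(r'^([a-zA-Z][a-zA-Z0-9+.-]*)://')


def path_scheme(path):
    """Return the lower-cased scheme of a path, or None for local paths."""
    match = _SCHEME_RE.match(path)
    return match.group(1).lower() if match else None


def is_http_path(path):
    """True for paths that would be fetched over HTTP(S)."""
    return path_scheme(path) in ('http', 'https')


def is_remote_filesystem_path(path):
    """True for paths on a known remote filesystem (not HTTP, not local)."""
    return path_scheme(path) in _REMOTE_FS_SCHEMES
```

With a check like this, a `gs://` path could skip the local `shutil.copyfile` step and be left for the later copy, while HTTP and local paths keep their current handling.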