[
https://issues.apache.org/jira/browse/BEAM-11275?focusedWorklogId=651453&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-651453
]
ASF GitHub Bot logged work on BEAM-11275:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 16/Sep/21 03:22
Start Date: 16/Sep/21 03:22
Worklog Time Spent: 10m
Work Description: aaltay commented on pull request #15105:
URL: https://github.com/apache/beam/pull/15105#issuecomment-920544932
I still do not think this is the right solution. The reason is:
Ideally users would be responsible for freezing their dependencies, but
that is not the case, and that is not how users are operating today. Even with
frozen dependencies it is possible for sub-dependencies to change, or even for
a dependency to change under the same version (as far as I know, PyPI does not
prevent this). Additionally, with the GCS staging folder, changing dependencies
would require an active user action; with PyPI they could change without any
user interaction. (More issues: PyPI does not have SLOs as high as GCS, and
Dataflow pipelines failing at startup because of PyPI dependencies is a common
user issue.)
I understand the use case you are describing. I would think that custom
containers would actually result in a faster experience. Note that with custom
containers the container is built only once, on the local machine; without
them the same container will be built N times, once for each worker.
The suggestion to add support for remote packages to extra_package sounds
good to me. I actually thought this was already the case (see [1]). Is it not
working? It seems it does not support the gs:// scheme but does support
https://. You could add gs:// support there, or I believe you could get an
https URL for files that exist on GCS and use that instead.
[1]
https://github.com/apache/beam/blob/abfd4c662a6701feef078c36551c15bf4303eef4/sdks/python/apache_beam/runners/portability/stager.py#L577
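
To illustrate the https workaround mentioned above, here is a minimal sketch
that maps a gs:// path to the storage.googleapis.com URL form. The function
name `gcs_to_https` and the example bucket are hypothetical and not part of
Beam. It assumes the object is publicly readable (otherwise the plain https
URL needs credentials), and it has not been checked against the stager code
in [1].

```python
from urllib.parse import quote

GCS_PREFIX = "gs://"


def gcs_to_https(gcs_path):
    """Map gs://bucket/object to https://storage.googleapis.com/bucket/object.

    Hypothetical helper for illustration only; not a Beam API.
    """
    if not gcs_path.startswith(GCS_PREFIX):
        raise ValueError("expected a gs:// path, got %r" % gcs_path)
    # Split "bucket/some/object" into the bucket and the object name.
    bucket, _, object_name = gcs_path[len(GCS_PREFIX):].partition("/")
    if not bucket or not object_name:
        raise ValueError("expected gs://bucket/object, got %r" % gcs_path)
    # Percent-encode the object name but keep "/" as a path separator.
    return "https://storage.googleapis.com/%s/%s" % (
        bucket, quote(object_name, safe="/"))


if __name__ == "__main__":
    # Example bucket and package names are made up.
    print(gcs_to_https("gs://my-bucket/pkgs/my_pkg-0.1.tar.gz"))
```

The resulting URL could then be tried as the extra_package value; whether the
stager accepts it is the open question raised in the comment.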
Issue Time Tracking
-------------------
Worklog Id: (was: 651453)
Time Spent: 9h (was: 8h 50m)
> Support GCS files for extra_requirements argument in Python Beam portable
> runners
> ---------------------------------------------------------------------------------
>
> Key: BEAM-11275
> URL: https://issues.apache.org/jira/browse/BEAM-11275
> Project: Beam
> Issue Type: Improvement
> Components: sdk-py-core
> Reporter: Gerard Casas Saez
> Assignee: Calvin Leung
> Priority: P2
> Time Spent: 9h
> Remaining Estimate: 0h
>
> Currently, portable runners only support locally available files for adding
> dependencies on remote workers. This can be seen in
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/stager.py#L429
> as it uses shutil.copyfile when it detects that the file is remote and is not
> http. An easy extension would be to extend _is_remote_path in Stager to detect
> whether the path matches any filesystem and, if it does, then avoid
> downloading and let it be copied afterwards.
> Acceptance criteria:
> - `extra_package` can be a GCS path instead of requiring it to be local only.
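
The dispatch described in the issue can be sketched roughly as follows. This
is an illustration under stated assumptions, not the Stager implementation:
`classify_package_path` and its return labels are hypothetical names, and
treating any non-http scheme as a "filesystem" path to be copied later is
only one reading of the proposal.

```python
HTTP_SCHEMES = ("http", "https")


def classify_package_path(path):
    """Classify a package path as 'local', 'http', or 'filesystem'.

    Hypothetical helper for illustration only; not a Beam API.
    """
    # No scheme separator: treat as a file on the local machine.
    if "://" not in path:
        return "local"
    scheme = path.split("://", 1)[0].lower()
    # http(s) URLs would be downloaded before staging.
    if scheme in HTTP_SCHEMES:
        return "http"
    # Any other scheme (e.g. gs://) is assumed to belong to a registered
    # filesystem, so the download is skipped and the file is copied later.
    return "filesystem"


if __name__ == "__main__":
    # Example paths are made up.
    for p in ("/tmp/pkg.tar.gz",
              "https://example.com/pkg.tar.gz",
              "gs://my-bucket/pkg.tar.gz"):
        print(p, "->", classify_package_path(p))
```

A real change would also need to confirm that the scheme is one the SDK's
filesystem layer can actually handle, which this sketch does not check.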