[ 
https://issues.apache.org/jira/browse/BEAM-11275?focusedWorklogId=638279&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-638279
 ]

ASF GitHub Bot logged work on BEAM-11275:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 16/Aug/21 16:43
            Start Date: 16/Aug/21 16:43
    Worklog Time Spent: 10m 
      Work Description: calvinleungyk commented on pull request #15105:
URL: https://github.com/apache/beam/pull/15105#issuecomment-899656839


   @ihji We use Dataflow and Beam mainly with GCP services and GCS storage. 
The problem we are running into is that we cannot upload remote packages to 
GCS for use from Dataflow on Kubeflow Pipelines, so we have to rebuild a custom 
image that contains the remote package every time. That works, but it is a 
hassle for our users. Since Dataflow needs to fetch staged artifacts either 
way, it seems logical to add support for remote artifacts as well and leave 
network instability issues to the user. Beam also supported this feature 
previously, but it was removed at some point. 
   
   If you think supporting non-GCS artifacts would be a bigger issue, we could 
scope it down to GCS only to reduce reliance on third-party services. In 
previous discussions, @ibzib and I thought it would be useful to generalize it.
   
   I have already created the URL artifact information and populated it with a 
URL payload, but the blocker is support for it in `materialize.go`. My current 
approach is to trace through how artifacts are materialized, and it seems I 
would have to add logic in `extractStagingToPath` so that it doesn't reject URL 
artifacts. Could you elaborate a bit more on "without being materialized during 
job submission"?
   
   Also, do you have an issue or tracker for the public Python Dataflow SDK 
harness work?
   
   Thanks for your help!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.



Issue Time Tracking
-------------------

    Worklog Id:     (was: 638279)
    Time Spent: 7h 50m  (was: 7h 40m)

> Support GCS files for extra_requirements argument in Python Beam portable 
> runners
> ---------------------------------------------------------------------------------
>
>                 Key: BEAM-11275
>                 URL: https://issues.apache.org/jira/browse/BEAM-11275
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core
>            Reporter: Gerard Casas Saez
>            Assignee: Calvin Leung
>            Priority: P2
>          Time Spent: 7h 50m
>  Remaining Estimate: 0h
>
> Currently, portable runners only support locally available files for adding 
> dependencies on remote workers. This can be seen in 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/stager.py#L429
> as it uses shutil.copyfile when it detects that a file is remote and not HTTP.
> An easy extension would be to extend _is_remote_path in Stager to detect 
> whether the path matches any filesystem and, if it does, avoid downloading 
> and let it be copied afterwards. 
> Acceptance criteria:
> - `extra_package` can be a GCS path instead of requiring it to be local only.
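
A minimal sketch of the decision the issue description proposes, assuming a 
fixed set of filesystem schemes. The names `path_scheme`, `staging_action`, 
and `KNOWN_FILESYSTEM_SCHEMES` are hypothetical and are not part of Beam's 
Stager; the real change would consult Beam's own filesystem registry rather 
than a hard-coded set.

```python
import re

# Assumption: schemes treated as remote filesystems for illustration only.
# Beam's actual filesystem registry is not consulted here.
KNOWN_FILESYSTEM_SCHEMES = {"gs", "s3", "hdfs"}
HTTP_SCHEMES = {"http", "https"}

# Matches a leading URI scheme such as "gs://" or "https://".
_SCHEME_RE = re.compile(r"^([a-zA-Z][a-zA-Z0-9+.-]*)://")


def path_scheme(path):
    """Return the lowercased scheme of `path`, or None for a local path."""
    match = _SCHEME_RE.match(path)
    return match.group(1).lower() if match else None


def staging_action(path):
    """Classify how a dependency path could be handled at staging time.

    Hypothetical helper: returns a label instead of performing any I/O.
    """
    scheme = path_scheme(path)
    if scheme is None:
        # No scheme: a locally available file that can be copied directly.
        return "copy_local"
    if scheme in HTTP_SCHEMES:
        # HTTP(S) sources are downloaded before staging.
        return "download_http"
    if scheme in KNOWN_FILESYSTEM_SCHEMES:
        # Matches a known filesystem: skip the download, copy it afterwards.
        return "defer_copy"
    return "unsupported"
```

With this sketch, a GCS path passed as `extra_package` is classified as 
`defer_copy` instead of being handed to a local file copy.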



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
