calvinleungyk commented on pull request #15105: URL: https://github.com/apache/beam/pull/15105#issuecomment-899656839
@ihji We are using Dataflow and Beam with mainly GCP services and GCS storage. The problem we are running into is that we cannot upload remote packages to GCS for Dataflow to use from Kubeflow Pipelines, so we have to rebuild a custom image containing the remote package every time, which works but is a hassle for our users. Since Dataflow needs to fetch staged artifacts either way, it seems logical to add support for remote artifacts as well and leave network instability issues to the user; this feature was also supported in Beam previously but was removed at some point. If you think supporting non-GCS artifacts would be a bigger issue, we could scope it down to GCS only to reduce reliance on third-party services. In previous discussions, @ibzib and I thought it would be useful to generalize it.

I have already created the URL artifact information and populated it with a URL payload, but the blocker is the lack of support for it in `materialize.go`. My current approach is to trace through how artifacts are materialized, and it seems I would have to add logic in `extractStagingToPath` so that it doesn't reject URL artifacts.

Could you elaborate a bit more on "without being materialized during job submission"? Also, do you have an issue or tracker for the public Python Dataflow SDK harness work? Thanks for your help!
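To make the intent concrete, here is a minimal sketch of the kind of helper `extractStagingToPath` could delegate to when it encounters an artifact whose type URN is `beam:artifact:type:url:v1` (the URL payload type from `beam_runner_api.proto`), instead of rejecting it. The name `fetchURLArtifact` and the plain-HTTP download are assumptions for illustration, not existing code in `materialize.go`; `gs://` URLs would additionally need the GCS client or Beam's filesystem abstraction, and retries/sha256 verification are omitted.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
)

// fetchURLArtifact (hypothetical) downloads an http(s) artifact URL into dest
// and returns the number of bytes written. The URL payload also carries an
// optional sha256 that a real implementation should verify after the copy.
func fetchURLArtifact(ctx context.Context, url, dest string) (int64, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("fetching %v: unexpected status %v", url, resp.Status)
	}

	// Stage the artifact under the local directory the harness expects.
	if err := os.MkdirAll(filepath.Dir(dest), 0755); err != nil {
		return 0, err
	}
	f, err := os.Create(dest)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	return io.Copy(f, resp.Body)
}

func main() {
	// Example (hypothetical URL/path): stage a remote wheel locally.
	n, err := fetchURLArtifact(context.Background(),
		"https://example.com/pkg/my_package-0.1.0-py3-none-any.whl",
		"/tmp/staged/my_package-0.1.0-py3-none-any.whl")
	if err != nil {
		fmt.Println("download failed:", err)
		return
	}
	fmt.Printf("staged %d bytes\n", n)
}
```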
