> but it was always the intent to have Dataflow support artifact types beyond FILE.

Would you envision Dataflow re-staging the artifacts to GCS, or just passing
them through? In our case I'd prefer the former, since it keeps a job
self-contained in GCS and not dependent on external resources when it
launches.

What's the appetite for a PR that changes the behavior of
artifact_service.maybe_store_artifact to download URL artifacts locally first
(or that changes the Dataflow stager to do so)? From what I can tell it's
currently nearly impossible to actually provide a URL resource through any
standard mechanism, so I doubt that would be a breaking change. A rough
sketch of what I have in mind is at the end of this mail.

Custom containers are certainly another option for us here too, but they seem
more complicated to manage; I was hoping to start simple and move to them
later when we really need the complexity.

On Tue, Nov 2, 2021 at 12:12 PM Luke Cwik <[email protected]> wrote:

> What you are suggesting will work, but it was always the intent to have
> Dataflow support artifact types beyond FILE.
>
> Another option would be to use custom containers with Dataflow and instead
> build the container embedding all (or most) of the artifacts that you need.
> This will help speed up how fast the workers start in Dataflow, in addition
> to not having to work around the fact that only FILE is supported.
>
> On Tue, Nov 2, 2021 at 8:03 AM Steve Niemitz <[email protected]> wrote:
>
>> We're working on running an "expansion service as a service" for xlang
>> transforms. One of the things we'd really like is to serve the actual
>> required artifacts to the client (submitting the pipeline) from our
>> blobstore rather than streaming them through the artifact retrieval API
>> (GetArtifact).
>>
>> I have the expansion service returning the required artifact URLs when
>> expanding the transforms, but after that I've run into some confusion
>> about how it's supposed to work from there.
>>
>> Using the direct runner, the local artifact retrieval service [1] will
>> correctly download the URL resource. However, trying to run in Dataflow
>> causes the stager to break, since it only supports FILE types [2].
>>
>> I _think_ what should happen is that in
>> artifact_service.maybe_store_artifact [3] we should download URL artifacts
>> locally and replace them with a temp file resource (similar to how
>> store_resource works). I made these changes locally and was then able to
>> submit jobs via both the direct runner and Dataflow.
>>
>> Any thoughts here?
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/artifact_service.py#L76
>>
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/internal/apiclient.py#L579
>>
>> [3]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/artifact_service.py#L284
>>
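For concreteness, here's roughly the kind of change I prototyped locally. This
is only a sketch, not the actual patch: the helper name is made up, it assumes
the ArtifactUrlPayload / ArtifactFilePayload / ArtifactInformation messages as
defined in beam_runner_api.proto, and it uses plain urllib for the download,
where a real change would probably want to go through Beam's filesystems so
gs:// and other schemes also work.

import os
import tempfile
import urllib.request

from apache_beam.portability import common_urns
from apache_beam.portability.api import beam_runner_api_pb2


def url_artifact_to_local_file_artifact(artifact, temp_dir):
  """Downloads a URL artifact and rewrites it as a local FILE artifact.

  (Hypothetical helper illustrating the proposed maybe_store_artifact /
  stager change; non-URL artifacts pass through untouched.)
  """
  if artifact.type_urn != common_urns.artifact_types.URL.urn:
    return artifact
  url_payload = beam_runner_api_pb2.ArtifactUrlPayload.FromString(
      artifact.type_payload)
  fd, local_path = tempfile.mkstemp(dir=temp_dir)
  # Plain urllib here for illustration only; a real implementation would
  # likely use apache_beam.io.filesystems to support non-HTTP URLs.
  with os.fdopen(fd, 'wb') as out:
    with urllib.request.urlopen(url_payload.url) as response:
      out.write(response.read())
  # Re-describe the artifact as a local FILE so the Dataflow stager can
  # handle it with its existing FILE-only code path.
  return beam_runner_api_pb2.ArtifactInformation(
      type_urn=common_urns.artifact_types.FILE.urn,
      type_payload=beam_runner_api_pb2.ArtifactFilePayload(
          path=local_path,
          sha256=url_payload.sha256).SerializeToString(),
      role_urn=artifact.role_urn,
      role_payload=artifact.role_payload)

The idea is just that URL artifacts get materialized to a temp file at
submission time and then flow through the rest of staging exactly like FILE
artifacts do today, so the job stays self-contained in GCS once it launches.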
