> but it was always the intent to have Dataflow support artifact types
> beyond FILE.

Would you envision Dataflow re-staging the artifacts to GCS, or just
passing them through?  In our case I'd prefer the former, since it keeps a
job self-contained in GCS and not dependent on external resources when it
launches.
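
For reference, the artifacts in question come back from the expansion
service as URL-type ArtifactInformation protos, roughly like this (an
illustration based on my reading of beam_runner_api.proto and common_urns;
the blobstore URL is made up and the field names may not be exact):

```python
# Illustration only -- proto and URN names from my reading of
# beam_runner_api.proto / common_urns; treat them as approximate.
from apache_beam.portability import common_urns
from apache_beam.portability.api import beam_runner_api_pb2

url_artifact = beam_runner_api_pb2.ArtifactInformation(
    # beam:artifact:type:url:v1 rather than the usual FILE type.
    type_urn=common_urns.artifact_types.URL.urn,
    type_payload=beam_runner_api_pb2.ArtifactUrlPayload(
        url='https://blobstore.example.com/artifacts/my_deps.jar',
        sha256='<hex digest>').SerializeToString(),
    role_urn=common_urns.artifact_roles.STAGING_TO.urn,
    role_payload=beam_runner_api_pb2.ArtifactStagingToRolePayload(
        staged_name='my_deps.jar').SerializeToString())
```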

What's the appetite for a PR that changes the behavior of
artifact_service.maybe_store_artifact to download URL artifacts locally
first (or that changes the Dataflow stager to do so)?  From what I can tell
it's currently ~impossible to actually provide a URL resource through any
standard mechanism, so I doubt that would be a breaking change.
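
To make the proposal concrete, here's roughly what I have in mind (just a
sketch under the same assumptions as above; the helper name is hypothetical,
and the real change would live inside maybe_store_artifact or the stager,
whose exact signatures I'm glossing over):

```python
# Sketch of a hypothetical helper: download a URL-type artifact to a local
# temp file and return an equivalent FILE-type artifact, so downstream
# staging code (e.g. the Dataflow stager) only ever sees FILE artifacts.
import os
import tempfile
import urllib.request

from apache_beam.portability import common_urns
from apache_beam.portability.api import beam_runner_api_pb2


def url_artifact_to_local_file(artifact, temp_dir):
  if artifact.type_urn != common_urns.artifact_types.URL.urn:
    return artifact  # Pass non-URL artifacts through unchanged.
  url_payload = beam_runner_api_pb2.ArtifactUrlPayload.FromString(
      artifact.type_payload)
  fd, local_path = tempfile.mkstemp(dir=temp_dir)
  os.close(fd)
  # Fetch the artifact locally so the job is self-contained once staged.
  urllib.request.urlretrieve(url_payload.url, local_path)
  return beam_runner_api_pb2.ArtifactInformation(
      type_urn=common_urns.artifact_types.FILE.urn,
      type_payload=beam_runner_api_pb2.ArtifactFilePayload(
          path=local_path,
          sha256=url_payload.sha256).SerializeToString(),
      role_urn=artifact.role_urn,
      role_payload=artifact.role_payload)
```

The stager would then re-upload those local files to GCS the same way it
does today, which is exactly the "re-staging" behavior I'd prefer.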

Custom containers are certainly another option for us here too, but they
seem more complicated to manage; I was hoping to start simple and move on
to them later when we really need the complexity.

On Tue, Nov 2, 2021 at 12:12 PM Luke Cwik <[email protected]> wrote:

> What you are suggesting will work but it was always the intent to have
> Dataflow support artifact types beyond FILE.
>
> Another option would be to use custom containers with Dataflow and instead
> build the container embedding all (or most) of the artifacts that you need.
> This will help speed up how fast the workers start in Dataflow, in addition
> to not having to work around the fact that only FILE is supported.
>
> On Tue, Nov 2, 2021 at 8:03 AM Steve Niemitz <[email protected]> wrote:
>
>> We're working on running an "expansion service as a service" for xlang
>> transforms.  One of the things we'd really like is to serve the actual
>> required artifacts to the client (submitting the pipeline) from our
>> blobstore rather than streaming them through the artifact retrieval API
>> (GetArtifact).
>>
>> I have the expansion service returning the required artifact URLs when
>> expanding the transforms, but after that I've run into some confusion on
>> how it's supposed to work from there.
>>
>> Using the direct runner, the local artifact retrieval service [1] will
>> correctly download the URL resource. However, trying to run in Dataflow
>> will cause the stager to break, since it only supports FILE types [2].
>>
>> I _think_ what should happen is that in
>> artifact_service.maybe_store_artifact [3] we should download URL artifacts
>> locally and replace them with a temp file resource (similar to how
>> store_resource works).  I made these changes locally and was able to then
>> submit jobs both via the direct runner and dataflow.
>>
>> Any thoughts here?
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/artifact_service.py#L76
>>
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/internal/apiclient.py#L579
>>
>> [3]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/artifact_service.py#L284
>>
>
