We should stick with URN + payload + artifact metadata[1], where the only
mandatory artifact type that all SDKs and expansion services must understand
is "bytes". This allows us to add optional URNs for file://, http://, Maven,
PyPI, ... in the future. I would make the artifact staging service use the
same URN + payload mechanism so that artifacts are compatible across the
different services, and also make the artifact staging service queryable for
the list of artifact types it supports. Finally, we would need environments
to enumerate the artifact types that they support.
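
To make that concrete, here is a rough sketch (Python, with made-up URNs,
class, and field names rather than the actual proto definitions) of what a
URN + payload artifact could look like:

    # Hypothetical sketch only: the URNs and field names below are
    # illustrative, not the actual Beam proto definitions.
    from dataclasses import dataclass

    @dataclass
    class Artifact:
        type_urn: str         # identifies how to interpret the payload
        type_payload: bytes   # type-specific data, e.g. raw bytes or a URL
        metadata: dict        # e.g. staging name, permissions, hash

    # The single mandatory type every SDK / service must understand:
    jar = Artifact(
        type_urn="beam:artifact:type:bytes:v1",
        type_payload=open("my-transform.jar", "rb").read(),
        metadata={"name": "my-transform.jar"})

    # An optional type that could be added later:
    req = Artifact(
        type_urn="beam:artifact:type:file:v1",
        type_payload=b"file:///tmp/requirements.txt",
        metadata={"name": "requirements.txt"})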

Having everyone share the same "artifact" representation would be beneficial
since:
a) Python environments could install dependencies from a requirements.txt
file (something that the Google Cloud Dataflow Python docker container
allows for today)
b) It provides an extensible and versioned mechanism for SDKs,
environments, and artifact staging/retrieval services to support additional
artifact types
c) It allows expressing a canonical representation of an artifact, like a
Maven package, so a runner could merge environments that it deems
compatible (see the sketch after this list).
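
As an illustration of (c), the same dependency expressed canonically versus
as opaque bytes (the "maven" type URN and its payload encoding below are
invented for the example):

    # Hypothetical sketch: the "maven" type URN and payload format are
    # made up for illustration.
    canonical = {"type_urn": "beam:artifact:type:maven:v1",
                 "type_payload": b"org.apache.kafka:kafka-clients:2.2.0"}
    opaque = {"type_urn": "beam:artifact:type:bytes:v1",
              "type_payload": b"...contents of kafka-clients-2.2.0.jar..."}
    # A runner can see that two environments carrying the same canonical
    # coordinate share a dependency and consider merging them; two opaque
    # byte blobs of the same jar give it nothing to compare.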

The flow I could see is (roughly sketched in code after the list):
1) (optional) The SDK queries the artifact staging service for its supported
artifact types.
2) The SDK asks the expansion service to expand the transform, passing in
the list of artifact types that the SDK and the artifact staging service
support; the expansion service returns artifacts limited to those supported
types plus any supported by the environment.
3) The SDK converts any artifact types that the artifact staging service or
environment doesn't understand, e.g. pulls down Maven dependencies and
converts them to "bytes" artifacts.
4) The SDK sends the artifacts to the artifact staging service.
5) The artifact staging service converts any artifacts to types that the
environment understands.
6) The environment is started and gets the artifacts from the artifact
retrieval service.
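
Roughly, from the SDK's point of view (all method and helper names here are
hypothetical; this is just the ordering of the steps above, not an existing
API):

    # Hypothetical pseudocode for steps 1-6; none of these methods exist
    # under these names in Beam.
    supported = artifact_staging.get_supported_artifact_types()      # step 1

    expansion = expansion_service.expand(                            # step 2
        transform, artifact_types=supported)

    staged = []
    for artifact in expansion.artifacts:                             # step 3
        if artifact.type_urn in supported:
            staged.append(artifact)
        else:
            # e.g. resolve a Maven coordinate locally, fall back to bytes
            staged.append(convert_to_bytes(artifact))

    artifact_staging.put_artifacts(staged)                           # step 4

    # Steps 5 and 6 happen on the service/runner side: the staging service
    # converts artifacts to types the environment understands, and the
    # started environment fetches them from the artifact retrieval service.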

On Wed, Apr 24, 2019 at 4:44 AM Robert Bradshaw <rober...@google.com> wrote:

> On Wed, Apr 24, 2019 at 12:21 PM Maximilian Michels <m...@apache.org>
> wrote:
> >
> > Good idea to let the client expose an artifact staging service that the
> > ExpansionService could use to stage artifacts. This solves two problems:
> >
> > (1) The Expansion Service not being able to access the Job Server
> > artifact staging service
> > (2) The client not having access to the dependencies returned by the
> > Expansion Service
> >
> > The downside is that it adds an additional indirection. The alternative,
> > letting the client handle staging the artifacts returned by the Expansion
> > Service, is more transparent and easier to implement.
>
> The other downside is that it may not always be possible for the
> expansion service to connect to the artifact staging service (e.g.
> when constructing a pipeline locally against a remote expansion
> service).
>

Just to make sure: you're saying the expansion service would return all the
artifacts (bytes, URLs, ...) as part of the response, since the expansion
service wouldn't be able to connect to the SDK that is running locally
either?


> > Ideally, the Expansion Service won't return any dependencies because the
> > environment already contains the required dependencies. We could make it
> > a requirement for the expansion to be performed inside an environment.
> > Then we would already ensure during expansion time that the runtime
> > dependencies are available.
>
> Yes, it's cleanest if the expansion service provides an environment
> with all the dependencies provided. Interesting idea to make this a
> property of the expansion service itself.
>

I had thought this too, but an opaque docker container built on top of a
base Beam docker container would be very difficult for a runner to
introspect and check for compatibility, which is what would allow fusion
across PTransforms. I think artifacts need to be communicated in their
canonical representation.


> > > In this case, the runner would (as
> > > requested by its configuration) be free to merge environments it
> > > deemed compatible, including swapping out beam-java-X for
> > > beam-java-embedded if it considers itself compatible with the
> > > dependency list.
> >
> > Could you explain how that would work in practice?
>
> Say one has a pipeline with environments
>
> A: beam-java-sdk-2.12-docker
> B: beam-java-sdk-2.12-docker + dep1
> C: beam-java-sdk-2.12-docker + dep2
> D: beam-java-sdk-2.12-docker + dep3
>
> A runner could (conceivably) be intelligent enough to know that dep1
> and dep2 are indeed compatible, and run A, B, and C in a single
> beam-java-sdk-2.12-docker + dep1 + dep2 environment (with the
> corresponding fusion and lower overhead benefits). If a certain
> pipeline option is set, it might further note that dep1 and dep2 are
> compatible with its own workers, which are built against sdk-2.12, and
> choose to run these in an embedded + dep1 + dep2 environment.
>

We have been talking about the expansion service and cross-language
transforms a lot lately, but I believe they will initially come at the cost
of poor fusion of transforms: "merging" environments that are compatible is
a difficult problem, because it brings up many of the classic dependency
management issues (e.g. diamond dependency conflicts).
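
For example (package names and versions invented for illustration), even a
naive merge check runs straight into version resolution:

    # Hypothetical illustration: B and C from the example above look
    # mergeable, but their pinned transitive dependencies conflict.
    env_b = {"dep1": "1.0", "guava": "20.0"}   # dep1 requires guava <= 20
    env_c = {"dep2": "3.2", "guava": "25.1"}   # dep2 requires guava >= 25

    def mergeable(a, b):
        # Naive check: every shared dependency must pin the same version.
        shared = set(a) & set(b)
        return all(a[k] == b[k] for k in shared)

    print(mergeable(env_b, env_c))  # False: a diamond conflict on guava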

1:
https://github.com/apache/beam/blob/516cdb6401d9fb7adb004de472771fb1fb3a92af/model/job-management/src/main/proto/beam_artifact_api.proto#L56
