On Fri, Apr 26, 2019 at 4:14 PM Lukasz Cwik <lc...@google.com> wrote:

> We should stick with URN + payload + artifact metadata[1] where the only
> mandatory one that all SDKs and expansion services understand is the
> "bytes" artifact type. This allows us to add optional URNs for file://,
> http://, Maven, PyPI, ... in the future. I would make the artifact
> staging service use the same URN + payload mechanism to get compatibility
> of artifacts across the different services and also have the artifact
> staging service be able to be queried for the list of artifact types it
> supports. Finally, we would need to have environments enumerate the
> artifact types that they support.
>
> Having everyone share the same "artifact" representation would be
> beneficial since:
> a) Python environments could install dependencies from a requirements.txt
> file (something that the Google Cloud Dataflow Python docker container
> allows for today)
> b) It provides an extensible and versioned mechanism for SDKs,
> environments, and artifact staging/retrieval services to support additional
> artifact types
> c) It allows for expressing a canonical representation of an artifact
> (like a Maven package) so that a runner can merge environments it deems
> compatible.
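Concretely, the URN + payload representation described above could be sketched like this; the `Artifact` type and the URN string are illustrative stand-ins, not the actual Beam proto definitions:

```python
from dataclasses import dataclass

# Illustrative sketch only: the real representation lives in the Beam
# artifact API protos; names and URNs here are hypothetical.

@dataclass(frozen=True)
class Artifact:
    type_urn: str       # identifies the artifact type, e.g. the mandatory "bytes" type
    type_payload: bytes  # URN-specific payload: raw bytes, a file path, Maven coordinates, ...

# The one artifact type every SDK and expansion service must understand.
BYTES_URN = "beam:artifact:type:bytes:v1"

def to_bytes_artifact(data: bytes) -> Artifact:
    """Wrap raw bytes in the universally supported artifact type."""
    return Artifact(type_urn=BYTES_URN, type_payload=data)

jar = to_bytes_artifact(b"\xca\xfe\xba\xbe...")  # e.g. the contents of a jar
print(jar.type_urn)
```

Optional types (file://, Maven, PyPI, ...) would then just be additional URNs with their own payload formats, negotiated between SDK, staging service, and environment.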
>
> The flow I could see is:
> 1) (optional) query artifact staging service for supported artifact types
> 2) SDK request expansion service to expand transform passing in a list of
> artifact types the SDK and artifact staging service support, the expansion
> service returns a list of artifact types limited to those supported types +
> any supported by the environment
> 3) SDK converts any artifact types that the artifact staging service or
> environment doesn't understand, e.g. pulls down Maven dependencies and
> converts them to "bytes" artifacts
> 4) SDK sends artifacts to artifact staging service
> 5) Artifact staging service converts any artifacts to types that the
> environment understands
> 6) Environment is started and gets artifacts from the artifact retrieval
> service.
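The client-side part of the flow above (steps 1-4) could be sketched as follows; the service objects are trivial in-memory stand-ins for the real RPC stubs, and the URNs are illustrative:

```python
# Hypothetical sketch of steps 1-4; steps 5-6 happen on the service side.

BYTES = "beam:artifact:type:bytes:v1"   # the mandatory type
MAVEN = "beam:artifact:type:maven:v1"   # an illustrative optional type

class StagingService:
    """In-memory stand-in for the artifact staging service."""
    def __init__(self):
        self.staged = []
    def supported_artifact_types(self):
        return {BYTES}                  # this staging service only knows "bytes"
    def stage(self, artifacts):
        self.staged.extend(artifacts)

def convert_to_bytes(artifact):
    # Stand-in for e.g. resolving Maven coordinates and downloading the jar.
    _urn, payload = artifact
    return (BYTES, b"<resolved " + payload + b">")

def stage_expansion_artifacts(expansion_artifacts, staging):
    supported = staging.supported_artifact_types()   # step 1: query supported types
    converted = []
    for artifact in expansion_artifacts:             # returned by expansion (step 2)
        urn, _ = artifact
        if urn not in supported:                     # step 3: convert unknown types
            artifact = convert_to_bytes(artifact)
        converted.append(artifact)
    staging.stage(converted)                         # step 4: send to staging service
    return converted

staging = StagingService()
out = stage_expansion_artifacts([(MAVEN, b"org.apache.beam:sdk:2.12.0")], staging)
print(out[0][0])  # the Maven artifact was converted to the "bytes" type
```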
>

This is a very interesting proposal. I would add:
(5.5) artifact staging service resolves conflicts/duplicates for artifacts
needed by different transforms of the same pipeline
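For the duplicate half of (5.5), one naive approach the staging service could take is keying artifacts by content digest, so identical payloads staged by different transforms collapse into one stored artifact (everything here is a hypothetical sketch, not existing Beam code):

```python
import hashlib

# Sketch of deduplication by content hash; conflict detection (same name,
# different bytes) would need additional policy on top of this.

def dedupe_artifacts(artifacts):
    """artifacts: iterable of (name, payload) pairs; returns unique payloads."""
    seen = {}
    for name, payload in artifacts:
        digest = hashlib.sha256(payload).hexdigest()
        # First writer wins; identical payloads staged under different
        # transforms collapse to a single stored artifact.
        seen.setdefault(digest, (name, payload))
    return list(seen.values())

deduped = dedupe_artifacts([
    ("xform1/dep.jar", b"same-bytes"),
    ("xform2/dep.jar", b"same-bytes"),
    ("xform2/other.jar", b"other-bytes"),
])
print(len(deduped))  # 2
```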

BTW what are the next steps here ? Heejong or Max, will one of you be able
to come up with a detailed proposal around this ?

In the meantime I suggest we add temporary pipeline options for staging
Java dependencies from Python (and vice versa) to unblock development and
testing of the rest of the cross-language transforms stack. For example,
https://github.com/apache/beam/pull/8340

Thanks,
Cham


>
> On Wed, Apr 24, 2019 at 4:44 AM Robert Bradshaw <rober...@google.com>
> wrote:
>
>> On Wed, Apr 24, 2019 at 12:21 PM Maximilian Michels <m...@apache.org>
>> wrote:
>> >
>> > Good idea to let the client expose an artifact staging service that the
>> > ExpansionService could use to stage artifacts. This solves two problems:
>> >
>> > (1) The Expansion Service not being able to access the Job Server
>> > artifact staging service
>> > (2) The client not having access to the dependencies returned by the
>> > Expansion Server
>> >
>> > The downside is that it adds an additional indirection. The alternative
>> > to let the client handle staging the artifacts returned by the Expansion
>> > Server is more transparent and easier to implement.
>>
>> The other downside is that it may not always be possible for the
>> expansion service to connect to the artifact staging service (e.g.
>> when constructing a pipeline locally against a remote expansion
>> service).
>>
>
> Just to make sure: you're saying the expansion service would return all the
> artifacts (bytes, URLs, ...) as part of the response, since the expansion
> service wouldn't be able to connect to the SDK that is running locally
> either?
>
>
>> > Ideally, the Expansion Service won't return any dependencies because the
>> > environment already contains the required dependencies. We could make it
>> > a requirement for the expansion to be performed inside an environment.
>> > Then we would already ensure during expansion time that the runtime
>> > dependencies are available.
>>
>> Yes, it's cleanest if the expansion service provides an environment that
>> already contains the dependencies, rather than returning them separately.
>> Interesting idea to make this a property of the expansion service itself.
>>
>
> I had thought this too, but an opaque docker container that was built on
> top of a base Beam docker container would be very difficult for a runner to
> introspect and check whether it's compatible enough to allow for fusion
> across PTransforms. I think artifacts need to be communicated in their
> canonical representation.
>
>
>> > > In this case, the runner would (as
>> > > requested by its configuration) be free to merge environments it
>> > > deemed compatible, including swapping out beam-java-X for
>> > > beam-java-embedded if it considers itself compatible with the
>> > > dependency list.
>> >
>> > Could you explain how that would work in practice?
>>
>> Say one has a pipeline with environments
>>
>> A: beam-java-sdk-2.12-docker
>> B: beam-java-sdk-2.12-docker + dep1
>> C: beam-java-sdk-2.12-docker + dep2
>> D: beam-java-sdk-2.12-docker + dep3
>>
>> A runner could (conceivably) be intelligent enough to know that dep1
>> and dep2 are indeed compatible, and run A, B, and C in a single
>> beam-java-sdk-2.12-docker + dep1 + dep2 environment (with the
>> corresponding fusion and lower overhead benefits). If a certain
>> pipeline option is set, it might further note that dep1 and dep2 are
>> compatible with its own workers, which are built against sdk-2.12, and
>> choose to run these in an embedded + dep1 + dep2 environment.
>>
>
> We have been talking about the expansion service and cross-language
> transforms a lot lately, but I believe it will initially come at the cost
> of poor fusion of transforms: "merging" environments that are compatible is
> a difficult problem because it raises many classic dependency management
> issues (e.g. diamond dependencies).
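To make the merging Robert sketches concrete, here is a deliberately naive version where "compatible" just means no conflicting versions of the same dependency; as noted above, real compatibility checking is much harder, and everything here (including the `name:version` dependency format) is hypothetical:

```python
# Naive sketch of environment merging; a real runner would need far
# stronger compatibility checks than "no two versions of one package".

def compatible(deps_a, deps_b):
    """Deps are strings like "dep1:1.0"; conflict = same name, different version."""
    names_a = {d.split(":")[0]: d for d in deps_a}
    names_b = {d.split(":")[0]: d for d in deps_b}
    return all(names_a[n] == names_b[n] for n in names_a.keys() & names_b.keys())

def merge_environments(envs):
    """envs: {env_id: dep-set}; returns [(member_ids, union_of_deps), ...]."""
    merged = []
    for env_id, deps in envs.items():
        for group in merged:
            if compatible(group[1], deps):
                group[0].append(env_id)   # greedily join the first compatible group
                group[1] |= deps
                break
        else:
            merged.append([[env_id], set(deps)])
    return merged

envs = {
    "A": set(),             # beam-java-sdk-2.12-docker
    "B": {"dep1:1.0"},      # + dep1
    "C": {"dep2:1.0"},      # + dep2, compatible with B
    "D": {"dep1:2.0"},      # conflicts with B's dep1:1.0
}
groups = merge_environments(envs)
print(len(groups))  # 2: {A, B, C} fused together, D kept separate
```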
>
> 1:
> https://github.com/apache/beam/blob/516cdb6401d9fb7adb004de472771fb1fb3a92af/model/job-management/src/main/proto/beam_artifact_api.proto#L56
>
