On Fri, Apr 26, 2019 at 4:14 PM Lukasz Cwik <lc...@google.com> wrote:
> We should stick with URN + payload + artifact metadata[1], where the only
> mandatory one that all SDKs and expansion services understand is the
> "bytes" artifact type. This allows us to add optional URNs for file://,
> http://, Maven, PyPi, ... in the future. I would make the artifact staging
> service use the same URN + payload mechanism to get compatibility of
> artifacts across the different services, and also have the artifact
> staging service be able to be queried for the list of artifact types it
> supports. Finally, we would need to have environments enumerate the
> artifact types that they support.
>
> Having everyone share the same "artifact" representation would be
> beneficial since:
> a) Python environments could install dependencies from a requirements.txt
> file (something that the Google Cloud Dataflow Python docker container
> allows for today)
> b) It provides an extensible and versioned mechanism for SDKs,
> environments, and artifact staging/retrieval services to support
> additional artifact types
> c) It allows for expressing a canonical representation of an artifact,
> like a Maven package, so a runner could merge environments that the
> runner deems compatible.
>
> The flow I could see is:
> 1) (optional) SDK queries the artifact staging service for supported
> artifact types
> 2) SDK requests the expansion service to expand a transform, passing in a
> list of artifact types the SDK and artifact staging service support; the
> expansion service returns a list of artifacts limited to those supported
> types + any supported by the environment
> 3) SDK converts any artifact types that the artifact staging service or
> environment doesn't understand, e.g. pulls down Maven dependencies and
> converts them to "bytes" artifacts
> 4) SDK sends artifacts to the artifact staging service
> 5) Artifact staging service converts any artifacts to types that the
> environment understands
> 6) Environment is started and gets artifacts from the artifact retrieval
> service.
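[Reader's note: the URN + payload artifact model and the conversion fallback in
step 3 of the flow above could be sketched roughly as below. This is a
hypothetical illustration, not actual Beam code: the `Artifact` class, the
`prepare_for_staging` helper, and the non-"bytes" URNs are invented for the
sketch; only the idea of a mandatory "bytes" type as the universal fallback
comes from the proposal.]

```python
from dataclasses import dataclass

@dataclass
class Artifact:
    type_urn: str        # e.g. "beam:artifact:type:bytes:v1" (illustrative URN)
    type_payload: bytes  # type-specific data: raw bytes, a URL, Maven coords, ...

# The one artifact type every SDK / staging service must understand.
BYTES_URN = "beam:artifact:type:bytes:v1"

def to_bytes_artifact(artifact, fetchers):
    """Resolve a typed artifact (e.g. a Maven coordinate) into raw bytes."""
    fetch = fetchers.get(artifact.type_urn)
    if fetch is None:
        raise ValueError("no fetcher registered for %s" % artifact.type_urn)
    return Artifact(BYTES_URN, fetch(artifact.type_payload))

def prepare_for_staging(artifacts, staging_supported, env_supported, fetchers):
    """Step 3 of the flow: keep artifact types that both the staging service
    and the environment understand, and convert everything else to "bytes"."""
    supported = staging_supported & env_supported
    return [a if a.type_urn in supported else to_bytes_artifact(a, fetchers)
            for a in artifacts]
```

The key property is that "bytes" always works, so an SDK can degrade any richer
artifact type to it at the cost of losing the canonical representation.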
> This is a very interesting proposal. I would add:
>
> (5.5) artifact staging service resolves conflicts/duplicates for
> artifacts needed by different transforms of the same pipeline
>
> BTW what are the next steps here? Heejong or Max, will one of you be able
> to come up with a detailed proposal around this?
>
> In the meantime I suggest we add temporary pipeline options for staging
> Java dependencies from Python (and vice versa) to unblock development and
> testing of the rest of the cross-language transforms stack. For example:
> https://github.com/apache/beam/pull/8340
>
> Thanks,
> Cham
>
> On Wed, Apr 24, 2019 at 4:44 AM Robert Bradshaw <rober...@google.com>
> wrote:
>
>> On Wed, Apr 24, 2019 at 12:21 PM Maximilian Michels <m...@apache.org>
>> wrote:
>> >
>> > Good idea to let the client expose an artifact staging service that
>> > the ExpansionService could use to stage artifacts. This solves two
>> > problems:
>> >
>> > (1) The Expansion Service not being able to access the Job Server
>> > artifact staging service
>> > (2) The client not having access to the dependencies returned by the
>> > Expansion Server
>> >
>> > The downside is that it adds an additional indirection. The
>> > alternative of letting the client handle staging the artifacts
>> > returned by the Expansion Server is more transparent and easier to
>> > implement.
>>
>> The other downside is that it may not always be possible for the
>> expansion service to connect to the artifact staging service (e.g. when
>> constructing a pipeline locally against a remote expansion service).
>
> Just to make sure: you're saying the expansion service would return all
> the artifacts (bytes, urls, ...) as part of the response, since the
> expansion service wouldn't be able to connect to the SDK that is running
> locally either.
>
>> > Ideally, the Expansion Service won't return any dependencies because
>> > the environment already contains the required dependencies.
>> > We could make it a requirement for the expansion to be performed
>> > inside an environment. Then we would already ensure during expansion
>> > time that the runtime dependencies are available.
>>
>> Yes, it's cleanest if the expansion service provides an environment with
>> all the dependencies provided. Interesting idea to make this a property
>> of the expansion service itself.
>
> I had thought this too, but an opaque docker container that was built on
> top of a base Beam docker container would be very difficult for a runner
> to introspect and check to see if it's compatible to allow for fusion
> across PTransforms. I think artifacts need to be communicated in their
> canonical representation.
>
>> > > In this case, the runner would (as requested by its configuration)
>> > > be free to merge environments it deemed compatible, including
>> > > swapping out beam-java-X for beam-java-embedded if it considers
>> > > itself compatible with the dependency list.
>> >
>> > Could you explain how that would work in practice?
>>
>> Say one has a pipeline with environments
>>
>> A: beam-java-sdk-2.12-docker
>> B: beam-java-sdk-2.12-docker + dep1
>> C: beam-java-sdk-2.12-docker + dep2
>> D: beam-java-sdk-2.12-docker + dep3
>>
>> A runner could (conceivably) be intelligent enough to know that dep1 and
>> dep2 are indeed compatible, and run A, B, and C in a single
>> beam-java-sdk-2.12-docker + dep1 + dep2 environment (with the
>> corresponding fusion and lower overhead benefits). If a certain pipeline
>> option is set, it might further note that dep1 and dep2 are compatible
>> with its own workers, which are built against sdk-2.12, and choose to
>> run these in an embedded + dep1 + dep2 environment.
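[Reader's note: the environment-merging idea in Robert's A/B/C/D example could
be sketched as the greedy grouping below. This is purely illustrative and not
how any Beam runner is implemented; the `mergeable` compatibility rule (same
base image, no two pinned versions of one package) is an invented stand-in for
whatever real compatibility check a runner would use.]

```python
def mergeable(deps_a, deps_b):
    """Treat dependency sets as conflicting only if they pin different
    versions of the same package; deps are {name: version} dicts."""
    return all(deps_a.get(name, version) == version
               for name, version in deps_b.items())

def merge_environments(envs):
    """Greedily fold environments {name: (base_image, deps)} into groups
    that share a base image and have a compatible union of dependencies.
    Returns a list of (base_image, merged_deps, member_names) tuples."""
    groups = []
    for name, (base, deps) in envs.items():
        for group_base, group_deps, members in groups:
            if group_base == base and mergeable(group_deps, deps):
                group_deps.update(deps)   # union of the compatible deps
                members.append(name)
                break
        else:
            groups.append((base, dict(deps), [name]))
    return groups
```

Each resulting group corresponds to one container (or embedded environment)
that can host all of its member environments, enabling fusion across them.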
> We have been talking about the expansion service and cross-language
> transforms a lot lately, but I believe it will initially come at the cost
> of poor fusion of transforms, since "merging" environments that are
> compatible is a difficult problem: it brings up many of the dependency
> management issues (e.g. diamond dependency issues).
>
> 1:
> https://github.com/apache/beam/blob/516cdb6401d9fb7adb004de472771fb1fb3a92af/model/job-management/src/main/proto/beam_artifact_api.proto#L56
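[Reader's note: Cham's proposed step (5.5), having the artifact staging
service resolve duplicates across transforms of one pipeline, could be
sketched as content-based deduplication. This is a hypothetical illustration
only; `dedupe_artifacts` and its request shape are invented names, and a real
staging service would also need conflict resolution for artifacts that differ
in content but claim the same logical name.]

```python
import hashlib

def dedupe_artifacts(requests):
    """requests: iterable of (transform_name, type_urn, payload_bytes).
    Stage each distinct (URN, payload) pair once, keyed by content hash,
    and record which staged keys each transform needs."""
    staged = {}        # content key -> (type_urn, payload)
    by_transform = {}  # transform name -> [content keys]
    for transform, urn, payload in requests:
        key = hashlib.sha256(urn.encode() + b"\x00" + payload).hexdigest()
        staged.setdefault(key, (urn, payload))
        by_transform.setdefault(transform, []).append(key)
    return staged, by_transform
```

Two transforms requesting an identical artifact then share one staged copy
instead of uploading it twice.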