I've been out, so coming a bit late to the discussion, but here are my thoughts.

The expansion service absolutely needs to be able to provide the
dependencies for the transform(s) it expands. It seems the default,
foolproof way of doing this is via the environment, which can be a
docker image with all the required dependencies. Anything more than
this is an (arguably important, but possibly messy) optimization.
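
For concreteness, here is roughly what pinning such an environment
looks like from a Python pipeline today. This is a minimal sketch
using the portable --environment_type/--environment_config options;
the image name is just a placeholder, not a published Beam image:

    from apache_beam.options.pipeline_options import PipelineOptions

    # Run transforms in a docker environment whose image already
    # bundles every dependency the expanded transforms need.
    options = PipelineOptions([
        '--environment_type=DOCKER',
        '--environment_config=example.com/beam-java-with-deps:latest',
    ])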

The standard way to provide artifacts outside of the environment is
via the artifact staging service. Of course, the expansion service may
not have access to the (final) artifact staging service (due to
permissions, locality, or it may not even be started up yet), but the
SDK invoking the expansion service could offer an artifact staging
endpoint for the expansion service to publish artifacts to. However,
there are some difficulties here, in particular avoiding name
collisions among staged artifacts and assigning semantic meaning to
the artifacts (e.g. should jar files be automatically placed on the
classpath, and Python packages be recognized and installed at
startup?). The alternative is
going with a (type, pointer) scheme for naming dependencies; if we go
this route I think we should consider migrating all artifact staging
to this style. I am concerned that the "file" version will be less
than useful for what will become the most convenient expansion
services (namely, hosted and docker image). I am still at a loss,
however, as to how to solve the diamond dependency problem among
dependencies. Perhaps the information is there if one walks
Maven/PyPI/Go modules/... but do we expect every runner to know about
every packaging platform? This also wouldn't solve the issue if fat
jars are used as dependencies. The only safe thing to do here is to
force distinct dependency sets to live in different environments,
which could be too conservative.
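
To make the (type, pointer) idea concrete, here is a minimal sketch of
typed dependency descriptors (the names are hypothetical, not an
existing Beam API):

    from typing import NamedTuple

    class Dependency(NamedTuple):
        # Discriminator telling the consumer how to interpret the
        # payload, e.g. MAVEN, PYPI, GO_MODULE, FILE.
        type: str
        # Type-specific coordinates rather than an opaque blob.
        payload: dict

    # A Maven artifact carries enough metadata that a sufficiently
    # knowledgeable runner could walk its dependency tree...
    kafka = Dependency('MAVEN', {'groupId': 'org.apache.beam',
                                 'artifactId': 'beam-sdks-java-io-kafka',
                                 'version': '2.12.0'})

    # ...whereas a FILE dependency (e.g. a fat jar) is opaque, so the
    # diamond dependency problem described above cannot even be
    # detected, let alone solved.
    fat_jar = Dependency('FILE', {'url': 'file:///path/to/bundled.jar'})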

This all leads me to think that perhaps the environment itself should
be a docker image (often one of the "vanilla" beam-java-x.y ones) plus
a dependency list, rather than having the dependency/artifact list as
some kind of data off to the side. In this case, the runner would (as
requested by its configuration) be free to merge environments it
deemed compatible, including swapping out beam-java-X for
beam-java-embedded if it considers itself compatible with the
dependency list.
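
A minimal sketch of that shape (hypothetical names, not the actual
Environment proto):

    from typing import FrozenSet, NamedTuple

    class Environment(NamedTuple):
        base_image: str               # e.g. a "vanilla" beam-java-x.y image
        dependencies: FrozenSet[str]  # typed dependency descriptors

    def merge(a: Environment, b: Environment) -> Environment:
        # A runner, if so configured, could fuse two environments into
        # one. This naive check only fuses identical base images and
        # unions the dependency lists; real conflict detection would
        # need per-packaging-platform version reasoning.
        if a.base_image != b.base_image:
            raise ValueError('incompatible base images')
        return Environment(a.base_image, a.dependencies | b.dependencies)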

I agree with Thomas that we'll want to make expansion services, and
the transforms they offer, more discoverable. The whole lifecycle of
expansion services has yet to be fully fleshed out, and may influence
some of these decisions.

As for adding --jar_package to the Python SDK, this seems really
specific to calling Java-from-Python (would we have O(n^2) such
options?) as well as out of place for a Python user to specify. I
would really hope we can figure out a more generic solution. If we
need this option in the meantime, let's at least make it clear
(probably in the name) that it's temporary.

On Tue, Apr 23, 2019 at 1:08 AM Thomas Weise <t...@apache.org> wrote:
>
> One more suggestion:
>
> It would be nice to be able to select the environment for the external 
> transforms. For example, I would like to be able to use EMBEDDED for Flink. 
> That's implicit for sources which are runner native unbounded read 
> translations, but it should also be possible for writes. That would then be 
> similar to how pipelines are packaged and run with the "legacy" runner.
>
> Thomas
>
>
> On Mon, Apr 22, 2019 at 1:18 PM Ankur Goenka <goe...@google.com> wrote:
>>
>> Great discussion!
>> I have a few points around the structure of the proto, but that is less 
>> important as it can evolve.
>> However, I think that artifact compatibility is another important aspect to 
>> look at.
>> Example: TransformA uses Guava >1.6,<1.7; TransformB uses >1.8,<1.9; and 
>> TransformC uses >1.6,<1.8. As the SDK provides the environment for each 
>> transform, it cannot simply say EnvironmentJava for both TransformA and 
>> TransformB, as their dependencies are not compatible.
>> We should have separate environments associated with TransformA and 
>> TransformB in this case.
>>
>> To support this case, we need 2 things.
>> 1: Granular metadata about the dependency including type.
>> 2: Complete list of the transforms to be expanded.
>>
>> Elaboration:
>> The compatibility check can be done in a crude way if we provide all the 
>> metadata about the dependencies to the expansion service.
>> Also, the expansion service should expand all the applicable transforms in a 
>> single call so that it knows about incompatibilities and can create separate 
>> environments for these transforms. So in the above example, the expansion 
>> service will associate EnvA with TransformA and TransformC, and EnvB with 
>> TransformB. This will of course require changes to the expansion service 
>> proto, but giving all the information to the expansion service will make it 
>> support more cases and make it a bit more future-proof.
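>>
>> For illustration, a crude sketch of that grouping (hypothetical code,
>> not the actual expansion service):
>>
>>     # Version range (exclusive bounds) of the shared dependency, per
>>     # transform, as in the Guava example above.
>>     ranges = {'TransformA': (1.6, 1.7),
>>               'TransformB': (1.8, 1.9),
>>               'TransformC': (1.6, 1.8)}
>>
>>     def compatible(r1, r2):
>>         # Two open ranges overlap iff each lower bound lies below the
>>         # other range's upper bound.
>>         return r1[0] < r2[1] and r2[0] < r1[1]
>>
>>     # Greedily group transforms whose ranges pairwise overlap; each
>>     # group shares one environment. Here this yields
>>     # {TransformA, TransformC} -> EnvA and {TransformB} -> EnvB.
>>     groups = []
>>     for name, r in ranges.items():
>>         for g in groups:
>>             if all(compatible(r, ranges[m]) for m in g):
>>                 g.add(name)
>>                 break
>>         else:
>>             groups.append({name})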
>>
>>
>> On Mon, Apr 22, 2019 at 10:16 AM Maximilian Michels <m...@apache.org> wrote:
>>>
>>> Thanks for the summary Cham. All makes sense. I agree that we want to
>>> keep the option to manually specify artifacts.
>>>
>>> > There are a few unanswered questions though.
>>> > (1) In what form will a transform author specify dependencies ? For 
>>> > example, URL to a Maven repo, URL to a local file, blob ?
>>>
>>> Going forward, we probably want to support multiple ways. For now, we
>>> could stick with a URL-based approach with support for different file
>>> systems. In the future a list of packages to retrieve from Maven/PyPi
>>> would be useful.
>>>
>> We can ask the user for (type, metadata). For Maven it can be something like 
>> (MAVEN, {groupId: com.google.guava, artifactId: guava, version: 19}) or 
>> (FILE, file://myfile).
>> To begin with, we can support only a few types, like FILE, and add more 
>> types in the future.
>>>
>>> > (2) How will dependencies be included in the expansion response proto ? 
>>> > String (URL), bytes (blob) ?
>>>
>>> I'd go for a list of Protobuf strings first but the format would have to
>>> evolve for other dependency types.
>>>
>> Here also (type, payload) should suffice. We can have an interpreter for 
>> each type to translate the payload.
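>>
>> A sketch of what such per-type interpreters could look like
>> (hypothetical, for illustration):
>>
>>     # One interpreter per artifact type; unknown types are rejected
>>     # rather than guessed at, so new types can be added incrementally.
>>     def stage_maven(payload):
>>         ...  # resolve groupId/artifactId/version against a repository
>>
>>     def stage_file(payload):
>>         ...  # copy the named file to the artifact staging service
>>
>>     INTERPRETERS = {'MAVEN': stage_maven, 'FILE': stage_file}
>>
>>     def stage(artifact_type, payload):
>>         if artifact_type not in INTERPRETERS:
>>             raise ValueError('no interpreter for %r' % artifact_type)
>>         return INTERPRETERS[artifact_type](payload)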
>>>
>>> > (3) How will we manage/share transitive dependencies required at runtime ?
>>>
>>> I'd say transitive dependencies have to be included in the list. In case
>>> of fat jars, they are reduced to a single jar.
>>
>> Makes sense.
>>>
>>>
>>> > (4) How will dependencies be staged for various runner/SDK combinations ? 
>>> > (for example, portable runner/Flink, Dataflow runner)
>>>
>>> Staging should be no different than it is now, i.e. go through Beam's
>>> artifact staging service. As long as the protocol is stable, there could
>>> also be different implementations.
>>
>> Makes sense.
>>>
>>>
>>> -Max
>>>
>>> On 20.04.19 03:08, Chamikara Jayalath wrote:
>>> > OK, sounds like this is a good path forward then.
>>> >
>>> > * When starting up the expansion service, the user (that starts up the
>>> > service) provides the dependencies necessary to expand transforms. We will
>>> > later add support for adding new transforms to an already running
>>> > expansion service.
>>> > * As a part of transform configuration, the transform author has the option
>>> > of providing a list of dependencies that will be needed to run the
>>> > transform.
>>> > * These dependencies will be sent back to the pipeline SDK as a part of the
>>> > expansion response, and the pipeline SDK will stage these resources.
>>> > * Pipeline authors have the option of specifying the dependencies using a
>>> > pipeline option. (for example, https://github.com/apache/beam/pull/8340)
>>> >
>>> > I think the last option is important to (1) make existing transforms easily
>>> > available for cross-language usage without additional configuration, and (2)
>>> > allow pipeline authors to override dependency versions specified in
>>> > the transform configuration (for example, to apply security patches)
>>> > without updating the expansion service.
>>> >
>>> > There are a few unanswered questions though.
>>> > (1) In what form will a transform author specify dependencies ? For
>>> > example, URL to a Maven repo, URL to a local file, blob ?
>>> > (2) How will dependencies be included in the expansion response proto ?
>>> > String (URL), bytes (blob) ?
>>> > (3) How will we manage/share transitive dependencies required at runtime ?
>>> > (4) How will dependencies be staged for various runner/SDK combinations
>>> > ? (for example, portable runner/Flink, Dataflow runner)
>>> >
>>> > Thanks,
>>> > Cham
>>> >
>>> > On Fri, Apr 19, 2019 at 4:49 AM Maximilian Michels <m...@apache.org> wrote:
>>> >
>>> >     Thank you for your replies.
>>> >
>>> >     I did not suggest that the Expansion Service does the staging, but it
>>> >     would return the required resources (e.g. jars) for the external
>>> >     transform's runtime environment. The client then has to take care of
>>> >     staging the resources.
>>> >
>>> >     The Expansion Service itself also needs resources to do the
>>> >     expansion. I
>>> >     assumed those to be provided when starting the expansion service. I
>>> >     consider it less important but we could also provide a way to add new
>>> >     transforms to the Expansion Service after startup.
>>> >
>>> >     Good point on Docker vs externally provided environments. For the PR
>>> >     [1]
>>> >     it will suffice then to add Kafka to the container dependencies. The
>>> >     "--jar_package" pipeline option is ok for now but I'd like to see work
>>> >     towards staging resources for external transforms via information
>>> >     returned by the Expansion Service. That avoids users having to take
>>> >     care
>>> >     of including the correct jars in their pipeline options.
>>> >
>>> >     These issues are related and we could discuss them in separate 
>>> > threads:
>>> >
>>> >     * Auto-discovery of Expansion Service and its external transforms
>>> >     * Credentials required during expansion / runtime
>>> >
>>> >     Thanks,
>>> >     Max
>>> >
>>> >     [1] https://github.com/apache/beam/pull/8322
>>> >
>>> >     On 19.04.19 07:35, Thomas Weise wrote:
>>> >      > Good discussion :)
>>> >      >
>>> >      > Initially the expansion service was considered a user
>>> >     responsibility,
>>> >      > but I think that isn't necessarily the case. I can also see the
>>> >      > expansion service provided as part of the infrastructure and the
>>> >     user
>>> >      > not wanting to deal with it at all. For example, users may want
>>> >     to write
>>> >      > Python transforms and use external IOs, without being concerned how
>>> >      > these IOs are provided. Under such scenario it would be good if:
>>> >      >
>>> >      > * Expansion service(s) can be auto-discovered via the job service
>>> >     endpoint
>>> >      > * Available external transforms can be discovered via the expansion
>>> >      > service(s)
>>> >      > * Dependencies for external transforms are part of the metadata
>>> >     returned
>>> >      > by expansion service
>>> >      >
>>> >      > Dependencies could then be staged either by the SDK client or the
>>> >      > expansion service. The expansion service could provide the
>>> >     locations to
>>> >      > stage to the SDK, it would still be transparent to the user.
>>> >      >
>>> >      > I also agree with Luke regarding the environments. Docker is the
>>> >     choice
>>> >      > for generic deployment. Other environments are used when the
>>> >     flexibility
>>> >      > offered by Docker isn't needed (or gets into the way). Then the
>>> >      > dependencies are provided in different ways. Whether these are
>>> >     Python
>>> >      > packages or jar files, by opting out of Docker the decision is
>>> >     made to
>>> >      > manage dependencies externally.
>>> >      >
>>> >      > Thomas
>>> >      >
>>> >      >
>>> >      > On Thu, Apr 18, 2019 at 6:01 PM Chamikara Jayalath
>>> >     <chamik...@google.com> wrote:
>>> >      >
>>> >      >
>>> >      >
>>> >      >     On Thu, Apr 18, 2019 at 5:21 PM Chamikara Jayalath
>>> >      >     <chamik...@google.com> wrote:
>>> >      >
>>> >      >         Thanks for raising the concern about credentials Ankur, I
>>> >     agree
>>> >      >         that this is a significant issue.
>>> >      >
>>> >      >         On Thu, Apr 18, 2019 at 4:23 PM Lukasz Cwik
>>> >     <lc...@google.com> wrote:
>>> >      >
>>> >      >             I can understand the concern about credentials, the 
>>> > same
>>> >      >             access concern will exist for several cross language
>>> >      >             transforms (mostly IOs) since some will need access to
>>> >      >             credentials to read/write to an external service.
>>> >      >
>>> >      >             Are there any ideas on how credential propagation
>>> >     could work
>>> >      >             to these IOs?
>>> >      >
>>> >      >
>>> >      >         There are some cases where existing IO transforms need
>>> >      >         credentials to access remote resources, for example, size
>>> >      >         estimation, validation, etc. But usually these are
>>> >     optional (or
>>> >      >         transform can be configured to not perform these 
>>> > functions).
>>> >      >
>>> >      >
>>> >      >     To clarify, I'm only talking about transform expansion here.
>>> >     Many IO
>>> >      >     transforms need read/write access to remote services at run
>>> >     time. So
>>> >      >     probably we need to figure out a way to propagate these
>>> >     credentials
>>> >      >     anyways.
>>> >      >
>>> >      >             Can we use these mechanisms for staging?
>>> >      >
>>> >      >
>>> >      >         I think we'll have to find a way to do one of (1) propagate
>>> >      >         credentials to other SDKs (2) allow users to configure SDK
>>> >      >         containers to have necessary credentials (3) do the 
>>> > artifact
>>> >      >         staging from the pipeline SDK environment which already 
>>> > has
>>> >      >         credentials. I prefer (1) or (2) since this will give a
>>> >      >         transform the same feature set whether used directly (in the 
>>> > same
>>> >      >         SDK language as the transform) or remotely but it might
>>> >     be hard
>>> >      >         to do this for an arbitrary service that a transform might
>>> >      >         connect to considering the number of ways users can 
>>> > configure
>>> >      >         credentials (after an offline discussion with Ankur).
>>> >      >
>>> >      >
>>> >      >             On Thu, Apr 18, 2019 at 3:47 PM Ankur Goenka
>>> >      >             <goe...@google.com> wrote:
>>> >      >
>>> >      >                 I agree that the Expansion service knows about the
>>> >      >                 artifacts required for a cross language transform 
>>> > and
>>> >      >                 having a prepackage folder/Zip for transforms
>>> >     based on
>>> >      >                 language makes sense.
>>> >      >
>>> >      >                 One thing to note here is that the expansion service
>>> >     might
>>> >      >                 not have the same access privileges as the pipeline
>>> >      >                 author and hence might not be able to stage
>>> >     artifacts by
>>> >      >                 itself.
>>> >      >                 Keeping this in mind I am leaning towards making
>>> >      >                 Expansion service provide all the required
>>> >     artifacts to
>>> >      >                 the user and let the user stage the artifacts as
>>> >     regular
>>> >      >                 artifacts.
>>> >      >                 At this time, we only have Beam File System based
>>> >      >                 artifact staging which uses local credentials to
>>> >     access
>>> >      >                 different file systems. Even a docker based 
>>> > expansion
>>> >      >                 service running on local machine might not have
>>> >     the same
>>> >      >                 access privileges.
>>> >      >
>>> >      >                 In brief, this is what I am leaning toward:
>>> >      >                 user calls for pipeline submission -> expansion
>>> >      >                 service provides cross-language transforms and
>>> >      >                 relevant artifacts to the SDK -> SDK submits the
>>> >      >                 pipeline to the job server and stages user and
>>> >      >                 cross-language artifacts to the artifact staging
>>> >      >                 service.
>>> >      >
>>> >      >
>>> >      >                 On Thu, Apr 18, 2019 at 2:33 PM Chamikara Jayalath
>>> >      >                 <chamik...@google.com> wrote:
>>> >      >
>>> >      >
>>> >      >
>>> >      >                     On Thu, Apr 18, 2019 at 2:12 PM Lukasz Cwik
>>> >      >                     <lc...@google.com> wrote:
>>> >      >
>>> >      >                         Note that Max did ask whether making the
>>> >      >                         expansion service do the staging made
>>> >     sense, and
>>> >      >                         my first line was agreeing with that
>>> >     direction
>>> >      >                         and expanding on how it could be done (so
>>> >     this
>>> >      >                         is really Max's idea or from whomever he
>>> >     got the
>>> >      >                         idea from).
>>> >      >
>>> >      >
>>> >      >                     +1 to what Max said then :)
>>> >      >
>>> >      >
>>> >      >                         I believe a lot of the value of the 
>>> > expansion
>>> >      >                         service is not having users need to be
>>> >     aware of
>>> >      >                         all the SDK specific dependencies when
>>> >     they are
>>> >      >                         trying to create a pipeline, only the
>>> >     "user" who
>>> >      >                         is launching the expansion service may
>>> >     need to.
>>> >      >                         And in that case we can have a prepackaged
>>> >      >                         expansion service application that does 
>>> > what
>>> >      >                         most users would want (e.g. expansion
>>> >     service as
>>> >      >                         a docker container, a single bundled jar,
>>> >     ...).
>>> >      >                         We (the Apache Beam community) could
>>> >     choose to
>>> >      >                         host a default implementation of the
>>> >     expansion
>>> >      >                         service as well.
>>> >      >
>>> >      >
>>> >      >                     I'm not against this. But I think this is a
>>> >      >                     secondary more advanced use-case. For a Beam
>>> >     users
>>> >      >                     that needs to use a Java transform that they
>>> >     already
>>> >      >                     have in a Python pipeline, we should provide
>>> >     a way
>>> >      >                     to allow starting up an expansion service (with
>>> >      >                     dependencies needed for that) and running a
>>> >     pipeline
>>> >      >                     that uses this external Java transform (with
>>> >      >                     dependencies that are needed at runtime).
>>> >     Probably,
>>> >      >                     it'll be enough to allow providing all
>>> >     dependencies
>>> >      >                     when starting up the expansion service and 
>>> > allow
>>> >      >                     expansion service to do the staging of jars as
>>> >      >                     well. I don't see a need to include the list
>>> >     of jars
>>> >      >                     in the ExpansionResponse sent to the Python 
>>> > SDK.
>>> >      >
>>> >      >
>>> >      >                         On Thu, Apr 18, 2019 at 2:02 PM Chamikara
>>> >      >                         Jayalath <chamik...@google.com> wrote:
>>> >      >
>>> >      >                             I think there are two kind of
>>> >     dependencies
>>> >      >                             we have to consider.
>>> >      >
>>> >      >                             (1) Dependencies that are needed to
>>> >     expand
>>> >      >                             the transform.
>>> >      >
>>> >      >                             These have to be provided when we
>>> >     start the
>>> >      >                             expansion service so that available
>>> >     external
>>> >      >                             transforms are correctly registered
>>> >     with the
>>> >      >                             expansion service.
>>> >      >
>>> >      >                             (2) Dependencies that are not needed at
>>> >      >                             expansion but may be needed at runtime.
>>> >      >
>>> >      >                             I think in both cases, users have to
>>> >     provide
>>> >      >                             these dependencies either when 
>>> > expansion
>>> >      >                             service is started or when a pipeline 
>>> > is
>>> >      >                             being executed.
>>> >      >
>>> >      >                             Max, I'm not sure why the expansion
>>> >     service will
>>> >      >                             need to provide dependencies to the 
>>> > user
>>> >      >                             since the user will already be aware of
>>> >     these.
>>> >      >                             Are you talking about an expansion 
>>> > service
>>> >      >                             that is readily available that will
>>> >     be used
>>> >      >                             by many Beam users ? I think such a
>>> >      >                             (possibly long running) service will
>>> >     have to
>>> >      >                             maintain a repository of transforms and
>>> >      >                             should have a mechanism for registering 
>>> > new
>>> >      >                             transforms and discovering already
>>> >      >                             registered transforms etc. I think
>>> >     there's
>>> >      >                             more design work needed to make 
>>> > transform
>>> >      >                             expansion service support such 
>>> > use-cases.
>>> >      >                             Currently, I think allowing pipeline
>>> >     author
>>> >      >                             to provide the jars when starting the
>>> >      >                             expansion service and when executing 
>>> > the
>>> >      >                             pipeline will be adequate.
>>> >      >
>>> >      >                             Regarding the entity that will
>>> >     perform the
>>> >      >                             staging, I like Luke's idea of allowing
>>> >      >                             expansion service to do the staging
>>> >     (of jars
>>> >      >                             provided by the user). Notion of
>>> >     artifacts
>>> >      >                             and how they are extracted/represented 
>>> > is
>>> >      >                             SDK dependent. So if the pipeline SDK
>>> >     tries
>>> >      >                             to do this we have to add n x (n -1)
>>> >      >                             configurations (for n SDKs).
>>> >      >
>>> >      >                             - Cham
>>> >      >
>>> >      >                             On Thu, Apr 18, 2019 at 11:45 AM
>>> >     Lukasz Cwik <lc...@google.com>
>>> >      >                             wrote:
>>> >      >
>>> >      >                                 We can expose the artifact staging
>>> >      >                                 endpoint and artifact token to
>>> >     allow the
>>> >      >                                 expansion service to upload any
>>> >      >                                 resources its environment may
>>> >     need. For
>>> >      >                                 example, the expansion service
>>> >     for the
>>> >      >                                 Beam Java SDK would be able to
>>> >     upload jars.
>>> >      >
>>> >      >                                 In the "docker" environment, the
>>> >     Apache
>>> >      >                                 Beam Java SDK harness container 
>>> > would
>>> >      >                                 fetch the relevant artifacts for
>>> >     itself
>>> >      >                                 and be able to execute the 
>>> > pipeline.
>>> >      >                                 (Note that a docker environment 
>>> > could
>>> >      >                                 skip all this artifact staging if 
>>> > the
>>> >      >                                 docker environment contained all
>>> >      >                                 necessary artifacts).
>>> >      >
>>> >      >                                 For the existing "external"
>>> >     environment,
>>> >      >                                 it should already come with all the
>>> >      >                                 resources prepackaged wherever
>>> >      >                                 "external" points to. The "process"
>>> >      >                                 based environment could choose to 
>>> > use
>>> >      >                                 the artifact staging service to 
>>> > fetch
>>> >      >                                 those resources associated with its
>>> >      >                                 process or it could follow the same
>>> >      >                                 pattern that "external" would do 
>>> > and
>>> >      >                                 already contain all the prepackaged
>>> >      >                                 resources. Note that both
>>> >     "external" and
>>> >      >                                 "process" will require the
>>> >     instance of
>>> >      >                                 the expansion service to be
>>> >     specialized
>>> >      >                                 for those environments which is
>>> >     why the
>>> >      >                                 default for the expansion
>>> >     service
>>> >      >                                 should be the "docker" environment.
>>> >      >
>>> >      >                                 Note that a major reason for
>>> >     going with
>>> >      >                                 docker containers as the 
>>> > environment
>>> >      >                                 that all runners should support
>>> >     is that
>>> >      >                                 containers provides a solution
>>> >     for this
>>> >      >                                 exact issue. Both the "process" and
>>> >      >                                 "external" environments are
>>> >     explicitly
>>> >      >                                 limiting and expanding their
>>> >      >                                 capabilities will quickly have us
>>> >      >                                 building something like a docker
>>> >      >                                 container because we'll quickly 
>>> > find
>>> >      >                                 ourselves solving the same
>>> >     problems that
>>> >      >                                 docker containers provide 
>>> > (resources,
>>> >      >                                 file layout, permissions, ...)
>>> >      >
>>> >      >
>>> >      >
>>> >      >
>>> >      >                                 On Thu, Apr 18, 2019 at 11:21 AM
>>> >      >                                 Maximilian Michels
>>> >     <m...@apache.org> wrote:
>>> >      >
>>> >      >                                     Hi everyone,
>>> >      >
>>> >      >                                     We have previously merged 
>>> > support
>>> >      >                                     for configuring transforms 
>>> > across
>>> >      >                                     languages. Please see Cham's
>>> >     summary
>>> >      >                                     on the discussion [1]. There is
>>> >      >                                     also a design document [2].
>>> >      >
>>> >      >                                     Subsequently, we've added
>>> >     wrappers
>>> >      >                                     for cross-language transforms
>>> >     to the
>>> >      >                                     Python SDK, i.e.
>>> >     GenerateSequence,
>>> >      >                                     ReadFromKafka, and there is a
>>> >     pending
>>> >      >                                     PR [1] for WriteToKafka. All
>>> >     of them
>>> >      >                                     utilize Java transforms via
>>> >      >                                     cross-language configuration.
>>> >      >
>>> >      >                                     That is all pretty exciting :)
>>> >      >
>>> >      >                                     We still have some issues to
>>> >     solve,
>>> >      >                                     one being how to stage
>>> >     artifacts from
>>> >      >                                     a foreign environment. When
>>> >     we run
>>> >      >                                     external transforms which are
>>> >     part of
>>> >      >                                     Beam's core (e.g.
>>> >     GenerateSequence),
>>> >      >                                     we have them available in the 
>>> > SDK
>>> >      >                                     Harness. However, when they
>>> >     are not
>>> >      >                                     (e.g. KafkaIO) we need to
>>> >     stage the
>>> >      >                                     necessary files.
>>> >      >
>>> >      >                                     For my PR [3] I've naively 
>>> > added
>>> >      >                                     ":beam-sdks-java-io-kafka" to
>>> >     the SDK
>>> >      >                                     Harness which caused dependency
>>> >      >                                     problems [4]. Those could be
>>> >     resolved
>>> >      >                                     but the bigger question is how 
>>> > to
>>> >      >                                     stage artifacts for external
>>> >      >                                     transforms programmatically?
>>> >      >
>>> >      >                                     Heejong has solved this by
>>> >     adding a
>>> >      >                                     "--jar_package" option to the
>>> >     Python
>>> >      >                                     SDK to stage Java files [5].
>>> >     I think
>>> >      >                                     that is a better solution than
>>> >      >                                     adding required Jars to the SDK
>>> >      >                                     Harness directly, but it is
>>> >     not very
>>> >      >                                     convenient for users.
>>> >      >
>>> >      >                                     I've discussed this today with
>>> >      >                                     Thomas and we both figured
>>> >     that the
>>> >      >                                     expansion service needs to
>>> >     provide a
>>> >      >                                     list of required Jars with the
>>> >      >                                     ExpansionResponse it
>>> >     provides. It's
>>> >      >                                     not entirely clear, how we
>>> >     determine
>>> >      >                                     which artifacts are necessary
>>> >     for an
>>> >      >                                     external transform. We could 
>>> > just
>>> >      >                                     dump the entire classpath
>>> >     like we do
>>> >      >                                     in PipelineResources for Java
>>> >      >                                     pipelines. This provides many
>>> >      >                                     unneeded classes but would 
>>> > work.
>>> >      >
>>> >      >                                     Do you think it makes sense
>>> >     for the
>>> >      >                                     expansion service to provide 
>>> > the
>>> >      >                                     artifacts? Perhaps you have a
>>> >     better
>>> >      >                                     idea how to resolve the staging
>>> >      >                                     problem in cross-language
>>> >     pipelines?
>>> >      >
>>> >      >                                     Thanks,
>>> >      >                                     Max
>>> >      >
>>> >      >                                     [1]
>>> >      >
>>> >     
>>> > https://lists.apache.org/thread.html/b99ba8527422e31ec7bb7ad9dc3a6583551ea392ebdc5527b5fb4a67@%3Cdev.beam.apache.org%3E
>>> >      >
>>> >      >                                     [2]
>>> >      > https://s.apache.org/beam-cross-language-io
>>> >      >
>>> >      >                                     [3]
>>> >      > https://github.com/apache/beam/pull/8322#discussion_r276336748
>>> >      >
>>> >      >                                     [4] Dependency graph for
>>> >      >                                     beam-runners-direct-java:
>>> >      >
>>> >      >                                     beam-runners-direct-java ->
>>> >      >                                     sdks-java-harness ->
>>> >      >                                     beam-sdks-java-io-kafka
>>> >      >                                     -> beam-runners-direct-java
>>> >     ... the
>>> >      >                                     cycle continues
>>> >      >
>>> >      >                                     Beam-runners-direct-java
>>> >     depends on
>>> >      >                                     sdks-java-harness due
>>> >      >                                     to the infamous Universal Local
>>> >      >                                     Runner.
>>> >     Beam-sdks-java-io-kafka depends
>>> >      >                                     on beam-runners-direct-java for
>>> >      >                                     running tests.
>>> >      >
>>> >      >                                     [5]
>>> >      > https://github.com/apache/beam/pull/8340
>>> >      >
>>> >
