On Tue, Apr 23, 2019 at 2:07 AM Robert Bradshaw <rober...@google.com> wrote:

> I've been out, so coming a bit late to the discussion, but here's my
> thoughts.
>
> The expansion service absolutely needs to be able to provide the
> dependencies for the transform(s) it expands. It seems the default,
> foolproof way of doing this is via the environment, which can be a
> docker image with all the required dependencies. Anything more than this
> is an (arguably important, but possibly messy) optimization.
>
> The standard way to provide artifacts outside of the environment is
> via the artifact staging service. Of course, the expansion service may
> not have access to the (final) artifact staging service (due to
> permissions, locality, or it may not even be started up yet) but the
> SDK invoking the expansion service could offer an artifact staging
> environment for the SDK to publish artifacts to. However, there are
> some difficulties here, in particular avoiding name collisions with
> staged artifacts, assigning semantic meaning to the artifacts (e.g.
> should jar files get automatically placed in the classpath, or Python
> packages recognized and installed at startup). The alternative is
> going with a (type, pointer) scheme for naming dependencies; if we go
> this route I think we should consider migrating all artifact staging
> to this style. I am concerned that the "file" version will be less
> than useful for what will become the most convenient expansion
> services (namely, hosted and docker image). I am still at a loss,
> however, as to how to solve the diamond dependency problem among
> dependencies--perhaps the information is there if one walks
> maven/pypi/go modules/... but do we expect every runner to know about
> every packaging platform? This also wouldn't solve the issue if fat
> jars are used as dependencies. The only safe thing to do here is to
> force distinct dependency sets to live in different environments,
> which could be too conservative.
>
> This all leads me to think that perhaps the environment itself should
> be docker image (often one of "vanilla" beam-java-x.y ones) +
> dependency list, rather than have the dependency/artifact list as some
> kind of data off to the side. In this case, the runner would (as
> requested by its configuration) be free to merge environments it
> deemed compatible, including swapping out beam-java-X for
> beam-java-embedded if it considers itself compatible with the
> dependency list.


I like this idea of building multiple docker environments on top of a
bare-minimum SDK harness container and allowing runners to pick a suitable
one based on a dependency list.
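
To make that concrete, here is a minimal Python sketch of an environment
that carries its own dependency list, plus the kind of merge check a
runner could apply; the class and field names are hypothetical, not the
actual Beam model protos:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass(frozen=True)
    class Dependency:
        type: str     # e.g. "MAVEN", "PYPI", "FILE"
        payload: str  # e.g. "com.google.guava:guava:19.0"

    @dataclass
    class Environment:
        base_image: str  # e.g. a vanilla "beam-java-x.y" container
        dependencies: List[Dependency] = field(default_factory=list)

    def maybe_merge(a: Environment, b: Environment) -> Optional[Environment]:
        """Merge two environments only if the runner deems them compatible.

        Treating any same-package/different-payload pair as a conflict is
        deliberately conservative; real version logic is package-manager
        specific (and the key below only makes sense for MAVEN payloads).
        """
        if a.base_image != b.base_image:
            return None
        merged = {}
        for d in a.dependencies + b.dependencies:
            key = (d.type, d.payload.rsplit(":", 1)[0])  # naively drop version
            if key in merged and merged[key] != d:
                return None  # conflicting versions: keep environments apart
            merged[key] = d
        return Environment(a.base_image, list(merged.values()))

A runner that considers itself compatible with the dependency list could
likewise swap base_image for an embedded environment before merging.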


>
> I agree with Thomas that we'll want to make expansion services, and
> the transforms they offer, more discoverable. The whole life cycle
> of expansion services is something that has yet to be fully fleshed
> out, and may influence some of these decisions.
>
> As for adding --jar_package to the Python SDK, this seems really
> specific to calling java-from-python (would we have O(n^2) such
> options?) as well as out-of-place for a Python user to specify. I
> would really hope we can figure out a more generic solution. If we
> need this option in the meantime, let's at least make it clear
> (probably in the name) that it's temporary.
>

Good points. I second that we need a more generic solution than a
Python-to-Java-specific option. I think that instead of renaming it, we
can make --jar_package a secondary option under --experiments in the
meantime. WDYT?
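
For example, something along these lines (a sketch only; the
"jar_packages" experiment name and value format are hypothetical and would
need to be agreed on):

    # Passing jar packages through the generic --experiments flag instead
    # of a dedicated --jar_package option.
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        '--runner=PortableRunner',
        '--experiments=jar_packages=/path/to/beam-sdks-java-io-kafka.jar',
    ])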


> On Tue, Apr 23, 2019 at 1:08 AM Thomas Weise <t...@apache.org> wrote:
> >
> > One more suggestion:
> >
> > It would be nice to be able to select the environment for the external
> transforms. For example, I would like to be able to use EMBEDDED for Flink.
> That's implicit for sources that are runner-native unbounded-read
> translations, but it should also be possible for writes. That would then be
> similar to how pipelines are packaged and run with the "legacy" runner.
> >
> > Thomas
> >
> >
> > On Mon, Apr 22, 2019 at 1:18 PM Ankur Goenka <goe...@google.com> wrote:
> >>
> >> Great discussion!
> >> I have a few points about the structure of the proto, but that is less
> important as it can evolve.
> >> However, I think that artifact compatibility is another important
> aspect to look at.
> >> Example: TransformA uses Guava >1.6,<1.7; TransformB uses >1.8,<1.9;
> and TransformC uses >1.6,<1.8. Since the SDK provides the environment
> for each transform, it cannot simply say EnvironmentJava for both
> TransformA and TransformB, as their dependencies are not compatible.
> >> We should have a separate environment associated with TransformA and
> TransformB in this case.
> >>
> >> To support this case, we need 2 things.
> >> 1: Granular metadata about each dependency, including its type.
> >> 2: The complete list of the transforms to be expanded.
> >>
> >> Elaboration:
> >> The compatibility check can be done in a crude way if we provide all
> the metadata about the dependencies to the expansion service.
> >> Also, the expansion service should expand all the applicable transforms
> in a single call so that it knows about incompatibilities and can create
> separate environments for these transforms. So in the above example, the
> expansion service will associate EnvA with TransformA, EnvB with
> TransformB, and EnvA with TransformC. This will of course require changes
> to the expansion service proto, but giving all the information to the
> expansion service will make it support more cases and make it a bit more
> future-proof.
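
To illustrate the kind of crude check this enables once the expansion
service sees all transforms and dependency metadata in one call, here is a
rough Python sketch of a greedy assignment for the Guava example above
(the version-range representation and names are made up for illustration):

    # Assign each transform to an environment, reusing an environment only
    # while all version ranges still intersect. Greedy first-fit; a real
    # resolver would also need to handle transitive dependencies.
    deps = {
        "TransformA": ("guava", (1.6, 1.7)),
        "TransformB": ("guava", (1.8, 1.9)),
        "TransformC": ("guava", (1.6, 1.8)),
    }

    def intersect(r1, r2):
        lo, hi = max(r1[0], r2[0]), min(r1[1], r2[1])
        return (lo, hi) if lo < hi else None

    envs = []  # each entry: ({package: range}, [member transforms])
    for name, (pkg, rng) in deps.items():
        for constraints, members in envs:
            narrowed = intersect(constraints.get(pkg, rng), rng)
            if narrowed is not None:
                constraints[pkg] = narrowed
                members.append(name)
                break
        else:
            envs.append(({pkg: rng}, [name]))

    # envs -> [({'guava': (1.6, 1.7)}, ['TransformA', 'TransformC']),
    #          ({'guava': (1.8, 1.9)}, ['TransformB'])]

which reproduces the EnvA = {TransformA, TransformC}, EnvB = {TransformB}
split described above.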
> >>
> >>
> >> On Mon, Apr 22, 2019 at 10:16 AM Maximilian Michels <m...@apache.org>
> wrote:
> >>>
> >>> Thanks for the summary, Cham. It all makes sense. I agree that we want to
> >>> keep the option to manually specify artifacts.
> >>>
> >>> > There are a few unanswered questions though.
> >>> > (1) In what form will a transform author specify dependencies ? For
> example, URL to a Maven repo, URL to a local file, blob ?
> >>>
> >>> Going forward, we probably want to support multiple ways. For now, we
> >>> could stick with a URL-based approach with support for different file
> >>> systems. In the future, a list of packages to retrieve from Maven/PyPi
> >>> would be useful.
> >>>
> >> We can ask the user for (type, metadata) pairs. For Maven it can be
> something like (MAVEN, {groupId: com.google.guava, artifactId: guava,
> version: 19}) or (FILE, file://myfile).
> >> To begin with, we can support only a few types, like FILE, and add
> more types in the future.
> >>>
> >>> > (2) How will dependencies be included in the expansion response
> proto ? String (URL), bytes (blob) ?
> >>>
> >>> I'd go for a list of Protobuf strings first, but the format would
> >>> have to evolve for other dependency types.
> >>>
> >> Here also (type, payload) should suffice. We can have an interpreter
> for each type to translate the payload.
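
A rough sketch of what such per-type interpreters could look like on the
SDK side (hypothetical helper names; only FILE and MAVEN shown):

    from urllib.parse import urlparse

    def interpret_file(payload):
        # A FILE payload is a URL such as file:///path/to/artifact.jar;
        # the resulting local path would be handed to artifact staging.
        return urlparse(payload).path

    def interpret_maven(payload):
        # A MAVEN payload like {"groupId": ..., "artifactId": ...,
        # "version": ...}; resolving it against a repository is left out.
        return "{groupId}:{artifactId}:{version}".format(**payload)

    INTERPRETERS = {"FILE": interpret_file, "MAVEN": interpret_maven}

    def interpret(artifact_type, payload):
        if artifact_type not in INTERPRETERS:
            raise ValueError("No interpreter for type %r" % artifact_type)
        return INTERPRETERS[artifact_type](payload)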
> >>>
> >>> > (3) How will we manage/share transitive dependencies required at
> runtime ?
> >>>
> >>> I'd say transitive dependencies have to be included in the list. In
> >>> the case of fat jars, they are reduced to a single jar.
> >>
> >> Makes sense.
> >>>
> >>>
> >>> > (4) How will dependencies be staged for various runner/SDK
> combinations ? (for example, portable runner/Flink, Dataflow runner)
> >>>
> >>> Staging should be no different than it is now, i.e. go through Beam's
> >>> artifact staging service. As long as the protocol is stable, there
> could
> >>> also be different implementations.
> >>
> >> Makes sense.
> >>>
> >>>
> >>> -Max
> >>>
> >>> On 20.04.19 03:08, Chamikara Jayalath wrote:
> >>> > OK, sounds like this is a good path forward then.
> >>> >
> >>> > * When starting up the expansion service, the user (who starts up
> >>> > the service) provides the dependencies necessary to expand
> >>> > transforms. We will later add support for adding new transforms to
> >>> > an already running expansion service.
> >>> > * As a part of the transform configuration, the transform author has
> >>> > the option of providing a list of dependencies that will be needed
> >>> > to run the transform.
> >>> > * These dependencies will be sent back to the pipeline SDK as a part
> >>> > of the expansion response, and the pipeline SDK will stage these
> >>> > resources.
> >>> > * The pipeline author has the option of specifying the dependencies
> >>> > using a pipeline option (for example,
> >>> > https://github.com/apache/beam/pull/8340).
> >>> >
> >>> > I think the last option is important to (1) make existing transforms
> >>> > easily available for cross-language usage without additional
> >>> > configuration, and (2) allow pipeline authors to override dependency
> >>> > versions specified in the transform configuration (for example, to
> >>> > apply security patches) without updating the expansion service.
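
To make the shape concrete, an expansion response that carries dependency
metadata might look roughly like this (an illustrative Python value, not
the actual ExpansionResponse proto):

    # Hypothetical response shape; field names are illustrative only.
    expansion_response = {
        "transform": "<expanded transform subgraph>",
        "requirements": [
            {"type": "MAVEN",
             "payload": {"groupId": "org.apache.kafka",
                         "artifactId": "kafka-clients",
                         "version": "2.0.0"}},
            {"type": "FILE",
             "payload": "file:///opt/jars/beam-sdks-java-io-kafka.jar"},
        ],
    }
    # The pipeline SDK would stage each requirement via the artifact
    # staging service before job submission.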
> >>> >
> >>> > There are a few unanswered questions though.
> >>> > (1) In what form will a transform author specify dependencies ? For
> >>> > example, URL to a Maven repo, URL to a local file, blob ?
> >>> > (2) How will dependencies be included in the expansion response
> proto ?
> >>> > String (URL), bytes (blob) ?
> >>> > (3) How will we manage/share transitive dependencies required at
> runtime ?
> >>> > (4) How will dependencies be staged for various runner/SDK
> combinations
> >>> > ? (for example, portable runner/Flink, Dataflow runner)
> >>> >
> >>> > Thanks,
> >>> > Cham
> >>> >
> >>> > On Fri, Apr 19, 2019 at 4:49 AM Maximilian Michels <m...@apache.org> wrote:
> >>> >
> >>> >     Thank you for your replies.
> >>> >
> >>> >     I did not suggest that the Expansion Service does the staging,
> but it
> >>> >     would return the required resources (e.g. jars) for the external
> >>> >     transform's runtime environment. The client then has to take
> care of
> >>> >     staging the resources.
> >>> >
> >>> >     The Expansion Service itself also needs resources to do the
> >>> >     expansion. I
> >>> >     assumed those to be provided when starting the expansion
> service. I
> >>> >     consider it less important but we could also provide a way to
> add new
> >>> >     transforms to the Expansion Service after startup.
> >>> >
> >>> >     Good point on Docker vs externally provided environments. For
> the PR
> >>> >     [1]
> >>> >     it will suffice then to add Kafka to the container dependencies.
> The
> >>> >     "--jar_package" pipeline option is ok for now but I'd like to
> see work
> >>> >     towards staging resources for external transforms via information
> >>> >     returned by the Expansion Service. That avoids users having to
> take
> >>> >     care
> >>> >     of including the correct jars in their pipeline options.
> >>> >
> >>> >     These issues are related and we could discuss them in separate
> threads:
> >>> >
> >>> >     * Auto-discovery of Expansion Service and its external transforms
> >>> >     * Credentials required during expansion / runtime
> >>> >
> >>> >     Thanks,
> >>> >     Max
> >>> >
> >>> >     [1] https://github.com/apache/beam/pull/8322
> >>> >
> >>> >     On 19.04.19 07:35, Thomas Weise wrote:
> >>> >      > Good discussion :)
> >>> >      >
> >>> >      > Initially the expansion service was considered a user
> >>> >     responsibility,
> >>> >      > but I think that isn't necessarily the case. I can also see
> the
> >>> >      > expansion service provided as part of the infrastructure and
> the
> >>> >     user
> >>> >      > not wanting to deal with it at all. For example, users may
> want
> >>> >     to write
> >>> >      > Python transforms and use external IOs, without being
> concerned how
> >>> >      > these IOs are provided. Under such a scenario, it would be
> >>> >      > good if:
> >>> >      >
> >>> >      > * Expansion service(s) can be auto-discovered via the job
> service
> >>> >     endpoint
> >>> >      > * Available external transforms can be discovered via the
> expansion
> >>> >      > service(s)
> >>> >      > * Dependencies for external transforms are part of the
> metadata
> >>> >     returned
> >>> >      > by expansion service
> >>> >      >
> >>> >      > Dependencies could then be staged either by the SDK client or
> the
> >>> >      > expansion service. The expansion service could provide the
> >>> >      > staging locations to the SDK; it would still be transparent to
> >>> >      > the user.
> >>> >      >
> >>> >      > I also agree with Luke regarding the environments. Docker is
> the
> >>> >     choice
> >>> >      > for generic deployment. Other environments are used when the
> >>> >     flexibility
> >>> >      > offered by Docker isn't needed (or gets into the way). Then
> the
> >>> >      > dependencies are provided in different ways. Whether these are
> >>> >     Python
> >>> >      > packages or jar files, by opting out of Docker the decision is
> >>> >     made to
> >>> >      > manage dependencies externally.
> >>> >      >
> >>> >      > Thomas
> >>> >      >
> >>> >      >
> >>> >      > On Thu, Apr 18, 2019 at 6:01 PM Chamikara Jayalath
> >>> >      > <chamik...@google.com> wrote:
> >>> >      >
> >>> >      >
> >>> >      >
> >>> >      >     On Thu, Apr 18, 2019 at 5:21 PM Chamikara Jayalath
> >>> >      >     <chamik...@google.com> wrote:
> >>> >      >
> >>> >      >         Thanks for raising the concern about credentials
> Ankur, I
> >>> >     agree
> >>> >      >         that this is a significant issue.
> >>> >      >
> >>> >      >         On Thu, Apr 18, 2019 at 4:23 PM Lukasz Cwik
> >>> >      >         <lc...@google.com> wrote:
> >>> >      >
> >>> >      >             I can understand the concern about credentials,
> the same
> >>> >      >             access concern will exist for several cross
> language
> >>> >      >             transforms (mostly IOs) since some will need
> access to
> >>> >      >             credentials to read/write to an external service.
> >>> >      >
> >>> >      >             Are there any ideas on how credential propagation
> >>> >     could work
> >>> >      >             to these IOs?
> >>> >      >
> >>> >      >
> >>> >      >         There are some cases where existing IO transforms need
> >>> >      >         credentials to access remote resources, for example,
> size
> >>> >      >         estimation, validation, etc. But usually these are
> >>> >      >         optional (or the transform can be configured not to
> >>> >      >         perform these functions).
> >>> >      >
> >>> >      >
> >>> >      >     To clarify, I'm only talking about transform expansion
> here.
> >>> >     Many IO
> >>> >      >     transforms need read/write access to remote services at
> run
> >>> >     time. So
> >>> >      >     probably we need to figure out a way to propagate these
> >>> >     credentials
> >>> >      >     anyway.
> >>> >      >
> >>> >      >             Can we use these mechanisms for staging?
> >>> >      >
> >>> >      >
> >>> >      >         I think we'll have to find a way to do one of: (1)
> >>> >      >         propagate credentials to other SDKs, (2) allow users to
> >>> >      >         configure SDK containers to have the necessary
> >>> >      >         credentials, or (3) do the artifact staging from the
> >>> >      >         pipeline SDK environment, which already has credentials.
> >>> >      >         I prefer (1) or (2) since this will give a transform the
> >>> >      >         same feature set whether used directly (in the same SDK
> >>> >      >         language as the transform) or remotely, but it might be
> >>> >      >         hard to do this for an arbitrary service that a transform
> >>> >      >         might connect to, considering the number of ways users
> >>> >      >         can configure credentials (after an offline discussion
> >>> >      >         with Ankur).
> >>> >      >
> >>> >      >
> >>> >      >             On Thu, Apr 18, 2019 at 3:47 PM Ankur Goenka
> >>> >      >             <goe...@google.com> wrote:
> >>> >      >
> >>> >      >                 I agree that the expansion service knows about
> >>> >      >                 the artifacts required for a cross-language
> >>> >      >                 transform, and having a prepackaged folder/zip
> >>> >      >                 for transforms based on language makes sense.
> >>> >      >
> >>> >      >                 One thing to note here is that the expansion
> >>> >      >                 service might not have the same access privileges
> >>> >      >                 as the pipeline author and hence might not be
> >>> >      >                 able to stage artifacts by itself.
> >>> >      >                 Keeping this in mind, I am leaning towards making
> >>> >      >                 the expansion service provide all the required
> >>> >      >                 artifacts to the user and letting the user stage
> >>> >      >                 them as regular artifacts.
> >>> >      >                 At this time, we only have Beam FileSystem-based
> >>> >      >                 artifact staging, which uses local credentials to
> >>> >      >                 access different file systems. Even a docker-based
> >>> >      >                 expansion service running on a local machine might
> >>> >      >                 not have the same access privileges.
> >>> >      >
> >>> >      >                 In brief, this is what I am leaning toward:
> >>> >      >                 user calls for pipeline submission -> expansion
> >>> >      >                 service provides cross-language transforms and
> >>> >      >                 relevant artifacts to the SDK -> SDK submits the
> >>> >      >                 pipeline to the job server and stages user and
> >>> >      >                 cross-language artifacts to the artifact staging
> >>> >      >                 service.
> >>> >      >
> >>> >      >
> >>> >      >                 On Thu, Apr 18, 2019 at 2:33 PM Chamikara
> >>> >      >                 Jayalath <chamik...@google.com> wrote:
> >>> >      >
> >>> >      >
> >>> >      >
> >>> >      >                     On Thu, Apr 18, 2019 at 2:12 PM Lukasz Cwik
> >>> >      >                     <lc...@google.com> wrote:
> >>> >      >
> >>> >      >                         Note that Max did ask whether making
> the
> >>> >      >                         expansion service do the staging made
> >>> >     sense, and
> >>> >      >                         my first line was agreeing with that
> >>> >     direction
> >>> >      >                         and expanding on how it could be done
> (so
> >>> >     this
> >>> >      >                         is really Max's idea or from whomever
> he
> >>> >     got the
> >>> >      >                         idea from).
> >>> >      >
> >>> >      >
> >>> >      >                     +1 to what Max said then :)
> >>> >      >
> >>> >      >
> >>> >      >                         I believe a lot of the value of the
> expansion
> >>> >      >                         service is not having users need to be
> >>> >     aware of
> >>> >      >                         all the SDK specific dependencies when
> >>> >     they are
> >>> >      >                         trying to create a pipeline, only the
> >>> >     "user" who
> >>> >      >                         is launching the expansion service may
> >>> >     need to.
> >>> >      >                         And in that case we can have a
> prepackaged
> >>> >      >                         expansion service application that
> does what
> >>> >      >                         most users would want (e.g. expansion
> >>> >     service as
> >>> >      >                         a docker container, a single bundled
> jar,
> >>> >     ...).
> >>> >      >                         We (the Apache Beam community) could
> >>> >     choose to
> >>> >      >                         host a default implementation of the
> >>> >     expansion
> >>> >      >                         service as well.
> >>> >      >
> >>> >      >
> >>> >      >                     I'm not against this. But I think this is a
> >>> >      >                     secondary, more advanced use-case. For a Beam
> >>> >      >                     user who needs to use a Java transform that
> >>> >      >                     they already have in a Python pipeline, we
> >>> >      >                     should provide a way to allow starting up an
> >>> >      >                     expansion service (with the dependencies
> >>> >      >                     needed for that) and running a pipeline that
> >>> >      >                     uses this external Java transform (with the
> >>> >      >                     dependencies that are needed at runtime).
> >>> >      >                     Probably, it'll be enough to allow providing
> >>> >      >                     all dependencies when starting up the
> >>> >      >                     expansion service and allow the expansion
> >>> >      >                     service to do the staging of jars as well. I
> >>> >      >                     don't see a need to include the list of jars
> >>> >      >                     in the ExpansionResponse sent to the Python
> >>> >      >                     SDK.
> >>> >      >
> >>> >      >
> >>> >      >                         On Thu, Apr 18, 2019 at 2:02 PM Chamikara
> >>> >      >                         Jayalath <chamik...@google.com> wrote:
> >>> >      >
> >>> >      >                             I think there are two kind of
> >>> >     dependencies
> >>> >      >                             we have to consider.
> >>> >      >
> >>> >      >                             (1) Dependencies that are needed
> to
> >>> >     expand
> >>> >      >                             the transform.
> >>> >      >
> >>> >      >                             These have to be provided when we
> >>> >     start the
> >>> >      >                             expansion service so that
> available
> >>> >     external
> >>> >      >                             transforms are correctly
> registered
> >>> >     with the
> >>> >      >                             expansion service.
> >>> >      >
> >>> >      >                             (2) Dependencies that are not
> needed at
> >>> >      >                             expansion but may be needed at
> runtime.
> >>> >      >
> >>> >      >                             I think in both cases, users have
> to
> >>> >     provide
> >>> >      >                             these dependencies either when
> expansion
> >>> >      >                             service is started or when a
> pipeline is
> >>> >      >                             being executed.
> >>> >      >
> >>> >      >                             Max, I'm not sure why the expansion
> >>> >      >                             service will need to provide
> >>> >      >                             dependencies to the user, since the
> >>> >      >                             user will already be aware of these.
> >>> >      >                             Are you talking about an expansion
> >>> >      >                             service that is readily available and
> >>> >      >                             will be used by many Beam users? I
> >>> >      >                             think such a (possibly long-running)
> >>> >      >                             service will have to maintain a
> >>> >      >                             repository of transforms and should
> >>> >      >                             have a mechanism for registering new
> >>> >      >                             transforms, discovering already
> >>> >      >                             registered transforms, etc. I think
> >>> >      >                             there's more design work needed to
> >>> >      >                             make the transform expansion service
> >>> >      >                             support such use-cases. Currently, I
> >>> >      >                             think allowing the pipeline author to
> >>> >      >                             provide the jars when starting the
> >>> >      >                             expansion service and when executing
> >>> >      >                             the pipeline will be adequate.
> >>> >      >
> >>> >      >                             Regarding the entity that will
> >>> >      >                             perform the staging, I like Luke's
> >>> >      >                             idea of allowing the expansion service
> >>> >      >                             to do the staging (of jars provided by
> >>> >      >                             the user). The notion of artifacts and
> >>> >      >                             how they are extracted/represented is
> >>> >      >                             SDK-dependent. So if the pipeline SDK
> >>> >      >                             tries to do this, we have to add
> >>> >      >                             n x (n-1) configurations (for n SDKs).
> >>> >      >
> >>> >      >                             - Cham
> >>> >      >
> >>> >      >                             On Thu, Apr 18, 2019 at 11:45 AM
> >>> >      >                             Lukasz Cwik <lc...@google.com> wrote:
> >>> >      >
> >>> >      >                                 We can expose the artifact
> staging
> >>> >      >                                 endpoint and artifact token to
> >>> >     allow the
> >>> >      >                                 expansion service to upload
> any
> >>> >      >                                 resources its environment may
> >>> >     need. For
> >>> >      >                                 example, the expansion service
> >>> >     for the
> >>> >      >                                 Beam Java SDK would be able to
> >>> >     upload jars.
> >>> >      >
> >>> >      >                                 In the "docker" environment,
> the
> >>> >     Apache
> >>> >      >                                 Beam Java SDK harness
> container would
> >>> >      >                                 fetch the relevant artifacts
> for
> >>> >     itself
> >>> >      >                                 and be able to execute the
> pipeline.
> >>> >      >                                 (Note that a docker
> environment could
> >>> >      >                                 skip all this artifact
> staging if the
> >>> >      >                                 docker environment contained
> all
> >>> >      >                                 necessary artifacts).
> >>> >      >
> >>> >      >                                 For the existing "external"
> >>> >     environment,
> >>> >      >                                 it should already come with
> all the
> >>> >      >                                 resources prepackaged wherever
> >>> >      >                                 "external" points to. The
> "process"
> >>> >      >                                 based environment could
> choose to use
> >>> >      >                                 the artifact staging service
> to fetch
> >>> >      >                                 those resources associated
> with its
> >>> >      >                                 process or it could follow
> the same
> >>> >      >                                 pattern that "external" would
> do and
> >>> >      >                                 already contain all the
> prepackaged
> >>> >      >                                 resources. Note that both
> >>> >     "external" and
> >>> >      >                                 "process" will require the
> >>> >     instance of
> >>> >      >                                 the expansion service to be
> >>> >     specialized
> >>> >      >                                 for those environments, which is
> >>> >      >                                 why the default for the expansion
> >>> >      >                                 service should be the "docker"
> >>> >      >                                 environment.
> >>> >      >
> >>> >      >                                 Note that a major reason for
> >>> >     going with
> >>> >      >                                 docker containers as the
> environment
> >>> >      >                                 that all runners should
> support
> >>> >     is that
> >>> >      >                                 containers provide a solution
> >>> >     for this
> >>> >      >                                 exact issue. Both the
> "process" and
> >>> >      >                                 "external" environments are
> >>> >     explicitly
> >>> >      >                                 limiting and expanding their
> >>> >      >                                 capabilities will quickly
> have us
> >>> >      >                                 building something like a
> docker
> >>> >      >                                 container because we'll
> quickly find
> >>> >      >                                 ourselves solving the same
> >>> >     problems that
> >>> >      >                                 docker containers solve
> (resources,
> >>> >      >                                 file layout, permissions, ...)
> >>> >      >
> >>> >      >
> >>> >      >
> >>> >      >
> >>> >      >                                 On Thu, Apr 18, 2019 at 11:21 AM
> >>> >      >                                 Maximilian Michels
> >>> >      >                                 <m...@apache.org> wrote:
> >>> >      >
> >>> >      >                                     Hi everyone,
> >>> >      >
> >>> >      >                                     We have previously merged
> support
> >>> >      >                                     for configuring
> transforms across
> >>> >      >                                     languages. Please see
> Cham's
> >>> >     summary
> >>> >      >                                     on the discussion [1].
> There is
> >>> >      >                                     also a design document
> [2].
> >>> >      >
> >>> >      >                                     Subsequently, we've added
> >>> >     wrappers
> >>> >      >                                     for cross-language
> transforms
> >>> >     to the
> >>> >      >                                     Python SDK, i.e.
> >>> >     GenerateSequence,
> >>> >      >                                     ReadFromKafka, and there
> is a
> >>> >     pending
> >>> >      >                                     PR [1] for WriteToKafka.
> All
> >>> >     of them
> >>> >      >                                     utilize Java transforms
> via
> >>> >      >                                     cross-language
> configuration.
> >>> >      >
> >>> >      >                                     That is all pretty
> exciting :)
> >>> >      >
> >>> >      >                                     We still have some issues
> to
> >>> >     solve,
> >>> >      >                                     one being how to stage
> >>> >     artifact from
> >>> >      >                                     a foreign environment.
> When
> >>> >     we run
> >>> >      >                                     external transforms which
> are
> >>> >     part of
> >>> >      >                                     Beam's core (e.g.
> >>> >     GenerateSequence),
> >>> >      >                                     we have them available in
> the SDK
> >>> >      >                                     Harness. However, when
> they
> >>> >     are not
> >>> >      >                                     (e.g. KafkaIO) we need to
> >>> >     stage the
> >>> >      >                                     necessary files.
> >>> >      >
> >>> >      >                                     For my PR [3] I've
> naively added
> >>> >      >
>  ":beam-sdks-java-io-kafka" to
> >>> >     the SDK
> >>> >      >                                     Harness which caused
> dependency
> >>> >      >                                     problems [4]. Those could
> be
> >>> >     resolved
> >>> >      >                                     but the bigger question
> is how to
> >>> >      >                                     stage artifacts for
> external
> >>> >      >                                     transforms
> programmatically?
> >>> >      >
> >>> >      >                                     Heejong has solved this by
> >>> >     adding a
> >>> >      >                                     "--jar_package" option to
> the
> >>> >     Python
> >>> >      >                                     SDK to stage Java files
> [5].
> >>> >     I think
> >>> >      >                                     that is a better solution
> than
> >>> >      >                                     adding required Jars to
> the SDK
> >>> >      >                                     Harness directly, but it
> is
> >>> >     not very
> >>> >      >                                     convenient for users.
> >>> >      >
> >>> >      >                                     I've discussed this today
> with
> >>> >      >                                     Thomas and we both figured
> >>> >     that the
> >>> >      >                                     expansion service needs to
> >>> >     provide a
> >>> >      >                                     list of required Jars
> with the
> >>> >      >                                     ExpansionResponse it
> >>> >     provides. It's
> >>> >      >                                     not entirely clear, how we
> >>> >     determine
> >>> >      >                                     which artifacts are
> necessary
> >>> >     for an
> >>> >      >                                     external transform. We
> could just
> >>> >      >                                     dump the entire classpath
> >>> >     like we do
> >>> >      >                                     in PipelineResources for
> Java
> >>> >      >                                     pipelines. This provides
> many
> >>> >      >                                     unneeded classes but
> would work.
> >>> >      >
> >>> >      >                                     Do you think it makes
> sense
> >>> >     for the
> >>> >      >                                     expansion service to
> provide the
> >>> >      >                                     artifacts? Perhaps you
> have a
> >>> >     better
> >>> >      >                                     idea how to resolve the
> staging
> >>> >      >                                     problem in cross-language
> >>> >     pipelines?
> >>> >      >
> >>> >      >                                     Thanks,
> >>> >      >                                     Max
> >>> >      >
> >>> >      >                                     [1]
> >>> >      >
> >>> >
> https://lists.apache.org/thread.html/b99ba8527422e31ec7bb7ad9dc3a6583551ea392ebdc5527b5fb4a67@%3Cdev.beam.apache.org%3E
> >>> >      >
> >>> >      >                                     [2]
> >>> >      > https://s.apache.org/beam-cross-language-io
> >>> >      >
> >>> >      >                                     [3]
> >>> >      >
> https://github.com/apache/beam/pull/8322#discussion_r276336748
> >>> >      >
> >>> >      >                                     [4] Dependency graph for
> >>> >      >                                     beam-runners-direct-java:
> >>> >      >
> >>> >      >                                     beam-runners-direct-java ->
> >>> >      >                                     sdks-java-harness ->
> >>> >      >                                     beam-sdks-java-io-kafka ->
> >>> >      >                                     beam-runners-direct-java
> >>> >      >                                     ... and the cycle continues.
> >>> >      >
> >>> >      >                                     beam-runners-direct-java
> >>> >      >                                     depends on sdks-java-harness
> >>> >      >                                     due to the infamous Universal
> >>> >      >                                     Local Runner.
> >>> >      >                                     beam-sdks-java-io-kafka
> >>> >      >                                     depends on
> >>> >      >                                     beam-runners-direct-java for
> >>> >      >                                     running tests.
> >>> >      >
> >>> >      >                                     [5]
> >>> >      > https://github.com/apache/beam/pull/8340
> >>> >      >
> >>> >
>
>
