Thanks for the summary, Cham. It all makes sense. I agree that we want to
keep the option to manually specify artifacts.
There are a few unanswered questions, though.

(1) In what form will a transform author specify dependencies? For example,
a URL to a Maven repo, a URL to a local file, a blob?

Going forward, we probably want to support multiple ways. For now, we
could stick with a URL-based approach with support for different file
systems. In the future, a list of packages to retrieve from Maven/PyPI
would be useful.
(2) How will dependencies be included in the expansion response proto? String
(URL), bytes (blob)?

I'd go for a list of Protobuf strings first, but the format would have to
evolve for other dependency types.
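To make this concrete, here is a rough sketch of what such a list could look
like on the pipeline SDK side (the URLs, names, and structure below are made
up for illustration and are not an actual Beam schema):

    # Hypothetical example only: dependencies expressed as plain URL strings,
    # as they might come back in an expansion response.
    dependencies = [
        # a jar fetched from a Maven repository
        "https://repo.maven.apache.org/maven2/org/apache/beam/"
        "beam-sdks-java-io-kafka/2.12.0/beam-sdks-java-io-kafka-2.12.0.jar",
        # a jar on a local or distributed file system
        "file:///opt/transforms/my-transform-runtime-deps.jar",
    ]
    # The pipeline SDK would treat every entry as an artifact to stage,
    # independent of the file system behind the URL.

Keeping each entry an opaque string keeps the proto simple now and leaves room
to evolve it later (e.g. into a structured message for Maven/PyPI packages).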
(3) How will we manage/share transitive dependencies required at runtime?

I'd say transitive dependencies have to be included in the list. In the case
of fat jars, they are reduced to a single jar.
(4) How will dependencies be staged for various runner/SDK combinations? (for
example, portable runner/Flink, Dataflow runner)

Staging should be no different than it is now, i.e. it should go through
Beam's artifact staging service. As long as the protocol is stable, there
could also be different implementations.
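Roughly, the client-side flow I have in mind would look like this (a Python
sketch; the data is plain Python for illustration, real code would use the
Beam protos and the artifact staging service client):

    # Sketch: how a pipeline SDK could fold the dependencies returned by the
    # expansion service into the artifacts it already stages.
    def artifacts_to_stage(pipeline_artifacts, expansion_dependencies):
        """Combine the SDK's own artifacts with dependency URLs returned by
        the expansion service (see (1)/(2) above)."""
        staged = list(pipeline_artifacts)
        for deps in expansion_dependencies:
            staged.extend(deps)
        return staged

    # Example: one external transform that reported two jars.
    to_stage = artifacts_to_stage(
        ["./setup.py", "./requirements.txt"],
        [["file:///opt/transforms/kafka-io-deps.jar",
          "https://example.com/extra-runtime-dep.jar"]],
    )
    # 'to_stage' would then be handed to the artifact staging service,
    # exactly like the SDK's own artifacts are today.

The point is that the external transform's dependencies simply join the list
of artifacts the SDK stages anyway, so nothing special is required on the
runner side.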
-Max
On 20.04.19 03:08, Chamikara Jayalath wrote:
OK, sounds like this is a good path forward then.
* When starting up the expansion service, the user (who starts the service)
provides the dependencies necessary to expand transforms. We will later add
support for adding new transforms to an already running expansion service.
* As a part of the transform configuration, the transform author has the
option of providing a list of dependencies that will be needed to run the
transform.
* These dependencies will be sent back to the pipeline SDK as a part of the
expansion response, and the pipeline SDK will stage these resources.
* The pipeline author has the option of specifying the dependencies using a
pipeline option (for example, https://github.com/apache/beam/pull/8340).
I think the last option is important to (1) make existing transforms easily
available for cross-language usage without additional configuration and (2)
allow pipeline authors to override dependency versions specified in the
transform configuration (for example, to apply security patches) without
updating the expansion service.
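For example, on the Python side this could look roughly like the following
(the option name follows the PR above and may still change; everything else
is ordinary pipeline-options usage):

    # Hypothetical usage: the pipeline author supplies extra jars through a
    # pipeline option, overriding/augmenting what the transform configuration
    # declares. The '--jar_package' name is taken from the PR linked above
    # and may not be the final flag.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        '--runner=PortableRunner',
        '--job_endpoint=localhost:8099',
        '--jar_package=/path/to/patched-kafka-io-deps.jar',
    ])

    with beam.Pipeline(options=options) as p:
        ...  # pipeline that uses an external (Java) transform, e.g. ReadFromKafka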
There are a few unanswered questions, though.

(1) In what form will a transform author specify dependencies? For
example, a URL to a Maven repo, a URL to a local file, a blob?
(2) How will dependencies be included in the expansion response proto?
String (URL), bytes (blob)?
(3) How will we manage/share transitive dependencies required at runtime?
(4) How will dependencies be staged for various runner/SDK combinations?
(for example, portable runner/Flink, Dataflow runner)
Thanks,
Cham
On Fri, Apr 19, 2019 at 4:49 AM Maximilian Michels <m...@apache.org> wrote:

Thank you for your replies.

I did not suggest that the Expansion Service does the staging, but it would
return the required resources (e.g. jars) for the external transform's
runtime environment. The client then has to take care of staging the
resources.

The Expansion Service itself also needs resources to do the expansion. I
assumed those to be provided when starting the expansion service. I consider
it less important, but we could also provide a way to add new transforms to
the Expansion Service after startup.

Good point on Docker vs externally provided environments. For the PR [1] it
will suffice then to add Kafka to the container dependencies. The
"--jar_package" pipeline option is ok for now, but I'd like to see work
towards staging resources for external transforms via information returned
by the Expansion Service. That avoids users having to take care of including
the correct jars in their pipeline options.

These issues are related and we could discuss them in separate threads:
* Auto-discovery of the Expansion Service and its external transforms
* Credentials required during expansion / runtime

Thanks,
Max

[1] https://github.com/apache/beam/pull/8322
On 19.04.19 07:35, Thomas Weise wrote:
> Good discussion :)
>
> Initially the expansion service was considered a user responsibility, but I
> think that isn't necessarily the case. I can also see the expansion service
> provided as part of the infrastructure and the user not wanting to deal with
> it at all. For example, users may want to write Python transforms and use
> external IOs, without being concerned how these IOs are provided. Under such
> a scenario it would be good if:
>
> * Expansion service(s) can be auto-discovered via the job service endpoint
> * Available external transforms can be discovered via the expansion service(s)
> * Dependencies for external transforms are part of the metadata returned by
> the expansion service
>
> Dependencies could then be staged either by the SDK client or the expansion
> service. The expansion service could provide the locations to stage to the
> SDK; it would still be transparent to the user.
>
> I also agree with Luke regarding the environments. Docker is the choice for
> generic deployment. Other environments are used when the flexibility offered
> by Docker isn't needed (or gets in the way). Then the dependencies are
> provided in different ways. Whether these are Python packages or jar files,
> by opting out of Docker the decision is made to manage dependencies
> externally.
>
> Thomas
>
>
> On Thu, Apr 18, 2019 at 6:01 PM Chamikara Jayalath <chamik...@google.com> wrote:
>
> On Thu, Apr 18, 2019 at 5:21 PM Chamikara Jayalath <chamik...@google.com> wrote:
>
> Thanks for raising the concern about credentials, Ankur, I agree that this
> is a significant issue.
>
> On Thu, Apr 18, 2019 at 4:23 PM Lukasz Cwik <lc...@google.com> wrote:
>
> I can understand the concern about credentials, the same access concern
> will exist for several cross language transforms (mostly IOs) since some
> will need access to credentials to read/write to an external service.
>
> Are there any ideas on how credential propagation could work to these IOs?
>
>
> There are some cases where existing IO transforms need credentials to
> access remote resources, for example, size estimation, validation, etc. But
> usually these are optional (or the transform can be configured to not
> perform these functions).
>
> To clarify, I'm only talking about transform expansion here. Many IO
> transforms need read/write access to remote services at run time. So
> probably we need to figure out a way to propagate these credentials anyway.
>
> Can we use these mechanisms for staging?
>
> I think we'll have to find a way to do one of (1) propagate credentials to
> other SDKs, (2) allow users to configure SDK containers to have the
> necessary credentials, or (3) do the artifact staging from the pipeline SDK
> environment, which already has credentials. I prefer (1) or (2) since this
> will give a transform the same feature set whether used directly (in the
> same SDK language as the transform) or remotely, but it might be hard to do
> this for an arbitrary service that a transform might connect to, considering
> the number of ways users can configure credentials (after an offline
> discussion with Ankur).
>
>
> On Thu, Apr 18, 2019 at 3:47 PM Ankur Goenka <goe...@google.com> wrote:
>
> I agree that the Expansion service knows about the artifacts required for a
> cross language transform and having a prepackaged folder/zip for transforms
> based on language makes sense.
>
> One thing to note here is that the expansion service might not have the
> same access privileges as the pipeline author and hence might not be able
> to stage artifacts by itself.
> Keeping this in mind, I am leaning towards making the Expansion service
> provide all the required artifacts to the user and letting the user stage
> the artifacts as regular artifacts.
> At this time, we only have Beam File System based artifact staging, which
> uses local credentials to access different file systems. Even a docker
> based expansion service running on a local machine might not have the same
> access privileges.
>
> In brief, this is what I am leaning toward:
> User call for pipeline submission -> Expansion service provides cross
> language transforms and relevant artifacts to the SDK -> SDK submits the
> pipeline to the Jobserver and stages user and cross language artifacts to
> the artifact staging service
>
>
> On Thu, Apr 18, 2019 at 2:33 PM Chamikara Jayalath <chamik...@google.com> wrote:
>
> On Thu, Apr 18, 2019 at 2:12 PM Lukasz Cwik <lc...@google.com> wrote:
>
> Note that Max did ask whether making the expansion service do the staging
> made sense, and my first line was agreeing with that direction and
> expanding on how it could be done (so this is really Max's idea or from
> whomever he got the idea from).
>
> +1 to what Max said then :)
>
>
> I believe a lot of the value of the expansion service is not having users
> need to be aware of all the SDK specific dependencies when they are trying
> to create a pipeline; only the "user" who is launching the expansion
> service may need to. And in that case we can have a prepackaged expansion
> service application that does what most users would want (e.g. expansion
> service as a docker container, a single bundled jar, ...). We (the Apache
> Beam community) could choose to host a default implementation of the
> expansion service as well.
>
> I'm not against this. But I think this is a secondary, more advanced
> use-case. For a Beam user that needs to use a Java transform that they
> already have in a Python pipeline, we should provide a way to allow
> starting up an expansion service (with the dependencies needed for that)
> and running a pipeline that uses this external Java transform (with
> dependencies that are needed at runtime). Probably, it'll be enough to
> allow providing all dependencies when starting up the expansion service and
> allow the expansion service to do the staging of jars as well. I don't see
> a need to include the list of jars in the ExpansionResponse sent to the
> Python SDK.
>
>
> On Thu, Apr 18, 2019 at 2:02 PM Chamikara Jayalath <chamik...@google.com> wrote:
>
> I think there are two kinds of dependencies we have to consider.
>
> (1) Dependencies that are needed to expand the transform.
>
> These have to be provided when we start the expansion service so that
> available external transforms are correctly registered with the expansion
> service.
>
> (2) Dependencies that are not needed at expansion but may be needed at
> runtime.
>
> I think in both cases, users have to provide these dependencies either when
> the expansion service is started or when a pipeline is being executed.
>
> Max, I'm not sure why the expansion service will need to provide
> dependencies to the user since the user will already be aware of these. Are
> you talking about an expansion service that is readily available and will
> be used by many Beam users? I think such a (possibly long running) service
> will have to maintain a repository of transforms and should have a
> mechanism for registering new transforms and discovering already registered
> transforms, etc. I think there's more design work needed to make the
> transform expansion service support such use-cases. Currently, I think
> allowing the pipeline author to provide the jars when starting the
> expansion service and when executing the pipeline will be adequate.
>
> Regarding the entity that will perform the staging, I like Luke's idea of
> allowing the expansion service to do the staging (of jars provided by the
> user). The notion of artifacts and how they are extracted/represented is
> SDK dependent. So if the pipeline SDK tries to do this we have to add
> n x (n - 1) configurations (for n SDKs).
>
> - Cham
>
> On Thu, Apr 18, 2019 at 11:45 AM Lukasz Cwik <lc...@google.com> wrote:
>
> We can expose the artifact staging endpoint and artifact token to allow the
> expansion service to upload any resources its environment may need. For
> example, the expansion service for the Beam Java SDK would be able to
> upload jars.
>
> In the "docker" environment, the Apache Beam Java SDK harness container
> would fetch the relevant artifacts for itself and be able to execute the
> pipeline. (Note that a docker environment could skip all this artifact
> staging if the docker environment contained all necessary artifacts.)
>
> For the existing "external" environment, it should already come with all
> the resources prepackaged wherever "external" points to. The "process"
> based environment could choose to use the artifact staging service to fetch
> those resources associated with its process, or it could follow the same
> pattern that "external" would do and already contain all the prepackaged
> resources. Note that both "external" and "process" will require the
> instance of the expansion service to be specialized for those environments,
> which is why the default for the expansion service should be the "docker"
> environment.
>
> Note that a major reason for going with docker containers as the
> environment that all runners should support is that containers provide a
> solution for this exact issue. Both the "process" and "external"
> environments are explicitly limiting, and expanding their capabilities will
> quickly have us building something like a docker container because we'll
> quickly find ourselves solving the same problems that docker containers
> already solve (resources, file layout, permissions, ...).
>
> On Thu, Apr 18, 2019 at 11:21 AM Maximilian Michels <m...@apache.org> wrote:
>
> Hi everyone,
>
> We have previously merged support for configuring transforms across
> languages. Please see Cham's summary on the discussion [1]. There is also a
> design document [2].
>
> Subsequently, we've added wrappers for cross-language transforms to the
> Python SDK, i.e. GenerateSequence, ReadFromKafka, and there is a pending PR
> [1] for WriteToKafka. All of them utilize Java transforms via
> cross-language configuration.
>
> That is all pretty exciting :)
>
> We still have some issues to solve, one being how to stage artifacts from a
> foreign environment. When we run external transforms which are part of
> Beam's core (e.g. GenerateSequence), we have them available in the SDK
> Harness. However, when they are not (e.g. KafkaIO), we need to stage the
> necessary files.
>
> For my PR [3] I've naively added ":beam-sdks-java-io-kafka" to the SDK
> Harness, which caused dependency problems [4]. Those could be resolved, but
> the bigger question is how to stage artifacts for external transforms
> programmatically?
>
> Heejong has solved this by adding a "--jar_package" option to the Python
> SDK to stage Java files [5]. I think that is a better solution than adding
> required Jars to the SDK Harness directly, but it is not very convenient
> for users.
>
> I've discussed this today with Thomas and we both figured that the
> expansion service needs to provide a list of required Jars with the
> ExpansionResponse it provides. It's not entirely clear how we determine
> which artifacts are necessary for an external transform. We could just dump
> the entire classpath like we do in PipelineResources for Java pipelines.
> This provides many unneeded classes but would work.
>
> Do you think it makes sense for the expansion service to provide the
> artifacts? Perhaps you have a better idea how to resolve the staging
> problem in cross-language pipelines?
>
> Thanks,
> Max
>
> [1]
> https://lists.apache.org/thread.html/b99ba8527422e31ec7bb7ad9dc3a6583551ea392ebdc5527b5fb4a67@%3Cdev.beam.apache.org%3E
>
> [2] https://s.apache.org/beam-cross-language-io
>
> [3] https://github.com/apache/beam/pull/8322#discussion_r276336748
>
> [4] Dependency graph for beam-runners-direct-java:
>
> beam-runners-direct-java -> sdks-java-harness -> beam-sdks-java-io-kafka
> -> beam-runners-direct-java ... the cycle continues
>
> Beam-runners-direct-java depends on sdks-java-harness due to the infamous
> Universal Local Runner. Beam-sdks-java-io-kafka depends on
> beam-runners-direct-java for running tests.
>
> [5] https://github.com/apache/beam/pull/8340
>