Good discussion :)

Initially the expansion service was considered a user responsibility, but I
think that isn't necessarily the case. I can also see the expansion service
provided as part of the infrastructure and the user not wanting to deal
with it at all. For example, users may want to write Python transforms and
use external IOs, without being concerned with how these IOs are provided. In
such a scenario it would be good if:

* Expansion service(s) can be auto-discovered via the job service endpoint
* Available external transforms can be discovered via the expansion
service(s)
* Dependencies for external transforms are part of the metadata returned by
the expansion service

Dependencies could then be staged either by the SDK client or the expansion
service. The expansion service could provide the staging locations to the
SDK; either way it would remain transparent to the user.
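
To make that a bit more concrete, here is a minimal sketch of the kind of
metadata such a discovery call could return. Everything below is hypothetical
(the class, field names, and URN are only illustrative, not an existing API):

# Hypothetical sketch only: the shape of per-transform metadata an expansion
# service could advertise for discovery. None of these names exist in Beam.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExternalTransformInfo:
    urn: str          # e.g. "beam:external:java:kafka:read:v1"
    sdk: str          # language/environment the transform runs in
    dependencies: List[str] = field(default_factory=list)  # artifacts to stage

# A discovery response could simply be a list of such entries; the SDK client
# (or the expansion service itself) then stages each entry's dependencies
# through the regular artifact staging service.
available_transforms = [
    ExternalTransformInfo(
        urn="beam:external:java:kafka:read:v1",
        sdk="java",
        dependencies=["beam-sdks-java-io-kafka.jar", "kafka-clients.jar"]),
]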

I also agree with Luke regarding the environments. Docker is the choice for
generic deployment. Other environments are used when the flexibility
offered by Docker isn't needed (or gets in the way). The dependencies are
then provided in different ways. Whether these are Python packages or jar
files, opting out of Docker means choosing to manage dependencies externally.
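
In terms of what the user actually sets, the environment choice roughly
surfaces through the portable pipeline options, something like the sketch
below (values are only illustrative; the exact flags may vary by SDK and
runner version):

# Sketch of how the three environment choices surface as pipeline options.
from apache_beam.options.pipeline_options import PipelineOptions

# Default: docker; the SDK harness container fetches the staged artifacts.
docker_opts = PipelineOptions(['--environment_type=DOCKER'])

# Opting out of docker: the user takes over dependency management for the
# harness process (the command path below is only an example).
process_opts = PipelineOptions([
    '--environment_type=PROCESS',
    '--environment_config={"command": "/opt/apache/beam/boot"}'])

# "external": the harness is already running somewhere with its resources
# prepackaged; we only point the runner at it.
external_opts = PipelineOptions([
    '--environment_type=EXTERNAL',
    '--environment_config=localhost:50000'])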

Thomas



On Thu, Apr 18, 2019 at 6:01 PM Chamikara Jayalath <chamik...@google.com>
wrote:

>
>
> On Thu, Apr 18, 2019 at 5:21 PM Chamikara Jayalath <chamik...@google.com>
> wrote:
>
>> Thanks for raising the concern about credentials, Ankur. I agree that this
>> is a significant issue.
>>
>> On Thu, Apr 18, 2019 at 4:23 PM Lukasz Cwik <lc...@google.com> wrote:
>>
>>> I can understand the concern about credentials; the same access concern
>>> will exist for several cross-language transforms (mostly IOs) since some
>>> will need access to credentials to read/write to an external service.
>>>
>>> Are there any ideas on how credential propagation could work for these
>>> IOs?
>>>
>>
>> There are some cases where existing IO transforms need credentials to
>> access remote resources, for example for size estimation, validation, etc. But
>> usually these are optional (or the transform can be configured not to perform
>> these functions).
>>
>
> To clarify, I'm only talking about transform expansion here. Many IO
> transforms need read/write access to remote services at run time. So
> probably we need to figure out a way to propagate these credentials anyway.
>
>
>>
>>
>>> Can we use these mechanisms for staging?
>>>
>>
>> I think we'll have to find a way to do one of: (1) propagate credentials
>> to other SDKs, (2) allow users to configure SDK containers to have the
>> necessary credentials, or (3) do the artifact staging from the pipeline SDK
>> environment, which already has credentials. I prefer (1) or (2) since this
>> will give a transform the same feature set whether it is used directly (in
>> the same SDK language as the transform) or remotely, but it might be hard
>> to do this for an arbitrary service that a transform might connect to,
>> considering the number of ways users can configure credentials (after an
>> offline discussion with Ankur).
>>
>>
>>>
>>>
>>
>>> On Thu, Apr 18, 2019 at 3:47 PM Ankur Goenka <goe...@google.com> wrote:
>>>
>>>> I agree that the Expansion service knows about the artifacts required
>>>> for a cross-language transform, and that having a prepackaged folder/zip for
>>>> transforms based on language makes sense.
>>>>
>>>> One thing to note here is that the expansion service might not have the
>>>> same access privileges as the pipeline author and hence might not be able to
>>>> stage artifacts by itself.
>>>> Keeping this in mind, I am leaning towards making the expansion service
>>>> provide all the required artifacts to the user and letting the user stage the
>>>> artifacts as regular artifacts.
>>>> At this time, we only have Beam FileSystem-based artifact staging, which
>>>> uses local credentials to access different file systems. Even a docker-based
>>>> expansion service running on the local machine might not have the same
>>>> access privileges.
>>>>
>>>> In brief, this is what I am leaning toward:
>>>> User calls for pipeline submission -> Expansion service provides
>>>> cross-language transforms and relevant artifacts to the SDK -> SDK submits
>>>> the pipeline to the job server and stages user and cross-language artifacts
>>>> to the artifact staging service.
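>>>>
>>>> As a rough sketch of that flow (all names below are made up purely for
>>>> illustration; this is not an existing API):
>>>>
>>>> # Illustrative pseudo-flow only; the objects and method names are invented.
>>>> def submit_pipeline(pipeline, expansion_service, job_service):
>>>>     # 1. Expansion: the service returns the expanded transforms plus the
>>>>     #    artifacts (e.g. jars) those transforms need at runtime.
>>>>     expanded, external_artifacts = expansion_service.expand(pipeline)
>>>>
>>>>     # 2. The SDK merges its own artifacts with the cross-language ones, so
>>>>     #    everything is staged with the pipeline author's credentials.
>>>>     artifacts = pipeline.local_artifacts() + external_artifacts
>>>>
>>>>     # 3. The SDK submits the pipeline to the job server and stages all
>>>>     #    artifacts through the regular artifact staging service.
>>>>     staging_token = job_service.stage_artifacts(artifacts)
>>>>     return job_service.run(expanded, staging_token)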
>>>>
>>>>
>>>> On Thu, Apr 18, 2019 at 2:33 PM Chamikara Jayalath <
>>>> chamik...@google.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Thu, Apr 18, 2019 at 2:12 PM Lukasz Cwik <lc...@google.com> wrote:
>>>>>
>>>>>> Note that Max did ask whether making the expansion service do the
>>>>>> staging made sense, and my first line was agreeing with that direction and
>>>>>> expanding on how it could be done (so this is really Max's idea, or
>>>>>> whomever he got the idea from).
>>>>>>
>>>>>
>>>>> +1 to what Max said then :)
>>>>>
>>>>>
>>>>>>
>>>>>> I believe a lot of the value of the expansion service is that users don't
>>>>>> need to be aware of all the SDK-specific dependencies when they are
>>>>>> trying to create a pipeline; only the "user" who is launching the expansion
>>>>>> service may need to be. And in that case we can have a prepackaged expansion
>>>>>> service application that does what most users would want (e.g. expansion
>>>>>> service as a docker container, a single bundled jar, ...). We (the Apache
>>>>>> Beam community) could choose to host a default implementation of the
>>>>>> expansion service as well.
>>>>>>
>>>>>
>>>>> I'm not against this. But I think this is a secondary, more advanced
>>>>> use-case. For a Beam user that needs to use a Java transform that they
>>>>> already have in a Python pipeline, we should provide a way to allow
>>>>> starting up an expansion service (with the dependencies needed for that) and
>>>>> running a pipeline that uses this external Java transform (with
>>>>> dependencies that are needed at runtime). Probably, it'll be enough to
>>>>> allow providing all dependencies when starting up the expansion service and
>>>>> to allow the expansion service to do the staging of jars as well. I don't
>>>>> see a need to include the list of jars in the ExpansionResponse sent to the
>>>>> Python SDK.
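>>>>>
>>>>> For the common case, the Python side could then stay as simple as pointing
>>>>> the wrapper at the separately started expansion service, roughly like the
>>>>> sketch below (the exact wrapper signature and module path may differ; the
>>>>> address and topic are just examples):
>>>>>
>>>>> # Sketch: assumes an expansion service with the Kafka IO jars on its
>>>>> # classpath is already running on localhost:8097.
>>>>> import apache_beam as beam
>>>>> from apache_beam.io.external.kafka import ReadFromKafka
>>>>>
>>>>> with beam.Pipeline() as p:
>>>>>     _ = (
>>>>>         p
>>>>>         | ReadFromKafka(
>>>>>             consumer_config={'bootstrap.servers': 'localhost:9092'},
>>>>>             topics=['my_topic'],
>>>>>             expansion_service='localhost:8097')
>>>>>         | beam.Map(print))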
>>>>>
>>>>>
>>>>>>
>>>>>> On Thu, Apr 18, 2019 at 2:02 PM Chamikara Jayalath <
>>>>>> chamik...@google.com> wrote:
>>>>>>
>>>>>>> I think there are two kinds of dependencies we have to consider.
>>>>>>>
>>>>>>> (1) Dependencies that are needed to expand the transform.
>>>>>>>
>>>>>>> These have to be provided when we start the expansion service so
>>>>>>> that available external transforms are correctly registered with the
>>>>>>> expansion service.
>>>>>>>
>>>>>>> (2) Dependencies that are not needed at expansion but may be needed
>>>>>>> at runtime.
>>>>>>>
>>>>>>> I think in both cases, users have to provide these dependencies
>>>>>>> either when the expansion service is started or when a pipeline is being
>>>>>>> executed.
>>>>>>>
>>>>>>> Max, I'm not sure why the expansion service will need to provide
>>>>>>> dependencies to the user since the user will already be aware of these. Are
>>>>>>> you talking about an expansion service that is readily available and will
>>>>>>> be used by many Beam users? I think such a (possibly long-running) service
>>>>>>> will have to maintain a repository of transforms and should have a
>>>>>>> mechanism for registering new transforms and discovering already registered
>>>>>>> transforms etc. I think there's more design work needed to make the
>>>>>>> transform expansion service support such use-cases. Currently, I think
>>>>>>> allowing the pipeline author to provide the jars when starting the
>>>>>>> expansion service and when executing the pipeline will be adequate.
>>>>>>>
>>>>>>> Regarding the entity that will perform the staging, I like Luke's
>>>>>>> idea of allowing the expansion service to do the staging (of jars provided
>>>>>>> by the user). The notion of artifacts and how they are extracted/represented
>>>>>>> is SDK-dependent. So if the pipeline SDK tries to do this we have to add
>>>>>>> n x (n - 1) configurations (for n SDKs).
>>>>>>>
>>>>>>> - Cham
>>>>>>>
>>>>>>> On Thu, Apr 18, 2019 at 11:45 AM Lukasz Cwik <lc...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> We can expose the artifact staging endpoint and artifact token to
>>>>>>>> allow the expansion service to upload any resources its environment may
>>>>>>>> need. For example, the expansion service for the Beam Java SDK would be
>>>>>>>> able to upload jars.
>>>>>>>>
>>>>>>>> In the "docker" environment, the Apache Beam Java SDK harness
>>>>>>>> container would fetch the relevant artifacts for itself and be able to
>>>>>>>> execute the pipeline. (Note that a docker environment could skip all 
>>>>>>>> this
>>>>>>>> artifact staging if the docker environment contained all necessary
>>>>>>>> artifacts).
>>>>>>>>
>>>>>>>> For the existing "external" environment, it should already come
>>>>>>>> with all the resources prepackaged wherever "external" points to. The
>>>>>>>> "process" based environment could choose to use the artifact staging
>>>>>>>> service to fetch those resources associated with its process or it 
>>>>>>>> could
>>>>>>>> follow the same pattern that "external" would do and already contain 
>>>>>>>> all
>>>>>>>> the prepackaged resources. Note that both "external" and "process" will
>>>>>>>> require the instance of the expansion service to be specialized for 
>>>>>>>> those
>>>>>>>> environments which is why the default should for the expansion service 
>>>>>>>> to
>>>>>>>> be the "docker" environment.
>>>>>>>>
>>>>>>>> Note that a major reason for going with docker containers as the
>>>>>>>> environment that all runners should support is that containers provide a
>>>>>>>> solution for this exact issue. Both the "process" and "external"
>>>>>>>> environments are explicitly limited, and expanding their capabilities will
>>>>>>>> quickly have us building something like a docker container, because we'll
>>>>>>>> quickly find ourselves solving the same problems that docker containers
>>>>>>>> already solve (resources, file layout, permissions, ...).
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Apr 18, 2019 at 11:21 AM Maximilian Michels <m...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> We have previously merged support for configuring transforms across
>>>>>>>>> languages. Please see Cham's summary of the discussion [1]. There is
>>>>>>>>> also a design document [2].
>>>>>>>>>
>>>>>>>>> Subsequently, we've added wrappers for cross-language transforms to the
>>>>>>>>> Python SDK, i.e. GenerateSequence, ReadFromKafka, and there is a pending
>>>>>>>>> PR [1] for WriteToKafka. All of them utilize Java transforms via
>>>>>>>>> cross-language configuration.
>>>>>>>>>
>>>>>>>>> That is all pretty exciting :)
>>>>>>>>>
>>>>>>>>> We still have some issues to solve, one being how to stage artifacts
>>>>>>>>> from a foreign environment. When we run external transforms which are
>>>>>>>>> part of Beam's core (e.g. GenerateSequence), we have them available in
>>>>>>>>> the SDK Harness. However, when they are not (e.g. KafkaIO) we need to
>>>>>>>>> stage the necessary files.
>>>>>>>>>
>>>>>>>>> For my PR [3] I've naively added ":beam-sdks-java-io-kafka" to the SDK
>>>>>>>>> Harness, which caused dependency problems [4]. Those could be resolved,
>>>>>>>>> but the bigger question is how to stage artifacts for external
>>>>>>>>> transforms programmatically?
>>>>>>>>>
>>>>>>>>> Heejong has solved this by adding a "--jar_package" option to the Python
>>>>>>>>> SDK to stage Java files [5]. I think that is a better solution than
>>>>>>>>> adding required jars to the SDK Harness directly, but it is not very
>>>>>>>>> convenient for users.
>>>>>>>>>
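>>>>>>>>> Usage would presumably look roughly like the following (the flag name is
>>>>>>>>> taken from [5]; the runner and jar path are only examples):
>>>>>>>>>
>>>>>>>>> # Sketch of passing extra jars via the option added in [5]; the exact
>>>>>>>>> # semantics are defined by that PR, the values here are only examples.
>>>>>>>>> from apache_beam.options.pipeline_options import PipelineOptions
>>>>>>>>>
>>>>>>>>> options = PipelineOptions([
>>>>>>>>>     '--runner=PortableRunner',
>>>>>>>>>     '--jar_package=/path/to/beam-sdks-java-io-kafka.jar'])
>>>>>>>>>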
>>>>>>>>> I've discussed this today with Thomas and we both figured that the
>>>>>>>>> expansion service needs to provide a list of required jars with the
>>>>>>>>> ExpansionResponse it returns. It's not entirely clear how we determine
>>>>>>>>> which artifacts are necessary for an external transform. We could just
>>>>>>>>> dump the entire classpath like we do in PipelineResources for Java
>>>>>>>>> pipelines. This would include many unneeded classes but would work.
>>>>>>>>>
>>>>>>>>> Do you think it makes sense for the expansion service to provide the
>>>>>>>>> artifacts? Perhaps you have a better idea of how to resolve the staging
>>>>>>>>> problem in cross-language pipelines?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Max
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>>
>>>>>>>>> https://lists.apache.org/thread.html/b99ba8527422e31ec7bb7ad9dc3a6583551ea392ebdc5527b5fb4a67@%3Cdev.beam.apache.org%3E
>>>>>>>>>
>>>>>>>>> [2] https://s.apache.org/beam-cross-language-io
>>>>>>>>>
>>>>>>>>> [3] https://github.com/apache/beam/pull/8322#discussion_r276336748
>>>>>>>>>
>>>>>>>>> [4] Dependency graph for beam-runners-direct-java:
>>>>>>>>>
>>>>>>>>> beam-runners-direct-java -> sdks-java-harness ->
>>>>>>>>> beam-sdks-java-io-kafka
>>>>>>>>> -> beam-runners-direct-java ... the cycle continues
>>>>>>>>>
>>>>>>>>> Beam-runners-direct-java depends on sdks-java-harness due
>>>>>>>>> to the infamous Universal Local Runner. Beam-sdks-java-io-kafka
>>>>>>>>> depends
>>>>>>>>> on beam-runners-direct-java for running tests.
>>>>>>>>>
>>>>>>>>> [5] https://github.com/apache/beam/pull/8340
>>>>>>>>>
>>>>>>>>
