One thing to consider that we've talked about in the past: it might
make sense to extend the environment proto and have the SDK be
explicit about which kinds of environment it supports.

+1 Encoding environment information there is a good idea.
> Seems it will create a default Docker URL even if
> harness_docker_image is set to None in pipeline options. Shall we add
> another option or honor the None in this option to support the
> process job?

Yes, if no Docker image is set, the default one will be used.
Currently, Docker is the only way to execute pipelines with the
PortableRunner. If the docker_image is not set, execution won't
succeed.
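A rough sketch of how the runner could honor an explicit None here
instead of always substituting the default image (the option name and
image tag are placeholders, not the actual PortableRunner code):

```python
# Hypothetical sketch: resolve the SDK harness environment from pipeline
# options, honoring an explicit None instead of forcing the Docker default.
# `options` is a plain dict standing in for PipelineOptions here.

DEFAULT_DOCKER_IMAGE = "apachebeam/python-sdk:latest"  # placeholder tag

def resolve_harness_environment(options):
    """Return ('docker', image) or ('process', None)."""
    if "harness_docker_image" not in options:
        # Option never set: keep today's behavior and default to Docker.
        return ("docker", DEFAULT_DOCKER_IMAGE)
    image = options["harness_docker_image"]
    if image is None:
        # Explicit None: the caller asked for a non-Docker (process) harness.
        return ("process", None)
    return ("docker", image)
```

With this, an unset option keeps the current Docker default, while a
deliberate None selects the process path rather than failing later.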
On 22.08.18 22:59, Xinyu Liu wrote:
We are also interested in this Process JobBundleFactory, as we are
planning to fork a process to run the Python SDK in the Samza runner
instead of using a Docker container, so this change will be helpful to
us too. On the same note, we are trying out portable_runner.py to
submit a Python job. It seems it will create a default Docker URL even
if harness_docker_image is set to None in pipeline options. Shall we
add another option, or honor the None in this option to support the
process job? I made some local changes for now to work around this.

Thanks,
Xinyu
On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde <hero...@google.com> wrote:
By "enum" in quotes, I meant the usual open URN-style pattern, not an
actual enum. Sorry if that wasn't clear.
On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik <lc...@google.com> wrote:
I would model the environment to be more free-form than enums, so that
we have forward-looking extensibility, and would suggest following the
same pattern we use for PTransforms (a URN and a URN-specific
payload). Note that in this case we may want to support a list of
supported environments (e.g. Java, Docker, Python, ...).
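The URN-plus-payload pattern Lukasz describes could look roughly like
this (illustrative Python standing in for the actual proto definition;
the URN strings are assumptions):

```python
# Illustrative only: model an Environment the way PTransforms are modeled,
# with an open-ended URN plus a URN-specific payload rather than an enum.
from dataclasses import dataclass

@dataclass
class Environment:
    urn: str        # e.g. "beam:env:docker:v1" (URN strings are illustrative)
    payload: bytes  # URN-specific payload, e.g. a serialized Docker config

docker_env = Environment(urn="beam:env:docker:v1",
                         payload=b"image: python-sdk:latest")
process_env = Environment(urn="beam:env:process:v1",
                          payload=b"command: /opt/harness/boot")

def describe(env):
    # A runner dispatches on the URN; unknown URNs can be rejected or
    # delegated, which is what gives the pattern forward extensibility.
    if env.urn == "beam:env:docker:v1":
        return "docker"
    if env.urn == "beam:env:process:v1":
        return "process"
    raise ValueError(f"unsupported environment: {env.urn}")
```

New environment kinds then only need a new URN and payload, with no
change to the message shape itself.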
On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde <hero...@google.com> wrote:
One thing to consider that we've talked about in the past: it might
make sense to extend the environment proto and have the SDK be
explicit about which kinds of environment it supports:
https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
This choice might impact which files are staged. Some SDKs, such as
Go, provide a compiled binary and _need_ to know what the target
architecture is. Running on a Mac with and without Docker, say,
requires a different worker in each case. If we add an "enum", we can
also easily add the external idea, where the SDK/user starts the SDK
harnesses instead of the runner. Each runner may not support all types
of environments.
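Henning's point that each runner may support only some environment
types could reduce, in sketch form, to a capability intersection
(illustrative Python; the URN strings are assumptions):

```python
# Illustrative sketch: the SDK advertises which environment URNs it can
# run under, the runner advertises which it can provide, and job
# submission picks from the intersection.
SDK_SUPPORTED = ["beam:env:docker:v1", "beam:env:process:v1"]

def pick_environment(runner_supported, sdk_supported=SDK_SUPPORTED):
    # Preserve the SDK's preference order; fail early when the runner
    # cannot provide any environment the SDK supports.
    for urn in sdk_supported:
        if urn in runner_supported:
            return urn
    raise ValueError("no mutually supported environment")
```

This also gives a natural place to surface a clear error at submission
time rather than at execution time.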
Henning
On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels <m...@apache.org> wrote:
For reference, here is the corresponding JIRA issue for this thread:
https://issues.apache.org/jira/browse/BEAM-5187
On 16.08.18 11:15, Maximilian Michels wrote:
> Makes sense to have an option to run the SDK harness in a
> non-dockerized environment.
>
> I'm in the process of creating a Docker entry point for Flink's
> JobServer [1]. I suppose you would also prefer to execute that one
> standalone. We should make sure this is also an option.
>
> [1] https://issues.apache.org/jira/browse/BEAM-4130
>
> On 16.08.18 07:42, Thomas Weise wrote:
>> Yes, that's the proposal. Everything that would otherwise be
>> packaged into the Docker container would need to be pre-installed in
>> the host environment. In the case of the Python SDK, that could
>> simply mean a (frozen) virtual environment that was set up when the
>> host was provisioned - the SDK harness process(es) will then just
>> utilize that. Of course this flavor of SDK harness execution could
>> also be useful in the local development environment, where right now
>> someone who already has the Python environment needs to also install
>> Docker and update a container to launch a Python SDK pipeline on the
>> Flink runner.
>>
>> On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira
>> <danolive...@google.com> wrote:
>>
>> I just want to clarify that I understand this correctly, since I'm
>> not that familiar with the details behind all these execution
>> environments yet. Is the proposal to create a new JobBundleFactory
>> that, instead of using Docker to create the environment that the new
>> processes will execute in, would execute the new processes directly
>> in the host environment? So in practice, if I ran a pipeline with
>> this JobBundleFactory, the SDK Harness and Runner Harness would both
>> be executing directly on my machine and would depend on me having
>> the dependencies already present on my machine?
>>
>> On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka <goe...@google.com>
>> wrote:
>>
>> Thanks for starting the discussion. I will be happy to help.
>> I agree, we should have a pluggable SDK harness environment factory.
>> We can register multiple environment factories using a service
>> registry and use the PipelineOptions to pick the right one on a
>> per-job basis.
>>
>> There are a couple of things which are required to be set up before
>> launching the process:
>>
>> * Setting up the environment as done in boot.go [4]
>> * Retrieving and putting the artifacts in the right location
>>
>> You can probably leverage the boot.go code to set up the
>> environment.
>>
>> Also, it will be useful to enumerate the pros and cons of the
>> different environments to help users choose the right one.
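Ankur's registry idea could be sketched like this (illustrative Python
only; Beam's actual bundle factories are Java classes, and the
`environment_type` option name here is an assumption):

```python
# Sketch of a pluggable JobBundleFactory registry: factories register
# themselves under an environment type, and a per-job pipeline option
# picks which one is used. All names here are hypothetical.
_FACTORIES = {}

def register_factory(env_type):
    def wrap(cls):
        _FACTORIES[env_type] = cls
        return cls
    return wrap

@register_factory("docker")
class DockerJobBundleFactory:
    def create(self, job):
        return f"docker bundle for {job}"

@register_factory("process")
class ProcessJobBundleFactory:
    def create(self, job):
        return f"process bundle for {job}"

def factory_for(pipeline_options):
    # The per-job pipeline option selects the factory, as suggested above.
    env_type = pipeline_options.get("environment_type", "docker")
    try:
        return _FACTORIES[env_type]()
    except KeyError:
        raise ValueError(f"no factory registered for {env_type!r}")
```

Adding a new execution mode then only means registering another
factory, with no change to job submission itself.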
>>
>>
>> On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise <t...@apache.org> wrote:
>>
>> Hi,
>>
>> Currently the portable Flink runner only works with SDK Docker
>> containers for execution (DockerJobBundleFactory, besides an
>> in-process (embedded) factory option for testing [1]). I'm
>> considering adding another out-of-process JobBundleFactory
>> implementation that directly forks the processes on the task manager
>> host, eliminating the need for Docker. This would work reasonably
>> well in environments where the dependencies (in this case Python)
>> can easily be tied into the host deployment (also within an
>> application-specific Kubernetes pod).
>>
>> There was already some discussion about alternative JobBundleFactory
>> implementations in [2]. There is also a JIRA to make the bundle
>> factory pluggable [3], pending availability of runner-level options.
>>
>> For a "ProcessBundleFactory", in addition to the Python dependencies
>> the environment would also need to have the Go boot executable [4]
>> (or a substitute thereof) to perform the harness initialization.
>>
>> Is anyone else interested in this SDK execution option or has
>> already investigated an alternative implementation?
>>
>> Thanks,
>> Thomas
>>
>> [1]
>> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
>>
>> [2]
>> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
>>
>> [3] https://issues.apache.org/jira/browse/BEAM-4819
>>
>> [4] https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
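A ProcessBundleFactory along the lines Thomas describes might fork the
harness roughly as follows (a sketch only; the boot path and flag
names are modeled on the container boot.go but should be treated as
assumptions):

```python
# Sketch: build the command a ProcessBundleFactory might use to fork the
# SDK harness via the Go boot executable instead of `docker run`.
import subprocess

def boot_command(worker_id, control_endpoint, logging_endpoint,
                 boot_path="/opt/apache/beam/boot"):
    # The flag names here mirror what the boot executable is expected to
    # consume; they are assumptions for illustration.
    return [
        boot_path,
        f"--id={worker_id}",
        f"--control_endpoint={control_endpoint}",
        f"--logging_endpoint={logging_endpoint}",
    ]

def launch_harness(worker_id, control_endpoint, logging_endpoint):
    # Fork directly on the task manager host; the (frozen) Python
    # virtualenv must already be provisioned there, as described above.
    return subprocess.Popen(boot_command(worker_id, control_endpoint,
                                         logging_endpoint))
```

The interesting difference from the Docker path is only in how the
command is assembled and launched; the harness protocol on the
endpoints stays the same.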
--
Max