I think "external" still needs some way (I was suggesting gRPC) to pass the control address, etc. to whatever starts up the workers.
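
To make that concrete, here is a rough sketch of the shape I have in mind. It is not an existing Beam API: `StartWorkerRequest`, `WorkerLauncher`, and the flag names are all hypothetical, and a plain method call stands in for the gRPC transport. The runner hands the endpoints to a user-run launcher, which appends them to whatever command starts the SDK harness.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class StartWorkerRequest:
    """Hypothetical message the runner would send to the worker launcher."""
    worker_id: str
    control_endpoint: str        # address the SDK harness connects back to
    logging_endpoint: str = ""   # optional; omitted from the command if empty


def build_worker_command(base_command: List[str],
                         request: StartWorkerRequest) -> List[str]:
    """Append the endpoints to a user-chosen harness command.

    The flag names are illustrative assumptions, not a confirmed contract.
    """
    command = list(base_command)
    command.append(f"--id={request.worker_id}")
    command.append(f"--control_endpoint={request.control_endpoint}")
    if request.logging_endpoint:
        command.append(f"--logging_endpoint={request.logging_endpoint}")
    return command


class WorkerLauncher:
    """Stand-in for a user-run service; a real one would be exposed over gRPC."""

    def __init__(self, base_command: List[str],
                 spawn: Callable[[List[str]], None]) -> None:
        self._base_command = base_command
        self._spawn = spawn  # e.g. subprocess.Popen in a real deployment
        self.started: Dict[str, List[str]] = {}

    def start_worker(self, request: StartWorkerRequest) -> List[str]:
        """Start one harness for the given request and remember its command."""
        if request.worker_id in self.started:
            raise ValueError(f"worker {request.worker_id!r} already started")
        command = build_worker_command(self._base_command, request)
        self._spawn(command)
        self.started[request.worker_id] = command
        return command


if __name__ == "__main__":
    launched: List[List[str]] = []
    # Record commands instead of spawning processes, so the sketch runs anywhere.
    launcher = WorkerLauncher(["python", "-m", "my_sdk_worker"],
                              spawn=launched.append)
    launcher.start_worker(StartWorkerRequest("worker-1", "localhost:50000"))
    print(launched[0])
```

Passing `spawn` in as a callable keeps the launcher agnostic about how the worker actually starts (a forked process, a sidecar container, a debugger session), which is the point of "external".
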
Also, +1 to making this a URN. Embedded makes sense too.

On Fri, Aug 24, 2018 at 6:00 AM Thomas Weise <t...@apache.org> wrote:
>
> Option #3 "external" would fit the Kubernetes use case we discussed a while
> ago also. Container(s) can be part of the same pod and need to find the
> runner.
>
> There is another option: "embedded". When the SDK is Java and the runner
> Flink (or all the other OSS runners), then the harness can (optionally) run
> embedded in the same JVM.
>
> Thanks,
> Thomas
>
>
> On Thu, Aug 23, 2018 at 9:14 AM Henning Rohde <hero...@google.com> wrote:
>>
>> A process-based SDK harness does not IMO imply that the host is fully
>> provisioned by the SDK/user and invoking the user command line in the
>> context of the staged files is a critical aspect for it to work. So I
>> consider staged artifact support needed. Also, I would like to suggest that
>> we move to a concrete environment proto to crystallize what is actually being
>> proposed. I'm not sure what activating a virtualenv would look like, for
>> example. To start things off:
>>
>> message Environment {
>>   string urn = 1;
>>   bytes payload = 2;
>> }
>>
>> // urn == "beam:env:docker:v1"
>> message DockerPayload {
>>   string container_image = 1;  // implicitly linux_amd64.
>> }
>>
>> // urn == "beam:env:process:v1"
>> message ProcessPayload {
>>   string os = 1;  // "linux", "darwin", ..
>>   string arch = 2;  // "amd64", ..
>>   string command_line = 3;
>> }
>>
>> // urn == "beam:env:external:v1"
>> // (no payload)
>>
>> A runner may support any subset and reject any unsupported configuration.
>> There are 3 kinds of environments that I think are useful:
>> (1) docker: works as currently. Offers the most flexibility for SDKs and
>> users, especially when the runner is outside the user's control (such as
>> hosted runners). The runner starts the SDK harnesses.
>> (2) process: as discussed here. The runner starts the SDK harnesses. The
>> semantics is that the shell command line is invoked in a directory rooted in
>> the staged artifacts with the container contract arguments. It is up to the
>> user and runner deployment to ensure that it makes sense, i.e., on windows a
>> linux binary or bash script is not specified. Executing the user command in
>> a shell env (bash, zsh, cmd, ..) ensures that paths and so on are set up,
>> i.e., specifying "java -jar foo" would actually work. Useful for cases where
>> the user controls both the SDK and runner (such as locally) or when docker
>> is not an option. Intended to be minimal and SDK/language agnostic.
>> (3) external: this is what I think Robert was alluding to. The runner does
>> not start any SDK harnesses. Instead it waits for user-controlled SDK
>> harnesses to connect. Useful for manually debugging SDK code (connect from
>> code running in a debugger) or when the user code must run in a special or
>> privileged environment. It's runner-specific how the SDK will need to
>> connect.
>>
>> Part of the idea of placing this information in the environment is that
>> pipelines can potentially use multiple environments, such as
>> cross-windows/linux.
>>
>> Henning
>>
>> On Thu, Aug 23, 2018 at 6:44 AM Thomas Weise <t...@apache.org> wrote:
>>>
>>> I would see support for staging libraries as optional / nice to have since
>>> that can also be done as part of host provisioning (i.e. in the Python case
>>> a virtual environment was already set up and just needs to be activated).
>>>
>>> Depending on how the command that launches the harness is configured,
>>> additional steps such as virtualenv activate or setting of other
>>> environment variables can be included as well.
>>>
>>>
>>> On Thu, Aug 23, 2018 at 5:15 AM Maximilian Michels <m...@apache.org> wrote:
>>>>
>>>> Just to recap:
>>>>
>>>> From this and the other thread ("Bootstraping Beam's Job Server") we
>>>> got sufficient evidence that process-based execution is a desired feature.
>>>>
>>>> Process-based execution as an alternative to dockerized execution
>>>> https://issues.apache.org/jira/browse/BEAM-5187
>>>>
>>>> Which parts are executed as a process?
>>>> => The SDK harness for user code
>>>>
>>>> What configuration options are supported?
>>>> => Provide information about the target architecture (OS/CPU)
>>>> => Staging libraries, as also supported by Docker
>>>> => Activating a pre-existing environment (e.g. virtualenv)
>>>>
>>>>
>>>> On 23.08.18 14:13, Maximilian Michels wrote:
>>>> >> One thing to consider that we've talked about in the past. It might
>>>> >> make sense to extend the environment proto and have the SDK be
>>>> >> explicit about which kinds of environment it supports
>>>> >
>>>> > +1 Encoding environment information there is a good idea.
>>>> >
>>>> >> Seems it will create a default docker url even if the
>>>> >> harness_docker_image is set to None in pipeline options. Shall we add
>>>> >> another option or honor the None in this option to support the process
>>>> >> job?
>>>> >
>>>> > Yes, if no Docker image is set the default one will be used. Currently
>>>> > Docker is the only way to execute pipelines with the PortableRunner. If
>>>> > the docker_image is not set, execution won't succeed.
>>>> >
>>>> > On 22.08.18 22:59, Xinyu Liu wrote:
>>>> >> We are also interested in this Process JobBundleFactory as we are
>>>> >> planning to fork a process to run python sdk in Samza runner, instead
>>>> >> of using docker container. So this change will be helpful to us too.
>>>> >> On the same note, we are trying out portable_runner.py to submit a
>>>> >> python job. Seems it will create a default docker url even if the
>>>> >> harness_docker_image is set to None in pipeline options. Shall we add
>>>> >> another option or honor the None in this option to support the process
>>>> >> job? I made some local changes right now to work around this.
>>>> >>
>>>> >> Thanks,
>>>> >> Xinyu
>>>> >>
>>>> >> On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde <hero...@google.com> wrote:
>>>> >>
>>>> >> By "enum" in quotes, I meant the usual open URN style pattern not an
>>>> >> actual enum. Sorry if that wasn't clear.
>>>> >>
>>>> >> On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik <lc...@google.com> wrote:
>>>> >>
>>>> >> I would model the environment to be more free form than enums
>>>> >> such that we have forward looking extensibility and would
>>>> >> suggest to follow the same pattern we use on PTransforms (using
>>>> >> a URN and a URN-specific payload). Note that in this case we
>>>> >> may want to support a list of supported environments (e.g. java,
>>>> >> docker, python, ...).
>>>> >>
>>>> >> On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde <hero...@google.com> wrote:
>>>> >>
>>>> >> One thing to consider that we've talked about in the past.
>>>> >> It might make sense to extend the environment proto and have
>>>> >> the SDK be explicit about which kinds of environment it
>>>> >> supports:
>>>> >>
>>>> >> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
>>>> >>
>>>> >> This choice might impact what files are staged or what not.
>>>> >> Some SDKs, such as Go, provide a compiled binary and _need_
>>>> >> to know what the target architecture is. Running on a mac
>>>> >> with and without docker, say, requires a different worker in
>>>> >> each case. If we add an "enum", we can also easily add the
>>>> >> external idea where the SDK/user starts the SDK harnesses
>>>> >> instead of the runner. Each runner may not support all types
>>>> >> of environments.
>>>> >>
>>>> >> Henning
>>>> >>
>>>> >> On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels <m...@apache.org> wrote:
>>>> >>
>>>> >> For reference, here is the corresponding JIRA issue for this thread:
>>>> >> https://issues.apache.org/jira/browse/BEAM-5187
>>>> >>
>>>> >> On 16.08.18 11:15, Maximilian Michels wrote:
>>>> >> > Makes sense to have an option to run the SDK harness in a
>>>> >> > non-dockerized environment.
>>>> >> >
>>>> >> > I'm in the process of creating a Docker entry point for Flink's
>>>> >> > JobServer[1]. I suppose you would also prefer to execute that one
>>>> >> > standalone. We should make sure this is also an option.
>>>> >> >
>>>> >> > [1] https://issues.apache.org/jira/browse/BEAM-4130
>>>> >> >
>>>> >> > On 16.08.18 07:42, Thomas Weise wrote:
>>>> >> >> Yes, that's the proposal. Everything that would otherwise be packaged
>>>> >> >> into the Docker container would need to be pre-installed in the host
>>>> >> >> environment. In the case of Python SDK, that could simply mean a
>>>> >> >> (frozen) virtual environment that was set up when the host was
>>>> >> >> provisioned - the SDK harness process(es) will then just utilize that.
>>>> >> >> Of course this flavor of SDK harness execution could also be useful in
>>>> >> >> the local development environment, where right now someone who already
>>>> >> >> has the Python environment needs to also install Docker and update a
>>>> >> >> container to launch a Python SDK pipeline on the Flink runner.
>>>> >> >>
>>>> >> >> On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira <danolive...@google.com> wrote:
>>>> >> >>
>>>> >> >> I just want to clarify that I understand this correctly since I'm
>>>> >> >> not that familiar with the details behind all these execution
>>>> >> >> environments yet. Is the proposal to create a new JobBundleFactory
>>>> >> >> that instead of using Docker to create the environment that the new
>>>> >> >> processes will execute in, this JobBundleFactory would execute the
>>>> >> >> new processes directly in the host environment? So in practice if I
>>>> >> >> ran a pipeline with this JobBundleFactory the SDK Harness and Runner
>>>> >> >> Harness would both be executing directly on my machine and would
>>>> >> >> depend on me having the dependencies already present on my machine?
>>>> >> >>
>>>> >> >> On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka <goe...@google.com> wrote:
>>>> >> >>
>>>> >> >> Thanks for starting the discussion. I will be happy to help.
>>>> >> >> I agree, we should have a pluggable SDKHarness environment Factory.
>>>> >> >> We can register multiple Environment factories using a service
>>>> >> >> registry and use the PipelineOption to pick the right one on a
>>>> >> >> per-job basis.
>>>> >> >>
>>>> >> >> There are a couple of things that need to be set up before
>>>> >> >> launching the process.
>>>> >> >>
>>>> >> >> * Setting up the environment as done in boot.go [4]
>>>> >> >> * Retrieving and putting the artifacts in the right location.
>>>> >> >>
>>>> >> >> You can probably leverage boot.go code to set up the environment.
>>>> >> >>
>>>> >> >> Also, it will be useful to enumerate pros and cons of different
>>>> >> >> Environments to help users choose the right one.
>>>> >> >>
>>>> >> >>
>>>> >> >> On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise <t...@apache.org> wrote:
>>>> >> >>
>>>> >> >> Hi,
>>>> >> >>
>>>> >> >> Currently the portable Flink runner only works with SDK
>>>> >> >> Docker containers for execution (DockerJobBundleFactory,
>>>> >> >> besides an in-process (embedded) factory option for testing
>>>> >> >> [1]). I'm considering adding another out of process
>>>> >> >> JobBundleFactory implementation that directly forks the
>>>> >> >> processes on the task manager host, eliminating the need for
>>>> >> >> Docker. This would work reasonably well in environments
>>>> >> >> where the dependencies (in this case Python) can easily be
>>>> >> >> tied into the host deployment (also within an application
>>>> >> >> specific Kubernetes pod).
>>>> >> >>
>>>> >> >> There was already some discussion about alternative
>>>> >> >> JobBundleFactory implementation in [2]. There is also a JIRA
>>>> >> >> to make the bundle factory pluggable [3], pending
>>>> >> >> availability of runner level options.
>>>> >> >>
>>>> >> >> For a "ProcessBundleFactory", in addition to the Python
>>>> >> >> dependencies the environment would also need to have the Go
>>>> >> >> boot executable [4] (or a substitute thereof) to perform the
>>>> >> >> harness initialization.
>>>> >> >>
>>>> >> >> Is anyone else interested in this SDK execution option or
>>>> >> >> has already investigated an alternative implementation?
>>>> >> >>
>>>> >> >> Thanks,
>>>> >> >> Thomas
>>>> >> >>
>>>> >> >> [1] https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
>>>> >> >>
>>>> >> >> [2] https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
>>>> >> >>
>>>> >> >> [3] https://issues.apache.org/jira/browse/BEAM-4819
>>>> >> >>
>>>> >> >> [4] https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
>>>> >> >>
>>>> >>
>>>> >> -- Max
>>>> >>
>>>> >>
>>>> >
>>>>
>>>> --
>>>> Max