One thing to consider that we've talked about in the past: it might
make sense to extend the environment proto and have the SDK be
explicit about which kinds of environment it supports.

+1 Encoding environment information there is a good idea.
> Seems it will create a default Docker URL even if
> harness_docker_image is set to None in pipeline options. Shall we add
> another option or honor the None in this option to support the
> process job?

Yes, if no Docker image is set, the default one will be used.
Currently, Docker is the only way to execute pipelines with the
PortableRunner. If the docker_image is not set, execution won't
succeed.
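A rough sketch of how the runner could honor an explicit None here
instead of always substituting the default image (the option name and
image tag are placeholders, not the actual PortableRunner code):

```python
# Hypothetical sketch: resolve the SDK harness environment from pipeline
# options, honoring an explicit None instead of forcing the Docker default.
# `options` is a plain dict standing in for PipelineOptions here.

DEFAULT_DOCKER_IMAGE = "apachebeam/python-sdk:latest"  # placeholder tag

def resolve_harness_environment(options):
    """Return ('docker', image) or ('process', None)."""
    if "harness_docker_image" not in options:
        # Option never set: keep today's behavior and default to Docker.
        return ("docker", DEFAULT_DOCKER_IMAGE)
    image = options["harness_docker_image"]
    if image is None:
        # Explicit None: the caller asked for a non-Docker (process) harness.
        return ("process", None)
    return ("docker", image)
```

With this, an unset option keeps the current Docker default, while a
deliberate None selects the process path rather than failing later.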
On 22.08.18 22:59, Xinyu Liu wrote:
We are also interested in this Process JobBundleFactory, as we are
planning to fork a process to run the Python SDK in the Samza runner
instead of using a Docker container, so this change will be helpful to
us too. On the same note, we are trying out portable_runner.py to
submit a Python job. It seems it will create a default Docker URL even
if harness_docker_image is set to None in pipeline options. Shall we
add another option, or honor the None in this option to support the
process job? I made some local changes for now to work around this.

Thanks,
Xinyu
On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde <hero...@google.com> wrote:
By "enum" in quotes, I meant the usual open URN-style pattern, not an
actual enum. Sorry if that wasn't clear.
On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik <lc...@google.com> wrote:
I would model the environment to be more free-form than enums, so that
we have forward-looking extensibility, and would suggest following the
same pattern we use for PTransforms (a URN and a URN-specific
payload). Note that in this case we may want to support a list of
supported environments (e.g. Java, Docker, Python, ...).
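The URN-plus-payload pattern Lukasz describes could look roughly like
this (illustrative Python standing in for the actual proto definition;
the URN strings are assumptions):

```python
# Illustrative only: model an Environment the way PTransforms are modeled,
# with an open-ended URN plus a URN-specific payload rather than an enum.
from dataclasses import dataclass

@dataclass
class Environment:
    urn: str        # e.g. "beam:env:docker:v1" (URN strings are illustrative)
    payload: bytes  # URN-specific payload, e.g. a serialized Docker config

docker_env = Environment(urn="beam:env:docker:v1",
                         payload=b"image: python-sdk:latest")
process_env = Environment(urn="beam:env:process:v1",
                          payload=b"command: /opt/harness/boot")

def describe(env):
    # A runner dispatches on the URN; unknown URNs can be rejected or
    # delegated, which is what gives the pattern forward extensibility.
    if env.urn == "beam:env:docker:v1":
        return "docker"
    if env.urn == "beam:env:process:v1":
        return "process"
    raise ValueError(f"unsupported environment: {env.urn}")
```

New environment kinds then only need a new URN and payload, with no
change to the message shape itself.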
On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde <hero...@google.com> wrote:
One thing to consider that we've talked about in the past: it might
make sense to extend the environment proto and have the SDK be
explicit about which kinds of environment it supports:
https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
This choice might impact which files are staged. Some SDKs, such as
Go, provide a compiled binary and _need_ to know what the target
architecture is. Running on a Mac with and without Docker, say,
requires a different worker in each case. If we add an "enum", we can
also easily add the external idea, where the SDK/user starts the SDK
harnesses instead of the runner. Each runner may not support all types
of environments.
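Henning's point that each runner may support only some environment
types could reduce, in sketch form, to a capability intersection
(illustrative Python; the URN strings are assumptions):

```python
# Illustrative sketch: the SDK advertises which environment URNs it can
# run under, the runner advertises which it can provide, and job
# submission picks from the intersection.
SDK_SUPPORTED = ["beam:env:docker:v1", "beam:env:process:v1"]

def pick_environment(runner_supported, sdk_supported=SDK_SUPPORTED):
    # Preserve the SDK's preference order; fail early when the runner
    # cannot provide any environment the SDK supports.
    for urn in sdk_supported:
        if urn in runner_supported:
            return urn
    raise ValueError("no mutually supported environment")
```

This also gives a natural place to surface a clear error at submission
time rather than at execution time.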
Henning
On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels <m...@apache.org> wrote:
For reference, here is the corresponding JIRA issue for this thread:
https://issues.apache.org/jira/browse/BEAM-5187
On 16.08.18 11:15, Maximilian Michels wrote:
> Makes sense to have an option to run the SDK harness in a
> non-dockerized environment.
>
> I'm in the process of creating a Docker entry point for Flink's
> JobServer [1]. I suppose you would also prefer to execute that one
> standalone. We should make sure this is also an option.
>
> [1] https://issues.apache.org/jira/browse/BEAM-4130
>
> On 16.08.18 07:42, Thomas Weise wrote:
>> Yes, that's the proposal. Everything that would otherwise be
>> packaged into the Docker container would need to be pre-installed in
>> the host environment. In the case of the Python SDK, that could
>> simply mean a (frozen) virtual environment that was set up when the
>> host was provisioned - the SDK harness process(es) will then just
>> utilize that. Of course this flavor of SDK harness execution could
>> also be useful in the local development environment, where right now
>> someone who already has the Python environment needs to also install
>> Docker and update a container to launch a Python SDK pipeline on the
>> Flink runner.
>>
>> On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira
>> <danolive...@google.com> wrote:
>>
>> I just want to clarify that I understand this correctly, since I'm
>> not that familiar with the details behind all these execution
>> environments yet. Is the proposal to create a new JobBundleFactory
>> that, instead of using Docker to create the environment that the new
>> processes will execute in, would execute the new processes directly
>> in the host environment? So in practice, if I ran a pipeline with
>> this JobBundleFactory, the SDK Harness and Runner Harness would both
>> be executing directly on my machine and would depend on me having
>> the dependencies already present on my machine?
>>
>> On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka <goe...@google.com>
>> wrote:
>>
>> Thanks for starting the discussion. I will be happy to help.
>> I agree, we should have a pluggable SDK harness environment factory.
>> We can register multiple environment factories using a service
>> registry and use the PipelineOptions to pick the right one on a
>> per-job basis.
>>
>> There are a couple of things which are required to be set up before
>> launching the process:
>>
>> * Setting up the environment as done in boot.go [4]
>> * Retrieving and putting the artifacts in the right location
>>
>> You can probably leverage the boot.go code to set up the
>> environment.
>>
>> Also, it will be useful to enumerate the pros and cons of the
>> different environments to help users choose the right one.
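Ankur's registry idea could be sketched like this (illustrative Python
only; Beam's actual bundle factories are Java classes, and the
`environment_type` option name here is an assumption):

```python
# Sketch of a pluggable JobBundleFactory registry: factories register
# themselves under an environment type, and a per-job pipeline option
# picks which one is used. All names here are hypothetical.
_FACTORIES = {}

def register_factory(env_type):
    def wrap(cls):
        _FACTORIES[env_type] = cls
        return cls
    return wrap

@register_factory("docker")
class DockerJobBundleFactory:
    def create(self, job):
        return f"docker bundle for {job}"

@register_factory("process")
class ProcessJobBundleFactory:
    def create(self, job):
        return f"process bundle for {job}"

def factory_for(pipeline_options):
    # The per-job pipeline option selects the factory, as suggested above.
    env_type = pipeline_options.get("environment_type", "docker")
    try:
        return _FACTORIES[env_type]()
    except KeyError:
        raise ValueError(f"no factory registered for {env_type!r}")
```

Adding a new execution mode then only means registering another
factory, with no change to job submission itself.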
>>
>>
>> On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise <t...@apache.org> wrote:
>>
>> Hi,
>>
>> Currently the portable Flink runner only works with SDK Docker
>> containers for execution (DockerJobBundleFactory, besides an
>> in-process (embedded) factory option for testing [1]). I'm
>> considering adding another out-of-process JobBundleFactory
>> implementation that directly forks the processes on the task manager
>> host, eliminating the need for Docker. This would work reasonably
>> well in environments where the dependencies (in this case Python)
>> can easily be tied into the host deployment (also within an
>> application-specific Kubernetes pod).
>>
>> There was already some discussion about alternative JobBundleFactory
>> implementations in [2]. There is also a JIRA to make the bundle
>> factory pluggable [3], pending availability of runner-level options.
>>
>> For a "ProcessBundleFactory", in addition to the Python dependencies
>> the environment would also need to have the Go boot executable [4]
>> (or a substitute thereof) to perform the harness initialization.
>>
>> Is anyone else interested in this SDK execution option or has
>> already investigated an alternative implementation?
>>
>> Thanks,
>> Thomas
>>
>> [1]
>> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
>>
>> [2]
>> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
>>
>> [3] https://issues.apache.org/jira/browse/BEAM-4819
>>
>> [4] https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
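A ProcessBundleFactory along the lines Thomas describes might fork the
harness roughly as follows (a sketch only; the boot path and flag
names are modeled on the container boot.go but should be treated as
assumptions):

```python
# Sketch: build the command a ProcessBundleFactory might use to fork the
# SDK harness via the Go boot executable instead of `docker run`.
import subprocess

def boot_command(worker_id, control_endpoint, logging_endpoint,
                 boot_path="/opt/apache/beam/boot"):
    # The flag names here mirror what the boot executable is expected to
    # consume; they are assumptions for illustration.
    return [
        boot_path,
        f"--id={worker_id}",
        f"--control_endpoint={control_endpoint}",
        f"--logging_endpoint={logging_endpoint}",
    ]

def launch_harness(worker_id, control_endpoint, logging_endpoint):
    # Fork directly on the task manager host; the (frozen) Python
    # virtualenv must already be provisioned there, as described above.
    return subprocess.Popen(boot_command(worker_id, control_endpoint,
                                         logging_endpoint))
```

The interesting difference from the Docker path is only in how the
command is assembled and launched; the harness protocol on the
endpoints stays the same.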
--
Max