Re: Graal instead of docker?

2018-05-12 Thread Romain Manni-Bucau
I'm too far from a computer to write the needed mail here, but I didn't
intend to hurt anyone. I was referring to the current code and trying to
reconcile users with the integrations companies need.

I will try to clarify it more when I get something other than a phone to
answer with, but please don't be hurt by any sentence; it doesn't serve anyone.


On Sat, May 12, 2018 at 07:33, Davor Bonaci wrote:

> This thread is extremely valuable. It poses hard questions. It strengthens
> good arguments. It teaches a way of thinking. It gives feedback. I want to
> thank Romain in particular for driving it, and everyone who has
> participated thus far.
>
> That being said, the exchange has crossed the line on the part of multiple
> actors. I request a pause of 72 hours from *everyone*. It will help to cool
> down, and digest the conversation so far. In addition, the PMC would
> appreciate that time to process things and potentially advise/steer the
> conversation.
>
> On Fri, May 11, 2018 at 12:42 PM, Kenneth Knowles  wrote:
>
>> Romain,
>>
>> You probably did not mean to, but I think this message crosses outside
>> the expected code of conduct.
>>
>> On Fri, May 11, 2018 at 11:48 AM Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>>
>>> Also, the Beam community is Java - don't answer that it is Python or Go
>>> without checking ;). I'm not sure adding a new language will help or give
>>> the project a face that people will want to contribute to or use.
>>>
>>
>> The Beam community includes contributors and users of the Java, Python,
>> and Go SDKs. This remark denigrates the work of people building and using
>> Beam in Python and Go. Please be careful in the words that you choose and,
>> most importantly, please be open, empathetic, and welcoming to these
>> members of our community.
>>
>> Kenn
>>
>>
>
>


Re: Graal instead of docker?

2018-05-12 Thread Davor Bonaci
This thread is extremely valuable. It poses hard questions. It strengthens
good arguments. It teaches a way of thinking. It gives feedback. I want to
thank Romain in particular for driving it, and everyone who has
participated thus far.

That being said, the exchange has crossed the line on the part of multiple
actors. I request a pause of 72 hours from *everyone*. It will help to cool
down, and digest the conversation so far. In addition, the PMC would
appreciate that time to process things and potentially advise/steer the
conversation.

On Fri, May 11, 2018 at 12:42 PM, Kenneth Knowles  wrote:

> Romain,
>
> You probably did not mean to, but I think this message crosses outside the
> expected code of conduct.
>
> On Fri, May 11, 2018 at 11:48 AM Romain Manni-Bucau 
> wrote:
>
>>
>> Also, the Beam community is Java - don't answer that it is Python or Go
>> without checking ;). I'm not sure adding a new language will help or give
>> the project a face that people will want to contribute to or use.
>>
>
> The Beam community includes contributors and users of the Java, Python,
> and Go SDKs. This remark denigrates the work of people building and using
> Beam in Python and Go. Please be careful in the words that you choose and,
> most importantly, please be open, empathetic, and welcoming to these
> members of our community.
>
> Kenn
>
>


Re: Graal instead of docker?

2018-05-11 Thread Kenneth Knowles
Romain,

You probably did not mean to, but I think this message crosses outside the
expected code of conduct.

On Fri, May 11, 2018 at 11:48 AM Romain Manni-Bucau 
wrote:

>
> Also, the Beam community is Java - don't answer that it is Python or Go
> without checking ;). I'm not sure adding a new language will help or give
> the project a face that people will want to contribute to or use.
>

The Beam community includes contributors and users of the Java, Python, and
Go SDKs. This remark denigrates the work of people building and using Beam
in Python and Go. Please be careful in the words that you choose and, most
importantly, please be open, empathetic, and welcoming to these members of
our community.

Kenn


Re: Graal instead of docker?

2018-05-11 Thread Eugene Kirpichov
On Fri, May 11, 2018 at 11:48 AM Romain Manni-Bucau 
wrote:

>
>
> On Fri, May 11, 2018 at 18:15, Andrew Pilloud wrote:
>
>> JSON and Protobuf aren't the same thing. JSON is for exchanging
>> unstructured data, Protobuf is for exchanging structured data. The point of
>> Portability is to define a protocol for exchanging structured messages
>> across languages. What do you propose using on top of JSON to define
>> message structure?
>>
>
> I'm fine with protobuf contracts, but not with all the rest (libs*). JSON
> has the advantage of not requiring much from consumers and being easy to
> integrate and proxy. Protobuf imposes a lot on that layer, which will be
> typed by the runner anyway, so there is no need for two typing layers.
>
>
>> I'd like to see the generic runner rewritten in Golang so we can
>> eliminate the significant overhead imposed by the JVM. I would argue that
>> Go is the best language for low overhead infrastructure, and is already
>> widely used by projects in this space such as Docker, Kubernetes, InfluxDB.
>> Even SQL can take advantage of this. For example, several runners could be
>> passed raw SQL and use their own SQL engines to implement more efficient
>> transforms than generic Beam can. Users will save significant $$$ on
>> infrastructure by not having to involve the JVM at all.
>>
>
> Yes...or no. JVM overhead is very low for such infra, less than 64M of RAM
> and almost no CPU, so it will not help much for clusters or long-lived
> processes like the ones we are talking about.
>
> Also, the Beam community is Java - don't answer that it is Python or Go
> without checking ;). I'm not sure adding a new language will help or give
> the project a face that people will want to contribute to or use.
>
> Currently I'd say the way the router runner is done is a detail, but the
> choice to rethink the current impl is a crucial architectural point.
>
> No issue with having N router impls too: one in Java, one in Go (but I
> thought we had very few Go resources/lovers in Beam?), one in Python, where
> it would make a lot of sense and would show a router without an actual
> primitive impl (delegating to the direct Java runner), etc.
>
> But one thing at a time: is anyone against stopping the current impl track,
> reverting it, and moving to a higher-level runner?
>
I am still not clear as to exactly what kind of change you are proposing.
First it looked like you were proposing to not have a hard dependency on
Docker, and that got resolved (we don't). Now it sounds like you're against
Java protobuf libraries, but that doesn't strike me as something warranting
an architecture change. If you have something else in mind, I'm afraid from
your recent emails I can't tell what it is.

Could you please create a document detailing:
- What precisely are the issues you see with some of the current
portability APIs
- What you think those APIs should look like instead
- What you think the path from the current implementation to your desired
state would look like
- If you're proposing major changes to the direction of work of a large
number of people, then please also elaborate in your document as to what
impact your proposal will have on the current work, or how this impact can
be minimized.

Please make sure to scan the pre-existing portability design documents to
see if similar concerns had already been discussed before. As others have
pointed out in this thread several times, almost everything that you're
asking has already been discussed. If you believe the discussion of a
particular issue has been insufficient, feel free to re-raise it on the
mailing list, by linking to the previous discussion and elaborating what
aspect you think has been missed; if you can't find the discussion of a
particular crucial design decision, feel free to raise that on the mailing
list too and people will be happy to help you find it.

I would also like to ask you to adjust the tone of your comments such as
"beam is being driven by an implementation instead of a clear and scalable
architecture", "All the work done looks like an implementation detail of
one runner+vendor corrupting all the project" and "A bad architecture which
doesn't embrace the community". These kinds of comments, to me, sound not
only unconstructively vague, but dismissive of the years of design and
implementation work done by dozens of people in this area. We are doing
something that has never been done before, and the APIs and implementation
are not perfect and will continue evolving, but there are much more
effective and friendly ways to point out use cases where they fail or ways
in which they can be improved.



Re: Graal instead of docker?

2018-05-11 Thread Reuven Lax
Romain, if we are specifically discussing the use of protocol buffers and
gRPC, this is the result of community discussion on the dev list back in
2016. Many options were considered: JSON, Thrift, Kryo, and proto among
them. The decision that protocol buffers and gRPC were the best solutions
for the portability fnAPI was arrived at via a community discussion that
many people took part in.

Reuven

On Fri, May 11, 2018 at 11:48 AM Romain Manni-Bucau 
wrote:

>
>
> On Fri, May 11, 2018 at 18:15, Andrew Pilloud wrote:
>
>> JSON and Protobuf aren't the same thing. JSON is for exchanging
>> unstructured data, Protobuf is for exchanging structured data. The point of
>> Portability is to define a protocol for exchanging structured messages
>> across languages. What do you propose using on top of JSON to define
>> message structure?
>>
>
> I'm fine with protobuf contracts, but not with all the rest (libs*). JSON
> has the advantage of not requiring much from consumers and being easy to
> integrate and proxy. Protobuf imposes a lot on that layer, which will be
> typed by the runner anyway, so there is no need for two typing layers.
>
>
>> I'd like to see the generic runner rewritten in Golang so we can
>> eliminate the significant overhead imposed by the JVM. I would argue that
>> Go is the best language for low overhead infrastructure, and is already
>> widely used by projects in this space such as Docker, Kubernetes, InfluxDB.
>> Even SQL can take advantage of this. For example, several runners could be
>> passed raw SQL and use their own SQL engines to implement more efficient
>> transforms than generic Beam can. Users will save significant $$$ on
>> infrastructure by not having to involve the JVM at all.
>>
>
> Yes...or no. JVM overhead is very low for such infra, less than 64M of RAM
> and almost no CPU, so it will not help much for clusters or long-lived
> processes like the ones we are talking about.
>
> Also, the Beam community is Java - don't answer that it is Python or Go
> without checking ;). I'm not sure adding a new language will help or give
> the project a face that people will want to contribute to or use.
>
> Currently I'd say the way the router runner is done is a detail, but the
> choice to rethink the current impl is a crucial architectural point.
>
> No issue with having N router impls too: one in Java, one in Go (but I
> thought we had very few Go resources/lovers in Beam?), one in Python, where
> it would make a lot of sense and would show a router without an actual
> primitive impl (delegating to the direct Java runner), etc.
>
> But one thing at a time: is anyone against stopping the current impl track,
> reverting it, and moving to a higher-level runner?
>

Re: Graal instead of docker?

2018-05-11 Thread Romain Manni-Bucau
On Fri, May 11, 2018 at 18:15, Andrew Pilloud wrote:

> JSON and Protobuf aren't the same thing. JSON is for exchanging
> unstructured data, Protobuf is for exchanging structured data. The point of
> Portability is to define a protocol for exchanging structured messages
> across languages. What do you propose using on top of JSON to define
> message structure?
>

I'm fine with protobuf contracts, but not with all the rest (libs*). JSON has
the advantage of not requiring much from consumers and being easy to
integrate and proxy. Protobuf imposes a lot on that layer, which will be
typed by the runner anyway, so there is no need for two typing layers.


> I'd like to see the generic runner rewritten in Golang so we can eliminate
> the significant overhead imposed by the JVM. I would argue that Go is the
> best language for low overhead infrastructure, and is already widely used
> by projects in this space such as Docker, Kubernetes, InfluxDB. Even SQL
> can take advantage of this. For example, several runners could be passed
> raw SQL and use their own SQL engines to implement more efficient
> transforms than generic Beam can. Users will save significant $$$ on
> infrastructure by not having to involve the JVM at all.
>

Yes...or no. JVM overhead is very low for such infra, less than 64M of RAM
and almost no CPU, so it will not help much for clusters or long-lived
processes like the ones we are talking about.

Also, the Beam community is Java - don't answer that it is Python or Go
without checking ;). I'm not sure adding a new language will help or give
the project a face that people will want to contribute to or use.

Currently I'd say the way the router runner is done is a detail, but the
choice to rethink the current impl is a crucial architectural point.

No issue with having N router impls too: one in Java, one in Go (but I
thought we had very few Go resources/lovers in Beam?), one in Python, where
it would make a lot of sense and would show a router without an actual
primitive impl (delegating to the direct Java runner), etc.

But one thing at a time: is anyone against stopping the current impl track,
reverting it, and moving to a higher-level runner?



Re: Graal instead of docker?

2018-05-11 Thread Andrew Pilloud
JSON and Protobuf aren't the same thing. JSON is for exchanging
unstructured data, Protobuf is for exchanging structured data. The point of
Portability is to define a protocol for exchanging structured messages
across languages. What do you propose using on top of JSON to define
message structure?
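
To make the question concrete, here is a minimal sketch of what "something on top of JSON" has to do: the JSON parser accepts any well-formed document, so field names and types must be checked by a separate layer. The `WorkRequest` message, its two fields, and the `decode` helper are hypothetical names chosen for illustration; they are not part of any Beam API.

```python
import json
from dataclasses import dataclass

# Hypothetical message shape, for illustration only (not a Beam definition).
EXPECTED_FIELDS = {"instruction_id": str, "bundle_size": int}


@dataclass(frozen=True)
class WorkRequest:
    instruction_id: str
    bundle_size: int


def decode(raw: str) -> WorkRequest:
    """Parse JSON text, then enforce the structure JSON itself does not carry."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != set(EXPECTED_FIELDS):
        raise ValueError(f"expected exactly the fields {sorted(EXPECTED_FIELDS)}")
    for name, expected_type in EXPECTED_FIELDS.items():
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"{name}: expected {expected_type.__name__}")
    return WorkRequest(**data)
```

With a schema language such as Protobuf, the equivalent of `EXPECTED_FIELDS` and the checks in `decode` is generated from the message definition for each language rather than written by hand.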

I'd like to see the generic runner rewritten in Golang so we can eliminate
the significant overhead imposed by the JVM. I would argue that Go is the
best language for low overhead infrastructure, and is already widely used
by projects in this space such as Docker, Kubernetes, InfluxDB. Even SQL
can take advantage of this. For example, several runners could be passed
raw SQL and use their own SQL engines to implement more efficient
transforms than generic Beam can. Users will save significant $$$ on
infrastructure by not having to involve the JVM at all.

Andrew

On Fri, May 11, 2018 at 8:53 AM Romain Manni-Bucau 
wrote:

>
>
> On Wed, May 9, 2018 at 17:41, Eugene Kirpichov wrote:
>
>>
>>
>> On Wed, May 9, 2018 at 1:08 AM Romain Manni-Bucau 
>> wrote:
>>
>>>
>>>
>>> On Wed, May 9, 2018 at 00:57, Henning Rohde wrote:
>>>
>>>> There are indeed lots of possibilities for interesting docker
>>>> alternatives with different tradeoffs and capabilities, but in general
>>>> both the runner as well as the SDK must support them for it to work. As
>>>> mentioned, docker (as used in the container contract) is meant as a
>>>> flexible main option but not necessarily the only option. I see no problem
>>>> with certain pipeline-SDK-runner combinations additionally supporting a
>>>> specialized setup. Pipeline can be a factor, because some transforms
>>>> might depend on aspects of the runtime environment -- such as system
>>>> libraries or shelling out to a /bin/foo.
>>>>
>>>> The worker boot code is tied to the current container contract, so
>>>> pre-launched workers would presumably not use that code path and are not
>>>> bound by its assumptions. In particular, such a setup might want to invert
>>>> who initiates the connection from the SDK worker to the runner. Pipeline
>>>> options and global state in the SDK and user functions process might make
>>>> it difficult to safely reuse worker processes across pipelines, but also
>>>> doable in certain scenarios.
>>>>
>>>
>>> This is actually not that hard, and most Java environments do it.
>>>
>>> My main concerns are 1. being tied to an impl detail and 2. a bad
>>> architecture which doesn't embrace the community.
>>>
>> Could you please be more specific? Concerns about Docker dependency have
>> already been repeatedly addressed in this thread.
>>
>
> My concern is that Beam is being driven by an implementation instead of a
> clear and scalable architecture.
>
> The best demonstration is the protobuf usage, which is far from being the
> best choice for portability these days due to the implications of its stack
> in several languages (nobody wants it on their classpath in Java/Scala these
> days, for instance, because of conflicts or the security care it requires).
> JSON is well tooled and trivial to use with whatever lib you want to rely
> on, in any language or environment, to cite just one alternative.
>
> Being portable (language) is a good goal but IMHO requires:
>
> 1. Runners in each language (otherwise fall back on JSR 223 and you are
> good with just a JSON facade)
> 2. A generic runner able to route each task to the right native runner
> 3. A way to run in a single runner when relevant (keep in mind most Java
> users don't even want to see Python or portable code or APIs in their
> classpath and runner)
>
>
>
>

Re: Graal instead of docker?

2018-05-11 Thread Romain Manni-Bucau
On Wed, May 9, 2018 at 17:41, Eugene Kirpichov wrote:

>
>
> On Wed, May 9, 2018 at 1:08 AM Romain Manni-Bucau 
> wrote:
>
>>
>>
>> On Wed, May 9, 2018 at 00:57, Henning Rohde wrote:
>>
>>> There are indeed lots of possibilities for interesting docker
>>> alternatives with different tradeoffs and capabilities, but in general
>>> both the runner as well as the SDK must support them for it to work. As
>>> mentioned, docker (as used in the container contract) is meant as a
>>> flexible main option but not necessarily the only option. I see no problem
>>> with certain pipeline-SDK-runner combinations additionally supporting a
>>> specialized setup. Pipeline can be a factor, because some transforms
>>> might depend on aspects of the runtime environment -- such as system
>>> libraries or shelling out to a /bin/foo.
>>>
>>> The worker boot code is tied to the current container contract, so
>>> pre-launched workers would presumably not use that code path and are not
>>> bound by its assumptions. In particular, such a setup might want to invert
>>> who initiates the connection from the SDK worker to the runner. Pipeline
>>> options and global state in the SDK and user functions process might make
>>> it difficult to safely reuse worker processes across pipelines, but also
>>> doable in certain scenarios.
>>>
>>
>> This is actually not that hard, and most Java environments do it.
>>
>> My main concerns are 1. being tied to an impl detail and 2. a bad
>> architecture which doesn't embrace the community.
>>
> Could you please be more specific? Concerns about Docker dependency have
> already been repeatedly addressed in this thread.
>

My concern is that Beam is being driven by an implementation instead of a
clear and scalable architecture.

The best demonstration is the protobuf usage, which is far from being the
best choice for portability these days due to the implications of its stack
in several languages (nobody wants it on their classpath in Java/Scala these
days, for instance, because of conflicts or the security care it requires).
JSON is well tooled and trivial to use with whatever lib you want to rely
on, in any language or environment, to cite just one alternative.

Being portable (language) is a good goal but IMHO requires:

1. Runners in each language (otherwise fall back on JSR 223 and you are
good with just a JSON facade)
2. A generic runner able to route each task to the right native runner
3. A way to run in a single runner when relevant (keep in mind most Java
users don't even want to see Python or portable code or APIs in their
classpath and runner)
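
As a rough illustration of points 2 and 3, a generic runner can be pictured as a dispatch table from language to native runner, with a shortcut when every task uses the same language. This is only a sketch under stated assumptions: `RouterRunner`, the `(language, payload)` task shape, and the lambda runners are hypothetical and do not correspond to existing Beam classes.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical task shape: (language, payload). Not a Beam type.
Task = Tuple[str, str]
NativeRunner = Callable[[str], str]


class RouterRunner:
    """Routes each task to the native runner registered for its language."""

    def __init__(self, native_runners: Dict[str, NativeRunner]) -> None:
        self._native_runners = dict(native_runners)

    def single_language(self, tasks: List[Task]) -> Optional[str]:
        """Return the language if all tasks share one (point 3), else None."""
        languages = {language for language, _ in tasks}
        return next(iter(languages)) if len(languages) == 1 else None

    def run(self, tasks: List[Task]) -> List[str]:
        """Dispatch every task to the matching native runner (point 2)."""
        results = []
        for language, payload in tasks:
            runner = self._native_runners.get(language)
            if runner is None:
                raise LookupError(f"no native runner registered for {language!r}")
            results.append(runner(payload))
        return results


# Stand-in native runners; real ones would execute the task.
router = RouterRunner({
    "java": lambda payload: f"java ran {payload}",
    "python": lambda payload: f"python ran {payload}",
})
```

When `single_language` returns a language, a caller could hand the whole pipeline to that one native runner and skip the routing layer entirely.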




>
>>
>>
>>
>>> Henning
>>>
>>> On Tue, May 8, 2018 at 3:51 PM Thomas Weise  wrote:
>>>


>>>> On Sat, May 5, 2018 at 3:58 PM, Robert Bradshaw
>>>> wrote:
>>>>
>>>>>
>>>>> I would welcome changes to
>>>>>
>>>>> https://github.com/apache/beam/blob/v2.4.0/model/pipeline/src/main/proto/beam_runner_api.proto#L730
>>>>> that would provide alternatives to docker (one of which comes to mind
>>>>> is "I already brought up a worker(s) for you (which could be the same
>>>>> process that handled pipeline construction in testing scenarios),
>>>>> here's how to connect to it/them.") Another option, which would seem
>>>>> to appeal to you in particular, would be "the worker code is linked
>>>>> into the runner's binary, use this process as the worker" (though note
>>>>> even for java-on-java, it can be advantageous to shield the worker and
>>>>> runner code from each other's environments, dependencies, and version
>>>>> requirements.) This latter should still likely use the FnApi to talk
>>>>> to itself (either over gRPC on local ports, or possibly better via
>>>>> direct function calls eliminating the RPC overhead altogether--this is
>>>>> how the fast local runner in Python works). There may be runner
>>>>> environments well controlled enough that "start up the workers" could
>>>>> be specified as "run this command line." We should make this
>>>>> environment message extensible to other alternatives than "docker
>>>>> container url," though of course we don't want the set of options to
>>>>> grow too large or we lose the promise of portability unless every
>>>>> runner supports every protocol.
>>>>>
>>>>>
>>>> The pre-launched worker would be an interesting option, which might
>>>> work well for a sidecar deployment.
>>>>
>>>> The current worker boot code though makes the assumption that the
>>>> runner endpoint to phone home to is known when the process is launched.
>>>> That doesn't work so well with a runner that establishes its endpoint
>>>> dynamically. Also, the assumption is baked in that a worker will only serve
>>>> a single pipeline (provisioning API etc.).
>>>>
>>>> Thanks,
>>>> Thomas


>>>
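
The environment alternatives discussed in the quoted messages above (a docker container URL, a pre-launched worker, a worker linked into the runner's process, and a command line) can be pictured as one extensible tagged union. The sketch below is an assumption-laden illustration: none of these class or field names are taken from beam_runner_api.proto, and `startup_plan` only describes what a runner would do rather than doing it.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class DockerEnvironment:
    container_url: str  # hypothetical field name


@dataclass(frozen=True)
class PreLaunchedEnvironment:
    endpoint: str  # where the already-running worker(s) can be reached


@dataclass(frozen=True)
class InProcessEnvironment:
    pass  # worker code is linked into the runner's own process


@dataclass(frozen=True)
class CommandLineEnvironment:
    command: Tuple[str, ...]  # command the runner would execute


Environment = Union[
    DockerEnvironment,
    PreLaunchedEnvironment,
    InProcessEnvironment,
    CommandLineEnvironment,
]


def startup_plan(env: Environment) -> str:
    """Describe how a runner would obtain workers for the given environment."""
    if isinstance(env, DockerEnvironment):
        return f"start container {env.container_url}"
    if isinstance(env, PreLaunchedEnvironment):
        return f"connect to existing worker at {env.endpoint}"
    if isinstance(env, InProcessEnvironment):
        return "use this process as the worker"
    if isinstance(env, CommandLineEnvironment):
        return "run " + " ".join(env.command)
    raise TypeError(f"unsupported environment: {env!r}")
```

Every runner has to handle each variant it claims to support, which is why the set of variants should stay small if portability is to hold.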


Re: Graal instead of docker?

2018-05-09 Thread Eugene Kirpichov
On Wed, May 9, 2018 at 1:08 AM Romain Manni-Bucau 
wrote:

>
>
> On Wed, May 9, 2018 at 00:57, Henning Rohde wrote:
>
>> There are indeed lots of possibilities for interesting docker
>> alternatives with different tradeoffs and capabilities, but in general
>> both the runner as well as the SDK must support them for it to work. As
>> mentioned, docker (as used in the container contract) is meant as a
>> flexible main option but not necessarily the only option. I see no problem
>> with certain pipeline-SDK-runner combinations additionally supporting a
>> specialized setup. Pipeline can be a factor, because some transforms
>> might depend on aspects of the runtime environment -- such as system
>> libraries or shelling out to a /bin/foo.
>>
>> The worker boot code is tied to the current container contract, so
>> pre-launched workers would presumably not use that code path and are not
>> bound by its assumptions. In particular, such a setup might want to invert
>> who initiates the connection from the SDK worker to the runner. Pipeline
>> options and global state in the SDK and user functions process might make
>> it difficult to safely reuse worker processes across pipelines, but also
>> doable in certain scenarios.
>>
>
> This is actually not that hard, and most Java environments do it.
>
> My main concerns are 1. being tied to an impl detail and 2. a bad
> architecture which doesn't embrace the community.
>
Could you please be more specific? Concerns about Docker dependency have
already been repeatedly addressed in this thread.


>
>
>
>> Henning
>>
>> On Tue, May 8, 2018 at 3:51 PM Thomas Weise  wrote:
>>
>>>
>>>
>>> On Sat, May 5, 2018 at 3:58 PM, Robert Bradshaw 
>>> wrote:
>>>

>>>> I would welcome changes to
>>>>
>>>> https://github.com/apache/beam/blob/v2.4.0/model/pipeline/src/main/proto/beam_runner_api.proto#L730
>>>> that would provide alternatives to docker (one of which comes to mind
>>>> is "I already brought up a worker(s) for you (which could be the same
>>>> process that handled pipeline construction in testing scenarios),
>>>> here's how to connect to it/them.") Another option, which would seem
>>>> to appeal to you in particular, would be "the worker code is linked
>>>> into the runner's binary, use this process as the worker" (though note
>>>> even for java-on-java, it can be advantageous to shield the worker and
>>>> runner code from each other's environments, dependencies, and version
>>>> requirements.) This latter should still likely use the FnApi to talk
>>>> to itself (either over gRPC on local ports, or possibly better via
>>>> direct function calls eliminating the RPC overhead altogether--this is
>>>> how the fast local runner in Python works). There may be runner
>>>> environments well controlled enough that "start up the workers" could
>>>> be specified as "run this command line." We should make this
>>>> environment message extensible to other alternatives than "docker
>>>> container url," though of course we don't want the set of options to
>>>> grow too large or we lose the promise of portability unless every
>>>> runner supports every protocol.


>>> The pre-launched worker would be an interesting option, which might work
>>> well for a sidecar deployment.
>>>
>>> The current worker boot code though makes the assumption that the runner
>>> endpoint to phone home to is known when the process is launched. That
>>> doesn't work so well with a runner that establishes its endpoint
>>> dynamically. Also, the assumption is baked in that a worker will only serve
>>> a single pipeline (provisioning API etc.).
>>>
>>> Thanks,
>>> Thomas
>>>
>>>
>>


Re: Graal instead of docker?

2018-05-09 Thread Romain Manni-Bucau
Le mer. 9 mai 2018 00:57, Henning Rohde  a écrit :

> There are indeed lots of possibilities for interesting docker alternatives
> with different tradeoffs and capabilities, but in generally both the runner
> as well as the SDK must support them for it to work. As mentioned, docker
> (as used in the container contract) is meant as a flexible main option but
> not necessarily the only option. I see no problem with certain
> pipeline-SDK-runner combinations additionally supporting a specialized
> setup. Pipeline can be a factor, because that some transforms might depend
> on aspects of the runtime environment -- such as system libraries or
> shelling out to a /bin/foo.
>
> The worker boot code is tied to the current container contract, so
> pre-launched workers would presumably not use that code path and are not be
> bound by its assumptions. In particular, such a setup might want to invert
> who initiates the connection from the SDK worker to the runner. Pipeline
> options and global state in the SDK and user functions process might make
> it difficult to safely reuse worker processes across pipelines, but also
> doable in certain scenarios.
>

This is actually not that hard, and most Java environments do it.

The main concerns are 1. being tied to an implementation detail and 2. a bad
architecture which doesn't embrace the community.



> Henning
>
> On Tue, May 8, 2018 at 3:51 PM Thomas Weise  wrote:
>
>>
>>
>> On Sat, May 5, 2018 at 3:58 PM, Robert Bradshaw 
>> wrote:
>>
>>>
>>> I would welcome changes to
>>>
>>> https://github.com/apache/beam/blob/v2.4.0/model/pipeline/src/main/proto/beam_runner_api.proto#L730
>>> that would provide alternatives to docker (one of which comes to mind is
>>> "I
>>> already brought up a worker(s) for you (which could be the same process
>>> that handled pipeline construction in testing scenarios), here's how to
>>> connect to it/them.") Another option, which would seem to appeal to you
>>> in
>>> particular, would be "the worker code is linked into the runner's binary,
>>> use this process as the worker" (though note even for java-on-java, it
>>> can
>>> be advantageous to shield the worker and runner code from each others
>>> environments, dependencies, and version requirements.) This latter should
>>> still likely use the FnApi to talk to itself (either over GRPC on local
>>> ports, or possibly better via direct function calls eliminating the RPC
>>> overhead altogether--this is how the fast local runner in Python works).
>>> There may be runner environments well controlled enough that "start up
>>> the
>>> workers" could be specified as "run this command line." We should make
>>> this
>>> environment message extensible to other alternatives than "docker
>>> container
>>> url," though of course we don't want the set of options to grow too large
>>> or we loose the promise of portability unless every runner supports every
>>> protocol.
>>>
>>>
>> The pre-launched worker would be an interesting option, which might work
>> well for a sidecar deployment.
>>
>> The current worker boot code though makes the assumption that the runner
>> endpoint to phone home to is known when the process is launched. That
>> doesn't work so well with a runner that establishes its endpoint
>> dynamically. Also, the assumption is baked in that a worker will only serve
>> a single pipeline (provisioning API etc.).
>>
>> Thanks,
>> Thomas
>>
>>
>


Re: Graal instead of docker?

2018-05-08 Thread Henning Rohde
There are indeed lots of possibilities for interesting docker alternatives
with different tradeoffs and capabilities, but in general both the runner
and the SDK must support them for it to work. As mentioned, docker
(as used in the container contract) is meant as a flexible main option but
not necessarily the only option. I see no problem with certain
pipeline-SDK-runner combinations additionally supporting a specialized
setup. The pipeline can be a factor, because some transforms might depend
on aspects of the runtime environment -- such as system libraries or
shelling out to a /bin/foo.

The worker boot code is tied to the current container contract, so
pre-launched workers would presumably not use that code path and would not be
bound by its assumptions. In particular, such a setup might want to invert
who initiates the connection from the SDK worker to the runner. Pipeline
options and global state in the SDK and user functions process might make
it difficult to safely reuse worker processes across pipelines, but it is
still doable in certain scenarios.

Henning

On Tue, May 8, 2018 at 3:51 PM Thomas Weise  wrote:

>
>
> On Sat, May 5, 2018 at 3:58 PM, Robert Bradshaw 
> wrote:
>
>>
>> I would welcome changes to
>>
>> https://github.com/apache/beam/blob/v2.4.0/model/pipeline/src/main/proto/beam_runner_api.proto#L730
>> that would provide alternatives to docker (one of which comes to mind is
>> "I
>> already brought up a worker(s) for you (which could be the same process
>> that handled pipeline construction in testing scenarios), here's how to
>> connect to it/them.") Another option, which would seem to appeal to you in
>> particular, would be "the worker code is linked into the runner's binary,
>> use this process as the worker" (though note even for java-on-java, it can
>> be advantageous to shield the worker and runner code from each others
>> environments, dependencies, and version requirements.) This latter should
>> still likely use the FnApi to talk to itself (either over GRPC on local
>> ports, or possibly better via direct function calls eliminating the RPC
>> overhead altogether--this is how the fast local runner in Python works).
>> There may be runner environments well controlled enough that "start up the
>> workers" could be specified as "run this command line." We should make
>> this
>> environment message extensible to other alternatives than "docker
>> container
>> url," though of course we don't want the set of options to grow too large
>> or we loose the promise of portability unless every runner supports every
>> protocol.
>>
>>
> The pre-launched worker would be an interesting option, which might work
> well for a sidecar deployment.
>
> The current worker boot code though makes the assumption that the runner
> endpoint to phone home to is known when the process is launched. That
> doesn't work so well with a runner that establishes its endpoint
> dynamically. Also, the assumption is baked in that a worker will only serve
> a single pipeline (provisioning API etc.).
>
> Thanks,
> Thomas
>
>


Re: Graal instead of docker?

2018-05-08 Thread Thomas Weise
On Sat, May 5, 2018 at 3:58 PM, Robert Bradshaw  wrote:

>
> I would welcome changes to
> https://github.com/apache/beam/blob/v2.4.0/model/
> pipeline/src/main/proto/beam_runner_api.proto#L730
> that would provide alternatives to docker (one of which comes to mind is "I
> already brought up a worker(s) for you (which could be the same process
> that handled pipeline construction in testing scenarios), here's how to
> connect to it/them.") Another option, which would seem to appeal to you in
> particular, would be "the worker code is linked into the runner's binary,
> use this process as the worker" (though note even for java-on-java, it can
> be advantageous to shield the worker and runner code from each others
> environments, dependencies, and version requirements.) This latter should
> still likely use the FnApi to talk to itself (either over GRPC on local
> ports, or possibly better via direct function calls eliminating the RPC
> overhead altogether--this is how the fast local runner in Python works).
> There may be runner environments well controlled enough that "start up the
> workers" could be specified as "run this command line." We should make this
> environment message extensible to other alternatives than "docker container
> url," though of course we don't want the set of options to grow too large
> or we loose the promise of portability unless every runner supports every
> protocol.
>
>
The pre-launched worker would be an interesting option, which might work
well for a sidecar deployment.

The current worker boot code though makes the assumption that the runner
endpoint to phone home to is known when the process is launched. That
doesn't work so well with a runner that establishes its endpoint
dynamically. Also, the assumption is baked in that a worker will only serve
a single pipeline (provisioning API etc.).
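
As a rough illustration of the two assumptions described above, here is a
small Python sketch of a hypothetical pre-launched worker that learns its
runner endpoint after startup and can be attached to more than one pipeline.
The names (PrelaunchedWorker, attach, endpoint_for) and the endpoints are
invented for this example; this is not how the current worker boot code
behaves.

```python
from typing import Dict, List


class PrelaunchedWorker:
    """Hypothetical worker that starts without knowing any runner endpoint."""

    def __init__(self, worker_id: str) -> None:
        self.worker_id = worker_id
        # Maps pipeline id -> runner endpoint, filled in after launch.
        self._endpoints: Dict[str, str] = {}

    def attach(self, pipeline_id: str, runner_endpoint: str) -> None:
        """Register a runner endpoint once the runner has established it."""
        if pipeline_id in self._endpoints:
            raise ValueError(f"pipeline {pipeline_id!r} is already attached")
        self._endpoints[pipeline_id] = runner_endpoint

    def endpoint_for(self, pipeline_id: str) -> str:
        """Return the endpoint this worker would phone home to."""
        return self._endpoints[pipeline_id]

    def pipelines(self) -> List[str]:
        """List the pipelines currently served by this worker."""
        return sorted(self._endpoints)


# The worker exists first; endpoints arrive later, one per pipeline.
worker = PrelaunchedWorker("sidecar-0")
worker.attach("pipeline-a", "localhost:50051")
worker.attach("pipeline-b", "localhost:50052")
```

The point of the sketch is only that the endpoint is passed in after
construction and is keyed by pipeline, which are the two things the current
boot sequence does not allow for.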

Thanks,
Thomas


Re: Graal instead of docker?

2018-05-08 Thread Eugene Kirpichov
On Tue, May 8, 2018 at 3:52 AM Romain Manni-Bucau 
wrote:

>
>
> Le mar. 8 mai 2018 10:16, Robert Bradshaw  a écrit :
>
>> On Sun, May 6, 2018 at 1:30 AM Romain Manni-Bucau 
>> wrote:
>>
>> > Wow, this mail should be on the website Robert, thanks for it
>>
>> > I still have a point to try to understand better: my view is that once
>> submitted the only perf related point is when you hit a flow of data. So a
>> split can be slow bit it is not a that big deal. So a runner integration
>> only needs to optimize process and nextElement logics, right?
>>
>> Yes. In some streaming cases (e.g. microbatch like Spark or Dataflow)
>> there
>> may be many, many bundles, so the "control plane" part can't be /too/
>> slow,
>> but it's not as performance critical.
>>
>> > It is almost always doable to batch that - with triggers and other
>> constraints. So the portable model is elegant but not done to be "fast" in
>> current state of impl.
>>
>> Actually batching and streaming RPCs for the data plane has been there
>> from
>> the start, for these reasons.
>>
>> > So this all leads to 2 needs:
>>
>> > 1. Have some native runner for dev
>> > 2. Have some bulk api for prod
>>
>> > In all cases this is decoralated of any runner no? Can even be a beam
>> subproject built on top of beam which would be very sane and ensure a
>> clear
>> separation of concerns no?
>>
>> The thing to do here would be to extend the Environment (message) to allow
>> for alternatives, and then abstract out the creation of an bundle executor
>> such that different once could be instantiated based on this environment.
>>
>
> Agree so we need a generic runner delegating to "subrunners" (or runner
> impl) instead of impl-ing it in all runners. Sounds very sane, scalable and
> extensible/composable this way.
>
> Can we mark it as a backlog item and goal?
>
That's what java-fn-execution is doing: it's a library of various useful
things that different portable runners can utilize in case their control
code is written in Java, including e.g. interfacing with Docker or with
something else.


>
>
>> > Le 6 mai 2018 00:59, "Robert Bradshaw"  a écrit :
>>
>> >> Portability, at its core, is providing a spec for any runner to talk to
>> any
>> >> SDK. I personally think it's done a great job in cleaning up the model
>> by
>> >> forcing us to define a clean boundary (as specified at
>> >> https://github.com/apache/beam/tree/master/model ) between these two
>> >> components (even if the implementations of one or the other are
>> >> (temporarily, I hope for the most part) complicated).The pipeline (on
>> the
>> >> runner submission side) and work execution (on what has traditionally
>> been
>> >> called the fn api side) have concrete platform-independent
>> descriptions,
>> >> rather than being a set of Java classes.
>>
>> >> Currently, the portion that lives on the "runner" side of this boundary
>> is
>> >> often shared among Java runners (via libraries like runners core), but
>> it
>> >> is all still part of each runner, and because of this it removes the
>> >> requirement for the Runner be Java just like it remove the requirement
>> for
>> >> the SDK to speak Java. (For example, I think a Python Dask runner
>> makes a
>> >> lot of sense, Dataflow may decide to implement larger portions of its
>> >> runner in Go or C++ or even behind a service, and I've used the Python
>> >> ULRunner to run the Java SDK over the Fn API for testing development
>> >> purposes).
>>
>> >> There is also the question of "why docker?" I actually don't see docker
>> all
>> >> that intrinsic to the protocol; one only needs to be able to define and
>> >> bring up workers that communicate on specified ports. Docker happens to
>> be
>> >> a fairly well supported way to package up an arbitrary chunk of code
>> (in
>> >> any language), together with its nearly arbitrarily specified
>> >> dependencies/environment, in a way that's well specified and easy to
>> start
>> >> up.
>>
>> >> I would welcome changes to
>>
>>
>> https://github.com/apache/beam/blob/v2.4.0/model/pipeline/src/main/proto/beam_runner_api.proto#L730
>> >> that would provide alternatives to docker (one of which comes to mind
>> is
>> "I
>> >> already brought up a worker(s) for you (which could be the same process
>> >> that handled pipeline construction in testing scenarios), here's how to
>> >> connect to it/them.") Another option, which would seem to appeal to you
>> in
>> >> particular, would be "the worker code is linked into the runner's
>> binary,
>> >> use this process as the worker" (though note even for java-on-java, it
>> can
>> >> be advantageous to shield the worker and runner code from each others
>> >> environments, dependencies, and version requirements.) This 

Re: Graal instead of docker?

2018-05-08 Thread Jean-Baptiste Onofré
It sounds reasonable to me and makes more sense.


Regards
JB

On May 8, 2018 at 12:53, Romain Manni-Bucau wrote:
>Le mar. 8 mai 2018 10:16, Robert Bradshaw  a écrit
>:
>
>> On Sun, May 6, 2018 at 1:30 AM Romain Manni-Bucau
>
>> wrote:
>>
>> > Wow, this mail should be on the website Robert, thanks for it
>>
>> > I still have a point to try to understand better: my view is that
>once
>> submitted the only perf related point is when you hit a flow of data.
>So a
>> split can be slow bit it is not a that big deal. So a runner
>integration
>> only needs to optimize process and nextElement logics, right?
>>
>> Yes. In some streaming cases (e.g. microbatch like Spark or Dataflow)
>there
>> may be many, many bundles, so the "control plane" part can't be /too/
>slow,
>> but it's not as performance critical.
>>
>> > It is almost always doable to batch that - with triggers and other
>> constraints. So the portable model is elegant but not done to be
>"fast" in
>> current state of impl.
>>
>> Actually batching and streaming RPCs for the data plane has been
>there from
>> the start, for these reasons.
>>
>> > So this all leads to 2 needs:
>>
>> > 1. Have some native runner for dev
>> > 2. Have some bulk api for prod
>>
>> > In all cases this is decoralated of any runner no? Can even be a
>beam
>> subproject built on top of beam which would be very sane and ensure a
>clear
>> separation of concerns no?
>>
>> The thing to do here would be to extend the Environment (message) to
>allow
>> for alternatives, and then abstract out the creation of an bundle
>executor
>> such that different once could be instantiated based on this
>environment.
>>
>
>Agree so we need a generic runner delegating to "subrunners" (or runner
>impl) instead of impl-ing it in all runners. Sounds very sane, scalable
>and
>extensible/composable this way.
>
>Can we mark it as a backlog item and goal?
>
>
>
>> > Le 6 mai 2018 00:59, "Robert Bradshaw"  a
>écrit :
>>
>> >> Portability, at its core, is providing a spec for any runner to
>talk to
>> any
>> >> SDK. I personally think it's done a great job in cleaning up the
>model
>> by
>> >> forcing us to define a clean boundary (as specified at
>> >> https://github.com/apache/beam/tree/master/model ) between these
>two
>> >> components (even if the implementations of one or the other are
>> >> (temporarily, I hope for the most part) complicated).The pipeline
>(on
>> the
>> >> runner submission side) and work execution (on what has
>traditionally
>> been
>> >> called the fn api side) have concrete platform-independent
>descriptions,
>> >> rather than being a set of Java classes.
>>
>> >> Currently, the portion that lives on the "runner" side of this
>boundary
>> is
>> >> often shared among Java runners (via libraries like runners core),
>but
>> it
>> >> is all still part of each runner, and because of this it removes
>the
>> >> requirement for the Runner be Java just like it remove the
>requirement
>> for
>> >> the SDK to speak Java. (For example, I think a Python Dask runner
>makes
>> a
>> >> lot of sense, Dataflow may decide to implement larger portions of
>its
>> >> runner in Go or C++ or even behind a service, and I've used the
>Python
>> >> ULRunner to run the Java SDK over the Fn API for testing
>development
>> >> purposes).
>>
>> >> There is also the question of "why docker?" I actually don't see
>docker
>> all
>> >> that intrinsic to the protocol; one only needs to be able to
>define and
>> >> bring up workers that communicate on specified ports. Docker
>happens to
>> be
>> >> a fairly well supported way to package up an arbitrary chunk of
>code (in
>> >> any language), together with its nearly arbitrarily specified
>> >> dependencies/environment, in a way that's well specified and easy
>to
>> start
>> >> up.
>>
>> >> I would welcome changes to
>>
>>
>>
>https://github.com/apache/beam/blob/v2.4.0/model/pipeline/src/main/proto/beam_runner_api.proto#L730
>> >> that would provide alternatives to docker (one of which comes to
>mind is
>> "I
>> >> already brought up a worker(s) for you (which could be the same
>process
>> >> that handled pipeline construction in testing scenarios), here's
>how to
>> >> connect to it/them.") Another option, which would seem to appeal
>to you
>> in
>> >> particular, would be "the worker code is linked into the runner's
>> binary,
>> >> use this process as the worker" (though note even for
>java-on-java, it
>> can
>> >> be advantageous to shield the worker and runner code from each
>others
>> >> environments, dependencies, and version requirements.) This latter
>> should
>> >> still likely use the FnApi to talk to itself (either over GRPC on
>local
>> >> ports, or possibly better via direct function calls eliminating
>the RPC
>> >> overhead altogether--this is how the fast local runner in Python
>works).
>> >> There may be runner environments well controlled enough that

Re: Graal instead of docker?

2018-05-08 Thread Romain Manni-Bucau
On Tue, May 8, 2018 at 10:16, Robert Bradshaw wrote:

> On Sun, May 6, 2018 at 1:30 AM Romain Manni-Bucau 
> wrote:
>
> > Wow, this mail should be on the website Robert, thanks for it
>
> > I still have a point to try to understand better: my view is that once
> submitted the only perf related point is when you hit a flow of data. So a
> split can be slow bit it is not a that big deal. So a runner integration
> only needs to optimize process and nextElement logics, right?
>
> Yes. In some streaming cases (e.g. microbatch like Spark or Dataflow) there
> may be many, many bundles, so the "control plane" part can't be /too/ slow,
> but it's not as performance critical.
>
> > It is almost always doable to batch that - with triggers and other
> constraints. So the portable model is elegant but not done to be "fast" in
> current state of impl.
>
> Actually batching and streaming RPCs for the data plane has been there from
> the start, for these reasons.
>
> > So this all leads to 2 needs:
>
> > 1. Have some native runner for dev
> > 2. Have some bulk api for prod
>
> > In all cases this is decoralated of any runner no? Can even be a beam
> subproject built on top of beam which would be very sane and ensure a clear
> separation of concerns no?
>
> The thing to do here would be to extend the Environment (message) to allow
> for alternatives, and then abstract out the creation of an bundle executor
> such that different once could be instantiated based on this environment.
>

Agreed, so we need a generic runner delegating to "subrunners" (or runner
implementations) instead of implementing it in all runners. Sounds very sane,
scalable and extensible/composable this way.

Can we mark it as a backlog item and goal?



> > Le 6 mai 2018 00:59, "Robert Bradshaw"  a écrit :
>
> >> Portability, at its core, is providing a spec for any runner to talk to
> any
> >> SDK. I personally think it's done a great job in cleaning up the model
> by
> >> forcing us to define a clean boundary (as specified at
> >> https://github.com/apache/beam/tree/master/model ) between these two
> >> components (even if the implementations of one or the other are
> >> (temporarily, I hope for the most part) complicated).The pipeline (on
> the
> >> runner submission side) and work execution (on what has traditionally
> been
> >> called the fn api side) have concrete platform-independent descriptions,
> >> rather than being a set of Java classes.
>
> >> Currently, the portion that lives on the "runner" side of this boundary
> is
> >> often shared among Java runners (via libraries like runners core), but
> it
> >> is all still part of each runner, and because of this it removes the
> >> requirement for the Runner be Java just like it remove the requirement
> for
> >> the SDK to speak Java. (For example, I think a Python Dask runner makes
> a
> >> lot of sense, Dataflow may decide to implement larger portions of its
> >> runner in Go or C++ or even behind a service, and I've used the Python
> >> ULRunner to run the Java SDK over the Fn API for testing development
> >> purposes).
>
> >> There is also the question of "why docker?" I actually don't see docker
> all
> >> that intrinsic to the protocol; one only needs to be able to define and
> >> bring up workers that communicate on specified ports. Docker happens to
> be
> >> a fairly well supported way to package up an arbitrary chunk of code (in
> >> any language), together with its nearly arbitrarily specified
> >> dependencies/environment, in a way that's well specified and easy to
> start
> >> up.
>
> >> I would welcome changes to
>
>
> https://github.com/apache/beam/blob/v2.4.0/model/pipeline/src/main/proto/beam_runner_api.proto#L730
> >> that would provide alternatives to docker (one of which comes to mind is
> "I
> >> already brought up a worker(s) for you (which could be the same process
> >> that handled pipeline construction in testing scenarios), here's how to
> >> connect to it/them.") Another option, which would seem to appeal to you
> in
> >> particular, would be "the worker code is linked into the runner's
> binary,
> >> use this process as the worker" (though note even for java-on-java, it
> can
> >> be advantageous to shield the worker and runner code from each others
> >> environments, dependencies, and version requirements.) This latter
> should
> >> still likely use the FnApi to talk to itself (either over GRPC on local
> >> ports, or possibly better via direct function calls eliminating the RPC
> >> overhead altogether--this is how the fast local runner in Python works).
> >> There may be runner environments well controlled enough that "start up
> the
> >> workers" could be specified as "run this command line." We should make
> this
> >> environment message extensible to other alternatives than "docker
> container
> >> url," though of course we don't want the set of options to grow too
> large
> >> or we loose the promise of 

Re: Graal instead of docker?

2018-05-08 Thread Robert Bradshaw
On Sun, May 6, 2018 at 1:30 AM Romain Manni-Bucau 
wrote:

> Wow, this mail should be on the website Robert, thanks for it

> I still have a point to try to understand better: my view is that once
submitted the only perf related point is when you hit a flow of data. So a
split can be slow bit it is not a that big deal. So a runner integration
only needs to optimize process and nextElement logics, right?

Yes. In some streaming cases (e.g. microbatch like Spark or Dataflow) there
may be many, many bundles, so the "control plane" part can't be /too/ slow,
but it's not as performance critical.

> It is almost always doable to batch that - with triggers and other
constraints. So the portable model is elegant but not done to be "fast" in
current state of impl.

Actually batching and streaming RPCs for the data plane has been there from
the start, for these reasons.
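
To make the batching idea concrete, here is a minimal Python sketch that
groups elements into fixed-size batches so that fewer messages cross the
process boundary. The function name batch_elements and the batch size are
assumptions for illustration only; this is not the actual Fn API data-plane
code.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch_elements(elements: Iterable[T], max_batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most max_batch_size elements, preserving order."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    batch: List[T] = []
    for element in elements:
        batch.append(element)
        if len(batch) == max_batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield batch


# Five elements sent as three messages instead of five.
messages = list(batch_elements(range(5), max_batch_size=2))
```

With a larger batch size the per-message overhead is spread over more
elements, which is the reason batching matters on the data plane.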

> So this all leads to 2 needs:

> 1. Have some native runner for dev
> 2. Have some bulk api for prod

> In all cases this is decoralated of any runner no? Can even be a beam
subproject built on top of beam which would be very sane and ensure a clear
separation of concerns no?

The thing to do here would be to extend the Environment (message) to allow
for alternatives, and then abstract out the creation of a bundle executor
such that different ones could be instantiated based on this environment.
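
A minimal sketch of that shape in Python, assuming one type per environment
alternative and a registry that picks an executor factory based on the
environment's type. All class names here (DockerEnvironment,
PrelaunchedEnvironment, InProcessEnvironment, CommandLineEnvironment,
ExecutorRegistry) are hypothetical and only mirror the alternatives discussed
in this thread; the "executors" are plain strings describing what a runner
would do, not real Beam objects.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class DockerEnvironment:
    container_url: str


@dataclass(frozen=True)
class PrelaunchedEnvironment:
    worker_endpoints: Tuple[str, ...]


@dataclass(frozen=True)
class InProcessEnvironment:
    pass


@dataclass(frozen=True)
class CommandLineEnvironment:
    command: str


class ExecutorRegistry:
    """Maps an environment type to a factory for a (stand-in) bundle executor."""

    def __init__(self) -> None:
        self._factories: Dict[type, Callable[[object], str]] = {}

    def register(self, env_type: type, factory: Callable[[object], str]) -> None:
        self._factories[env_type] = factory

    def create(self, environment: object) -> str:
        factory = self._factories.get(type(environment))
        if factory is None:
            # A runner that does not support this alternative must say so.
            raise ValueError(
                f"unsupported environment: {type(environment).__name__}")
        return factory(environment)


# This hypothetical runner supports three of the four alternatives.
registry = ExecutorRegistry()
registry.register(
    DockerEnvironment, lambda env: f"start container {env.container_url}")
registry.register(
    PrelaunchedEnvironment,
    lambda env: "connect to " + ", ".join(env.worker_endpoints))
registry.register(
    InProcessEnvironment, lambda env: "run bundles in this process")
```

CommandLineEnvironment is deliberately left unregistered: asking this registry
for it raises an error, which is the portability cost of every extra
alternative that not every runner supports.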

> Le 6 mai 2018 00:59, "Robert Bradshaw"  a écrit :

>> Portability, at its core, is providing a spec for any runner to talk to
any
>> SDK. I personally think it's done a great job in cleaning up the model by
>> forcing us to define a clean boundary (as specified at
>> https://github.com/apache/beam/tree/master/model ) between these two
>> components (even if the implementations of one or the other are
>> (temporarily, I hope for the most part) complicated).The pipeline (on the
>> runner submission side) and work execution (on what has traditionally
been
>> called the fn api side) have concrete platform-independent descriptions,
>> rather than being a set of Java classes.

>> Currently, the portion that lives on the "runner" side of this boundary
is
>> often shared among Java runners (via libraries like runners core), but it
>> is all still part of each runner, and because of this it removes the
>> requirement for the Runner be Java just like it remove the requirement
for
>> the SDK to speak Java. (For example, I think a Python Dask runner makes a
>> lot of sense, Dataflow may decide to implement larger portions of its
>> runner in Go or C++ or even behind a service, and I've used the Python
>> ULRunner to run the Java SDK over the Fn API for testing development
>> purposes).

>> There is also the question of "why docker?" I actually don't see docker
all
>> that intrinsic to the protocol; one only needs to be able to define and
>> bring up workers that communicate on specified ports. Docker happens to
be
>> a fairly well supported way to package up an arbitrary chunk of code (in
>> any language), together with its nearly arbitrarily specified
>> dependencies/environment, in a way that's well specified and easy to
start
>> up.

>> I would welcome changes to

https://github.com/apache/beam/blob/v2.4.0/model/pipeline/src/main/proto/beam_runner_api.proto#L730
>> that would provide alternatives to docker (one of which comes to mind is
"I
>> already brought up a worker(s) for you (which could be the same process
>> that handled pipeline construction in testing scenarios), here's how to
>> connect to it/them.") Another option, which would seem to appeal to you
in
>> particular, would be "the worker code is linked into the runner's binary,
>> use this process as the worker" (though note even for java-on-java, it
can
>> be advantageous to shield the worker and runner code from each others
>> environments, dependencies, and version requirements.) This latter should
>> still likely use the FnApi to talk to itself (either over GRPC on local
>> ports, or possibly better via direct function calls eliminating the RPC
>> overhead altogether--this is how the fast local runner in Python works).
>> There may be runner environments well controlled enough that "start up
the
>> workers" could be specified as "run this command line." We should make
this
>> environment message extensible to other alternatives than "docker
container
>> url," though of course we don't want the set of options to grow too large
>> or we loose the promise of portability unless every runner supports every
>> protocol.

>> Of course, the runner is always free to execute any Fn for which it
>> completely understands the URN and the environment any way it pleases,
e.g.
>> directly in process, or even via lighter-weight mechanism like Jython or
>> Graal, rather than asking an external process to do it. But we need a
>> lowest common denominator for executing arbitrary URNs runners are not
>> expected to understand.

>> As an aside, there are also technical 

Re: Graal instead of docker?

2018-05-06 Thread Romain Manni-Bucau
Wow, this mail should be on the website Robert, thanks for it

I still have a point to try to understand better: my view is that once
submitted, the only perf-related point is when you hit a flow of data. So a
split can be slow but it is not that big a deal. So a runner integration
only needs to optimize process and nextElement logic, right?

It is almost always doable to batch that - with triggers and other
constraints. So the portable model is elegant but not designed to be "fast" in
the current state of the implementation.


So this all leads to 2 needs:

1. Have some native runner for dev
2. Have some bulk api for prod

In all cases this is decoupled from any runner, no? It can even be a beam
subproject built on top of beam, which would be very sane and ensure a clear
separation of concerns, no?

On May 6, 2018 at 00:59, "Robert Bradshaw" wrote:

> Portability, at its core, is providing a spec for any runner to talk to any
> SDK. I personally think it's done a great job in cleaning up the model by
> forcing us to define a clean boundary (as specified at
> https://github.com/apache/beam/tree/master/model ) between these two
> components (even if the implementations of one or the other are
> (temporarily, I hope for the most part) complicated).The pipeline (on the
> runner submission side) and work execution (on what has traditionally been
> called the fn api side) have concrete platform-independent descriptions,
> rather than being a set of Java classes.
>
> Currently, the portion that lives on the "runner" side of this boundary is
> often shared among Java runners (via libraries like runners core), but it
> is all still part of each runner, and because of this it removes the
> requirement for the Runner be Java just like it remove the requirement for
> the SDK to speak Java. (For example, I think a Python Dask runner makes a
> lot of sense, Dataflow may decide to implement larger portions of its
> runner in Go or C++ or even behind a service, and I've used the Python
> ULRunner to run the Java SDK over the Fn API for testing development
> purposes).
>
> There is also the question of "why docker?" I actually don't see docker all
> that intrinsic to the protocol; one only needs to be able to define and
> bring up workers that communicate on specified ports. Docker happens to be
> a fairly well supported way to package up an arbitrary chunk of code (in
> any language), together with its nearly arbitrarily specified
> dependencies/environment, in a way that's well specified and easy to start
> up.
>
> I would welcome changes to
> https://github.com/apache/beam/blob/v2.4.0/model/
> pipeline/src/main/proto/beam_runner_api.proto#L730
> that would provide alternatives to docker (one of which comes to mind is "I
> already brought up a worker(s) for you (which could be the same process
> that handled pipeline construction in testing scenarios), here's how to
> connect to it/them.") Another option, which would seem to appeal to you in
> particular, would be "the worker code is linked into the runner's binary,
> use this process as the worker" (though note even for java-on-java, it can
> be advantageous to shield the worker and runner code from each others
> environments, dependencies, and version requirements.) This latter should
> still likely use the FnApi to talk to itself (either over GRPC on local
> ports, or possibly better via direct function calls eliminating the RPC
> overhead altogether--this is how the fast local runner in Python works).
> There may be runner environments well controlled enough that "start up the
> workers" could be specified as "run this command line." We should make this
> environment message extensible to other alternatives than "docker container
> url," though of course we don't want the set of options to grow too large
> or we loose the promise of portability unless every runner supports every
> protocol.
>
> Of course, the runner is always free to execute any Fn for which it
> completely understands the URN and the environment any way it pleases, e.g.
> directly in process, or even via lighter-weight mechanism like Jython or
> Graal, rather than asking an external process to do it. But we need a
> lowest common denominator for executing arbitrary URNs runners are not
> expected to understand.
>
> As an aside, there are also technical limitations in implementing Portability
> by simply requiring all runners to be Java and the portable layer simply
> being wrappers of UserFnInLangaugeX in an equivalent UserFnObjectInJava,
> executing everything as if it were pure Java. In particular the overheads
> of unnecessarily crossing the language boundaries many times in a single
> fused graph are often prohibitive.
>
> Sorry for the long email, but hopefully this helps shed some light on (at
> least how I see) the portability effort (at the core of the Beam vision
> statement) as well as concrete actions we can take to decouple it from
> specific technologies.
>
> - Robert
>
>
> On Sat, May 5, 

Re: Graal instead of docker?

2018-05-05 Thread Robert Bradshaw
Portability, at its core, is providing a spec for any runner to talk to any
SDK. I personally think it's done a great job in cleaning up the model by
forcing us to define a clean boundary (as specified at
https://github.com/apache/beam/tree/master/model ) between these two
components (even if the implementations of one or the other are
(temporarily, I hope for the most part) complicated). The pipeline (on the
runner submission side) and work execution (on what has traditionally been
called the fn api side) have concrete platform-independent descriptions,
rather than being a set of Java classes.

Currently, the portion that lives on the "runner" side of this boundary is
often shared among Java runners (via libraries like runners core), but it
is all still part of each runner, and because of this it removes the
requirement for the runner to be Java, just as it removes the requirement for
the SDK to speak Java. (For example, I think a Python Dask runner makes a
lot of sense, Dataflow may decide to implement larger portions of its
runner in Go or C++ or even behind a service, and I've used the Python
ULRunner to run the Java SDK over the Fn API for testing and development
purposes).

There is also the question of "why docker?" I actually don't see docker as all
that intrinsic to the protocol; one only needs to be able to define and
bring up workers that communicate on specified ports. Docker happens to be
a fairly well supported way to package up an arbitrary chunk of code (in
any language), together with its nearly arbitrarily specified
dependencies/environment, in a way that's well specified and easy to start
up.

I would welcome changes to
https://github.com/apache/beam/blob/v2.4.0/model/pipeline/src/main/proto/beam_runner_api.proto#L730
that would provide alternatives to docker (one of which comes to mind is "I
already brought up a worker(s) for you (which could be the same process
that handled pipeline construction in testing scenarios), here's how to
connect to it/them.") Another option, which would seem to appeal to you in
particular, would be "the worker code is linked into the runner's binary,
use this process as the worker" (though note even for java-on-java, it can
be advantageous to shield the worker and runner code from each other's
environments, dependencies, and version requirements.) This latter should
still likely use the FnApi to talk to itself (either over GRPC on local
ports, or possibly better via direct function calls eliminating the RPC
overhead altogether--this is how the fast local runner in Python works).
There may be runner environments well controlled enough that "start up the
workers" could be specified as "run this command line." We should make this
environment message extensible to other alternatives than "docker container
url," though of course we don't want the set of options to grow too large
or we lose the promise of portability unless every runner supports every
protocol.
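
As a rough illustration of the extensible environment message described above,
here is a minimal Python sketch. Every name in it (DockerEnv, ExternalEnv,
InProcessEnv, ProcessEnv, check_supported) is hypothetical and is not taken
from beam_runner_api.proto; it only shows how a small, closed set of
alternatives lets a runner state up front which ones it can honour.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class DockerEnv:
    """Start workers from a container image (the existing option)."""
    container_url: str


@dataclass(frozen=True)
class ExternalEnv:
    """Workers were already brought up; the runner only connects to them."""
    endpoints: Tuple[str, ...]


@dataclass(frozen=True)
class InProcessEnv:
    """Worker code is linked into the runner; use this process as the worker."""


@dataclass(frozen=True)
class ProcessEnv:
    """Start workers by running a command line."""
    command: Tuple[str, ...]


# Hypothetical closed set of alternatives; keeping it small is the point.
Environment = Union[DockerEnv, ExternalEnv, InProcessEnv, ProcessEnv]


def check_supported(env: Environment, supported: Tuple[type, ...]) -> None:
    """Reject a pipeline early if the runner cannot bring up its workers."""
    if not isinstance(env, supported):
        raise ValueError(f"unsupported environment: {type(env).__name__}")


# A hypothetical runner that supports only two of the four alternatives.
RUNNER_SUPPORTS = (DockerEnv, ExternalEnv)
check_supported(ExternalEnv(endpoints=("localhost:50000",)), RUNNER_SUPPORTS)
```

The trade-off mentioned above shows up directly: each alternative added to the
set is one more case that every runner must either implement or reject.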

Of course, the runner is always free to execute any Fn for which it
completely understands the URN and the environment any way it pleases, e.g.
directly in process, or even via a lighter-weight mechanism like Jython or
Graal, rather than asking an external process to do it. But we need a
lowest common denominator for executing arbitrary URNs runners are not
expected to understand.
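
A minimal sketch of that dispatch, assuming a runner that keeps a table of the
URNs it understands. The URN string, the table, and run_externally are all
made up for illustration, and the sketch checks only the URN, not the
environment:

```python
from typing import Callable, Dict, Iterable, List

# URNs this hypothetical runner knows how to execute by itself.
IN_PROCESS_FNS: Dict[str, Callable[[int], int]] = {
    "example:fn:double:v1": lambda x: 2 * x,
}


def run_externally(urn: str, elements: Iterable[int]) -> List[int]:
    """Stand-in for handing the work to an external SDK worker."""
    raise NotImplementedError(f"would delegate {urn} to an external worker")


def execute(urn: str, elements: Iterable[int]) -> List[int]:
    """Run in process when the URN is understood, otherwise delegate."""
    fn = IN_PROCESS_FNS.get(urn)
    if fn is not None:
        return [fn(e) for e in elements]
    # Lowest common denominator: the runner does not need to understand it.
    return run_externally(urn, elements)
```

The in-process path is purely an optimization; the external path is what keeps
arbitrary URNs runnable.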

As an aside, there are also technical limitations in implementing Portability
by simply requiring all runners to be Java and the portable layer simply
being wrappers of UserFnInLangaugeX in an equivalent UserFnObjectInJava,
executing everything as if it were pure Java. In particular the overheads
of unnecessarily crossing the language boundaries many times in a single
fused graph are often prohibitive.
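
To make the overhead argument concrete, here is a toy model in Python. It is
not Beam code: Boundary is a hypothetical counter standing in for a
cross-language call, and the two functions compare wrapping each user function
separately with handing a whole fused stage across the boundary once per
bundle.

```python
from typing import Callable, List, Sequence

Fn = Callable[[int], int]


class Boundary:
    """Counts calls that cross from the runner into the SDK language."""

    def __init__(self) -> None:
        self.crossings = 0

    def call(self, fn: Callable, arg):
        self.crossings += 1
        return fn(arg)


def run_wrapped(fns: Sequence[Fn], elements: List[int], b: Boundary) -> List[int]:
    """Each user fn is wrapped separately: one crossing per fn per element."""
    out = []
    for e in elements:
        for fn in fns:
            e = b.call(fn, e)
        out.append(e)
    return out


def run_fused(fns: Sequence[Fn], elements: List[int], b: Boundary) -> List[int]:
    """The whole fused stage runs on the SDK side: one crossing per bundle."""
    def stage(bundle: List[int]) -> List[int]:
        result = []
        for e in bundle:
            for fn in fns:
                e = fn(e)
            result.append(e)
        return result
    return b.call(stage, elements)


fns = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
elements = list(range(1000))
wrapped, fused = Boundary(), Boundary()
assert run_wrapped(fns, elements, wrapped) == run_fused(fns, elements, fused)
print(wrapped.crossings, fused.crossings)  # 3000 1
```

Real crossings also pay for serialization, which this counter ignores, so the
gap in practice depends on the element encoding as well as the call count.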

Sorry for the long email, but hopefully this helps shed some light on (at
least how I see) the portability effort (at the core of the Beam vision
statement) as well as concrete actions we can take to decouple it from
specific technologies.

- Robert


On Sat, May 5, 2018 at 2:06 PM Romain Manni-Bucau 
wrote:

> All are good points.
>
> The only "?" I keep is: why doesn't Beam use its visitor API to make
> portability cross-cutting for all runners by "mutating" the user model before
> translation? Technically it sounds easy and avoids hacking every
> implementation. Was it tested, and did it fail?
>
> On May 5, 2018 at 22:50, "Thomas Weise" wrote:
>
>> Docker isn't a silver bullet and may not be the best choice for all
>> environments (I'm also looking at potentially launching SDK workers in a
>> different way), but AFAIK there has not been any alternative proposal for
>> default SDK execution that can handle all of Python, Go and Java.
>>
>> Regardless of the default implementation, we should strive to keep the
>> implementation modular so users can plug in their own replacement as
>> needed. Looking at the prototype implementation, Docker comes downstream of
>> FlinkExecutableStageFunction, and it will be possible to supply a custom
>> implementation by

Re: Graal instead of docker?

2018-05-05 Thread Eugene Kirpichov
Not sure what you mean? Can you point to a piece of code in Beam that
you're currently characterizing as "hacking" and suggest how it could be
refactored?

On Sat, May 5, 2018 at 2:06 PM Romain Manni-Bucau 
wrote:

> All are good points.
>
> The only "?" I keep is: why doesn't Beam use its visitor API to make
> portability cross-cutting for all runners by "mutating" the user model before
> translation? Technically it sounds easy and avoids hacking every
> implementation. Was it tested, and did it fail?
>
> On May 5, 2018 at 22:50, "Thomas Weise" wrote:
>
>> Docker isn't a silver bullet and may not be the best choice for all
>> environments (I'm also looking at potentially launching SDK workers in a
>> different way), but AFAIK there has not been any alternative proposal for
>> default SDK execution that can handle all of Python, Go and Java.
>>
>> Regardless of the default implementation, we should strive to keep the
>> implementation modular so users can plug in their own replacement as
>> needed. Looking at the prototype implementation, Docker comes downstream of
>> FlinkExecutableStageFunction, and it will be possible to supply a custom
>> implementation by making the translator pluggable (which I intend to work
>> on once backporting to master is complete), and possibly
>> "SDKHarnessManager" itself can also be swapped out.
>>
>> I would also prefer that for Flink and other Java based runners we retain
>> the option to inline executable stages that are in Java. I would expect a
>> good number of use cases to benefit from direct execution in the task
>> manager, and it may be good to offer the user that optimization.
>>
>> Thanks,
>> Thomas
>>
>>
>>
>> On Sat, May 5, 2018 at 12:54 PM, Eugene Kirpichov 
>> wrote:
>>
>>> To add on that: Romain, if you are really excited about Graal as a
>>> project, here are some constructive suggestions as to what you can do on a
>>> reasonably short timeframe:
>>> - Propose/prototype a design for writing UDFs in Beam SQL using Graal
>>> - Go through the portability-related design documents, come up with a
>>> more precise assessment of what parts are actually dependent on Docker's
>>> container format and/or on Docker itself, and propose a plan for untangling
>>> this dependency and opening the door to other mechanisms of cross-language
>>> execution
>>>
>>> On Sat, May 5, 2018 at 12:50 PM Eugene Kirpichov 
>>> wrote:
>>>
>>>> Graal is a very young project, currently nowhere near the level of
>>>> maturity or completeness as to be sufficient for Beam to fully bet its
>>>> portability vision on it:
>>>> - Graal currently only claims to support Java and Javascript, with Ruby
>>>> and R in the status of "some applications may run", Python support "just
>>>> beginning", and Go lacking altogether.
>>>> - Regarding existing production usage, the Graal FAQ says it is "a
>>>> project with new innovative technology in its early stages."
>>>>
>>>> That said, as Graal matures, I think it would be reasonable to keep an
>>>> eye on it as a potential future lightweight alternative to containers for
>>>> pipelines where Graal's level of support is sufficient for this particular
>>>> pipeline.
>>>>
>>>> Please also keep in mind that execution of user code is only a small
>>>> part of the overall portability picture, and dependency on Docker is an
>>>> even smaller part of that (there is only 1 mention of the word "Docker" in
>>>> all of Beam's portability protos, and the mention is in an out-of-date TODO
>>>> comment). I hope this addresses your concerns.
>>>>
>>>> On Sat, May 5, 2018 at 11:49 AM Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>
>>>>> Agree
>>>>>
>>>>> The jvm is still mainstream for big data and it is trivial to have a
>>>>> remote facade to support natives but no point to have it in runners, it is
>>>>> some particular transforms or even dofn and sources only...
>>>>>
>>>>>
>>>>> On May 5, 2018 at 19:03, "Andrew Pilloud" wrote:
>>>>>
>>>>>> Thanks for the examples earlier, I think Hazelcast is a great example
>>>>>> of something portability might make more difficult. I'm not working on
>>>>>> portability, but my understanding is that the data sent to the runner is a
>>>>>> blob of code and the name of the container to run it in. A runner with a
>>>>>> native language (java on Hazelcast for example) could run the code directly
>>>>>> without the container if it is in a language it supports. So when Hazelcast
>>>>>> sees a known java container specified, it just loads the java blob and runs
>>>>>> it. When it sees another container it rejects the pipeline. You could use
>>>>>> Graal in the Hazelcast runner to do this for a number of languages. I would
>>>>>> expect that this could also be done in the direct runner, which similarly
>>>>>> provides a native java environment, so portable Java pipelines can be
>>>>>> tested

Re: Graal instead of docker?

2018-05-05 Thread Romain Manni-Bucau
All are good points.

The only "?" I keep is: why doesn't Beam use its visitor API to make
portability cross-cutting for all runners by "mutating" the user model before
translation? Technically it sounds easy and avoids hacking every
implementation. Was it tested, and did it fail?
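
One way to read this question is as a graph rewrite applied before any runner
sees the pipeline. The Python sketch below is only an interpretation under
that assumption: Transform, visit, and wrap_foreign are hypothetical and do
not correspond to Beam's actual visitor classes.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Transform:
    name: str
    language: str
    children: List["Transform"] = field(default_factory=list)


def visit(node: Transform, rewrite: Callable[[Transform], Transform]) -> Transform:
    """Post-order traversal that returns a rewritten copy of the tree."""
    new_children = [visit(c, rewrite) for c in node.children]
    return rewrite(Transform(node.name, node.language, new_children))


def wrap_foreign(node: Transform) -> Transform:
    """Replace a non-Java leaf with a Java wrapper that would call out to it."""
    if node.language != "java" and not node.children:
        return Transform(f"RemoteFacade({node.name})", "java")
    return node


pipeline = Transform("root", "java", [
    Transform("Read", "java"),
    Transform("PyMap", "python"),
])
rewritten = visit(pipeline, wrap_foreign)
print([c.name for c in rewritten.children])  # ['Read', 'RemoteFacade(PyMap)']
```

After such a rewrite the runner would only ever translate Java transforms; the
sketch says nothing about what each wrapper costs at execution time.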

On May 5, 2018 at 22:50, "Thomas Weise" wrote:

> Docker isn't a silver bullet and may not be the best choice for all
> environments (I'm also looking at potentially launching SDK workers in a
> different way), but AFAIK there has not been any alternative proposal for
> default SDK execution that can handle all of Python, Go and Java.
>
> Regardless of the default implementation, we should strive to keep the
> implementation modular so users can plug in their own replacement as
> needed. Looking at the prototype implementation, Docker comes downstream of
> FlinkExecutableStageFunction, and it will be possible to supply a custom
> implementation by making the translator pluggable (which I intend to work
> on once backporting to master is complete), and possibly
> "SDKHarnessManager" itself can also be swapped out.
>
> I would also prefer that for Flink and other Java based runners we retain
> the option to inline executable stages that are in Java. I would expect a
> good number of use cases to benefit from direct execution in the task
> manager, and it may be good to offer the user that optimization.
>
> Thanks,
> Thomas
>
>
>
> On Sat, May 5, 2018 at 12:54 PM, Eugene Kirpichov 
> wrote:
>
>> To add on that: Romain, if you are really excited about Graal as a
>> project, here are some constructive suggestions as to what you can do on a
>> reasonably short timeframe:
>> - Propose/prototype a design for writing UDFs in Beam SQL using Graal
>> - Go through the portability-related design documents, come up with a
>> more precise assessment of what parts are actually dependent on Docker's
>> container format and/or on Docker itself, and propose a plan for untangling
>> this dependency and opening the door to other mechanisms of cross-language
>> execution
>>
>> On Sat, May 5, 2018 at 12:50 PM Eugene Kirpichov 
>> wrote:
>>
>>> Graal is a very young project, currently nowhere near the level of
>>> maturity or completeness as to be sufficient for Beam to fully bet its
>>> portability vision on it:
>>> - Graal currently only claims to support Java and Javascript, with Ruby
>>> and R in the status of "some applications may run", Python support "just
>>> beginning", and Go lacking altogether.
>>> - Regarding existing production usage, the Graal FAQ says it is "a
>>> project with new innovative technology in its early stages."
>>>
>>> That said, as Graal matures, I think it would be reasonable to keep an
>>> eye on it as a potential future lightweight alternative to containers for
>>> pipelines where Graal's level of support is sufficient for this particular
>>> pipeline.
>>>
>>> Please also keep in mind that execution of user code is only a small
>>> part of the overall portability picture, and dependency on Docker is an
>>> even smaller part of that (there is only 1 mention of the word "Docker" in
>>> all of Beam's portability protos, and the mention is in an out-of-date TODO
>>> comment). I hope this addresses your concerns.
>>>
>>> On Sat, May 5, 2018 at 11:49 AM Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
>>>> Agree
>>>>
>>>> The jvm is still mainstream for big data and it is trivial to have a
>>>> remote facade to support natives but no point to have it in runners, it is
>>>> some particular transforms or even dofn and sources only...
>>>>
>>>>
>>>> On May 5, 2018 at 19:03, "Andrew Pilloud" wrote:
>>>>
>>>>> Thanks for the examples earlier, I think Hazelcast is a great example
>>>>> of something portability might make more difficult. I'm not working on
>>>>> portability, but my understanding is that the data sent to the runner is a
>>>>> blob of code and the name of the container to run it in. A runner with a
>>>>> native language (java on Hazelcast for example) could run the code directly
>>>>> without the container if it is in a language it supports. So when Hazelcast
>>>>> sees a known java container specified, it just loads the java blob and runs
>>>>> it. When it sees another container it rejects the pipeline. You could use
>>>>> Graal in the Hazelcast runner to do this for a number of languages. I would
>>>>> expect that this could also be done in the direct runner, which similarly
>>>>> provides a native java environment, so portable Java pipelines can be
>>>>> tested without docker?
>>>>>
>>>>> For another way to frame this: if Beam was originally written in Go,
>>>>> we would be having a different discussion. A pipeline written entirely in
>>>>> java wouldn't be possible, so instead to enable Hazelcast, we would have to
>>>>> be able to run the java from portability without running the container.
>>>>>
>>>>> Andrew

Re: Graal instead of docker?

2018-05-05 Thread Thomas Weise
Docker isn't a silver bullet and may not be the best choice for all
environments (I'm also looking at potentially launching SDK workers in a
different way), but AFAIK there has not been any alternative proposal for
default SDK execution that can handle all of Python, Go and Java.

Regardless of the default implementation, we should strive to keep the
implementation modular so users can plug in their own replacement as
needed. Looking at the prototype implementation, Docker comes downstream of
FlinkExecutableStageFunction, and it will be possible to supply a custom
implementation by making the translator pluggable (which I intend to work
on once backporting to master is complete), and possibly
"SDKHarnessManager" itself can also be swapped out.

I would also prefer that for Flink and other Java based runners we retain
the option to inline executable stages that are in Java. I would expect a
good number of use cases to benefit from direct execution in the task
manager, and it may be good to offer the user that optimization.
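
A minimal sketch of the plug-in idea, assuming a simple name-to-launcher
registry. The registry, the launcher names, and the string "handles" are all
hypothetical; this is not how FlinkExecutableStageFunction or
SDKHarnessManager are implemented.

```python
from typing import Callable, Dict

# A launcher takes a stage description and returns a handle (here, a string).
Launcher = Callable[[str], str]

_LAUNCHERS: Dict[str, Launcher] = {}


def register_launcher(name: str, launcher: Launcher) -> None:
    """Let users plug in their own replacement under a chosen name."""
    _LAUNCHERS[name] = launcher


def launch(stage: str, launcher_name: str = "docker") -> str:
    """Launch a stage with the named launcher; Docker stays the default."""
    try:
        launcher = _LAUNCHERS[launcher_name]
    except KeyError:
        raise ValueError(f"no launcher registered as {launcher_name!r}") from None
    return launcher(stage)


# Default implementation plus an inlined option for Java stages.
register_launcher("docker", lambda stage: f"docker:{stage}")
register_launcher("inline-java", lambda stage: f"in-task-manager:{stage}")
```

With this shape, `launch("stage-1")` uses the default and
`launch("stage-1", "inline-java")` opts into direct execution, without the
caller knowing how either is implemented.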

Thanks,
Thomas



On Sat, May 5, 2018 at 12:54 PM, Eugene Kirpichov 
wrote:

> To add on that: Romain, if you are really excited about Graal as a
> project, here are some constructive suggestions as to what you can do on a
> reasonably short timeframe:
> - Propose/prototype a design for writing UDFs in Beam SQL using Graal
> - Go through the portability-related design documents, come up with a more
> precise assessment of what parts are actually dependent on Docker's
> container format and/or on Docker itself, and propose a plan for untangling
> this dependency and opening the door to other mechanisms of cross-language
> execution
>
> On Sat, May 5, 2018 at 12:50 PM Eugene Kirpichov 
> wrote:
>
>> Graal is a very young project, currently nowhere near the level of
>> maturity or completeness as to be sufficient for Beam to fully bet its
>> portability vision on it:
>> - Graal currently only claims to support Java and Javascript, with Ruby
>> and R in the status of "some applications may run", Python support "just
>> beginning", and Go lacking altogether.
>> - Regarding existing production usage, the Graal FAQ says it is "a
>> project with new innovative technology in its early stages."
>>
>> That said, as Graal matures, I think it would be reasonable to keep an
>> eye on it as a potential future lightweight alternative to containers for
>> pipelines where Graal's level of support is sufficient for this particular
>> pipeline.
>>
>> Please also keep in mind that execution of user code is only a small part
>> of the overall portability picture, and dependency on Docker is an even
>> smaller part of that (there is only 1 mention of the word "Docker" in all
>> of Beam's portability protos, and the mention is in an out-of-date TODO
>> comment). I hope this addresses your concerns.
>>
>> On Sat, May 5, 2018 at 11:49 AM Romain Manni-Bucau 
>> wrote:
>>
>>> Agree
>>>
>>> The jvm is still mainstream for big data and it is trivial to have a
>>> remote facade to support natives but no point to have it in runners, it is
>>> some particular transforms or even dofn and sources only...
>>>
>>>
>>> On May 5, 2018 at 19:03, "Andrew Pilloud" wrote:
>>>
>>>> Thanks for the examples earlier, I think Hazelcast is a great example
>>>> of something portability might make more difficult. I'm not working on
>>>> portability, but my understanding is that the data sent to the runner is a
>>>> blob of code and the name of the container to run it in. A runner with a
>>>> native language (java on Hazelcast for example) could run the code directly
>>>> without the container if it is in a language it supports. So when Hazelcast
>>>> sees a known java container specified, it just loads the java blob and runs
>>>> it. When it sees another container it rejects the pipeline. You could use
>>>> Graal in the Hazelcast runner to do this for a number of languages. I would
>>>> expect that this could also be done in the direct runner, which similarly
>>>> provides a native java environment, so portable Java pipelines can be
>>>> tested without docker?
>>>>
>>>> For another way to frame this: if Beam was originally written in Go, we
>>>> would be having a different discussion. A pipeline written entirely in java
>>>> wouldn't be possible, so instead to enable Hazelcast, we would have to be
>>>> able to run the java from portability without running the container.
>>>>
>>>> Andrew
>>>>
>>>> On Sat, May 5, 2018 at 1:48 AM Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> 2018-05-05 9:27 GMT+02:00 Ismaël Mejía :
>>>>>
>>>>>> Graal would not be a viable solution for the reasons Henning and Andrew
>>>>>> mentioned, or put in other words, when users choose a programming
>>>>>> language they don’t choose only a ‘friendly’ syntax or programming model,
>>>>>> they choose also the ecosystem

Re: Graal instead of docker?

2018-05-05 Thread Eugene Kirpichov
To add to that: Romain, if you are really excited about Graal as a project,
here are some constructive suggestions as to what you can do within a
reasonably short timeframe:
- Propose/prototype a design for writing UDFs in Beam SQL using Graal
- Go through the portability-related design documents, come up with a more
precise assessment of what parts are actually dependent on Docker's
container format and/or on Docker itself, and propose a plan for untangling
this dependency and opening the door to other mechanisms of cross-language
execution

On Sat, May 5, 2018 at 12:50 PM Eugene Kirpichov 
wrote:

> Graal is a very young project, currently nowhere near the level of
> maturity or completeness as to be sufficient for Beam to fully bet its
> portability vision on it:
> - Graal currently only claims to support Java and Javascript, with Ruby
> and R in the status of "some applications may run", Python support "just
> beginning", and Go lacking altogether.
> - Regarding existing production usage, the Graal FAQ says it is "a project
> with new innovative technology in its early stages."
>
> That said, as Graal matures, I think it would be reasonable to keep an eye
> on it as a potential future lightweight alternative to containers for
> pipelines where Graal's level of support is sufficient for this particular
> pipeline.
>
> Please also keep in mind that execution of user code is only a small part
> of the overall portability picture, and dependency on Docker is an even
> smaller part of that (there is only 1 mention of the word "Docker" in all
> of Beam's portability protos, and the mention is in an out-of-date TODO
> comment). I hope this addresses your concerns.
>
> On Sat, May 5, 2018 at 11:49 AM Romain Manni-Bucau 
> wrote:
>
>> Agree
>>
>> The jvm is still mainstream for big data and it is trivial to have a
>> remote facade to support natives but no point to have it in runners, it is
>> some particular transforms or even dofn and sources only...
>>
>>
>> On May 5, 2018 at 19:03, "Andrew Pilloud" wrote:
>>
>>> Thanks for the examples earlier, I think Hazelcast is a great example
>>> of something portability might make more difficult. I'm not working on
>>> portability, but my understanding is that the data sent to the runner is a
>>> blob of code and the name of the container to run it in. A runner with a
>>> native language (java on Hazelcast for example) could run the code directly
>>> without the container if it is in a language it supports. So when Hazelcast
>>> sees a known java container specified, it just loads the java blob and runs
>>> it. When it sees another container it rejects the pipeline. You could use
>>> Graal in the Hazelcast runner to do this for a number of languages. I would
>>> expect that this could also be done in the direct runner, which similarly
>>> provides a native java environment, so portable Java pipelines can be
>>> tested without docker?
>>>
>>> For another way to frame this: if Beam was originally written in Go, we
>>> would be having a different discussion. A pipeline written entirely in java
>>> wouldn't be possible, so instead to enable Hazelcast, we would have to be
>>> able to run the java from portability without running the container.
>>>
>>> Andrew
>>>
>>> On Sat, May 5, 2018 at 1:48 AM Romain Manni-Bucau 
>>> wrote:
>>>


>>>> 2018-05-05 9:27 GMT+02:00 Ismaël Mejía :
>>>>
>>>>> Graal would not be a viable solution for the reasons Henning and Andrew
>>>>> mentioned, or put in other words, when users choose a programming
>>>>> language they don’t choose only a ‘friendly’ syntax or programming model,
>>>>> they choose also the ecosystem that comes with it, and the libraries that
>>>>> make their life easier. However isolating these user
>>>>> libraries/dependencies is a hard problem and so far the standard solution
>>>>> to this problem is to use operating systems containers via docker.
>>>>>
>>>>
>>>> Graal solves that, Ismaël. It is the same kind of experience as running npm
>>>> libs on Nashorn, but with a more unified API to run software in any language.
>>>>
>>>>
>>>>>
>>>>> The Beam vision from day zero is to run pipelines written in multiple
>>>>> languages in runners in multiple systems, and so far we are not doing
>>>>> this in particular in the Apache runners. The portability work is the
>>>>> cleanest way to achieve this vision given the constraints.
>>>>>
>>>>
>>>> Hmm, did I read it wrong and we don't have specific integration of the
>>>> portable API in runners? This is what is messing up the runners and
>>>> limiting beam adoption on existing runners.
>>>> Portable API is a feature buildable on top of runner, not in runners.
>>>> Same as a runner implementing the 5-6 primitives can run anything, the
>>>> portable API should just rely on that and not require more integration.
>>>> It doesn't prevent more deep integrations as

Re: Graal instead of docker?

2018-05-05 Thread Eugene Kirpichov
Graal is a very young project, currently nowhere near the level of maturity
or completeness that Beam would need in order to fully bet its portability
vision on it:
- Graal currently only claims to support Java and Javascript, with Ruby and
R in the status of "some applications may run", Python support "just
beginning", and Go lacking altogether.
- Regarding existing production usage, the Graal FAQ says it is "a project
with new innovative technology in its early stages."

That said, as Graal matures, I think it would be reasonable to keep an eye
on it as a potential future lightweight alternative to containers for
pipelines where Graal's level of support is sufficient.

Please also keep in mind that execution of user code is only a small part
of the overall portability picture, and dependency on Docker is an even
smaller part of that (there is only 1 mention of the word "Docker" in all
of Beam's portability protos, and the mention is in an out-of-date TODO
comment). I hope this addresses your concerns.

On Sat, May 5, 2018 at 11:49 AM Romain Manni-Bucau 
wrote:

> Agree
>
> The jvm is still mainstream for big data and it is trivial to have a
> remote facade to support natives but no point to have it in runners, it is
> some particular transforms or even dofn and sources only...
>
>
> On May 5, 2018 at 19:03, "Andrew Pilloud" wrote:
>
>> Thanks for the examples earlier, I think Hazelcast is a great example of
>> something portability might make more difficult. I'm not working on
>> portability, but my understanding is that the data sent to the runner is a
>> blob of code and the name of the container to run it in. A runner with a
>> native language (java on Hazelcast for example) could run the code directly
>> without the container if it is in a language it supports. So when Hazelcast
>> sees a known java container specified, it just loads the java blob and runs
>> it. When it sees another container it rejects the pipeline. You could use
>> Graal in the Hazelcast runner to do this for a number of languages. I would
>> expect that this could also be done in the direct runner, which similarly
>> provides a native java environment, so portable Java pipelines can be
>> tested without docker?
>>
>> For another way to frame this: if Beam was originally written in Go, we
>> would be having a different discussion. A pipeline written entirely in java
>> wouldn't be possible, so instead to enable Hazelcast, we would have to be
>> able to run the java from portability without running the container.
>>
>> Andrew
>>
>> On Sat, May 5, 2018 at 1:48 AM Romain Manni-Bucau 
>> wrote:
>>
>>>
>>>
>>> 2018-05-05 9:27 GMT+02:00 Ismaël Mejía :
>>>
>>>> Graal would not be a viable solution for the reasons Henning and Andrew
>>>> mentioned, or put in other words, when users choose a programming
>>>> language they don’t choose only a ‘friendly’ syntax or programming model,
>>>> they choose also the ecosystem that comes with it, and the libraries that
>>>> make their life easier. However isolating these user libraries/dependencies
>>>> is a hard problem and so far the standard solution to this problem is to
>>>> use operating systems containers via docker.

>>>
>>> Graal solves that, Ismaël. It is the same kind of experience as running npm
>>> libs on Nashorn, but with a more unified API to run software in any language.
>>>
>>>

>>>> The Beam vision from day zero is to run pipelines written in multiple
>>>> languages in runners in multiple systems, and so far we are not doing
>>>> this in particular in the Apache runners. The portability work is the
>>>> cleanest way to achieve this vision given the constraints.

>>>
>>> Hmm, did I read it wrong and we don't have specific integration of the
>>> portable API in runners? This is what is messing up the runners and
>>> limiting beam adoption on existing runners.
>>> Portable API is a feature buildable on top of runner, not in runners.
>>> Same as a runner implementing the 5-6 primitives can run anything, the
>>> portable API should just rely on that and not require more integration.
>>> It doesn't prevent more deep integrations as for some higher level
>>> primitives existing in runners but it is not the case today for runners so
>>> shouldn't exist IMHO.
>>>
>>>

>>>> I agree however that for the Java SDK to Java runner case this can
>>>> represent additional pain, docker ideally should not be a requirement for
>>>> Java users with the Direct runner and debugging a pipeline should be as
>>>> easy as it is today. I think the Universal Local Runner exists to cover
>>>> the Portable case, but after looking at this JIRA I am not sure if
>>>> unification is coming (and by consequence if docker would be mandatory).
>>>> https://issues.apache.org/jira/browse/BEAM-4239
>>>>
>>>> I suppose for the distributed runners that they must implement the full
Re: Graal instead of docker?

2018-05-05 Thread Romain Manni-Bucau
Agreed.

The JVM is still mainstream for big data, and it is trivial to have a remote
facade to support native code, but there is no point in having it in runners:
it concerns only some particular transforms, or even just DoFns and sources...


On May 5, 2018 at 19:03, "Andrew Pilloud" wrote:

> Thanks for the examples earlier, I think Hazelcast is a great example of
> something portability might make more difficult. I'm not working on
> portability, but my understanding is that the data sent to the runner is a
> blob of code and the name of the container to run it in. A runner with a
> native language (java on Hazelcast for example) could run the code directly
> without the container if it is in a language it supports. So when Hazelcast
> sees a known java container specified, it just loads the java blob and runs
> it. When it sees another container it rejects the pipeline. You could use
> Graal in the Hazelcast runner to do this for a number of languages. I would
> expect that this could also be done in the direct runner, which similarly
> provides a native java environment, so portable Java pipelines can be
> tested without docker?
>
> For another way to frame this: if Beam was originally written in Go, we
> would be having a different discussion. A pipeline written entirely in java
> wouldn't be possible, so instead to enable Hazelcast, we would have to be
> able to run the java from portability without running the container.
>
> Andrew
>
> On Sat, May 5, 2018 at 1:48 AM Romain Manni-Bucau 
> wrote:
>
>>
>>
>> 2018-05-05 9:27 GMT+02:00 Ismaël Mejía :
>>
>>> Graal would not be a viable solution for the reasons Henning and Andrew
>>> mentioned, or put in other words, when users choose a programming
>>> language
>>> they don’t choose only a ‘friendly’ syntax or programming model, they
>>> choose also the ecosystem that comes with it, and the libraries that make
>>> their life easier. However isolating these user libraries/dependencies
>>> is a
>>> hard problem and so far the standard solution to this problem is to use
>>> operating systems containers via docker.
>>>
>>
>> Graal solves that, Ismaël. It is the same kind of experience as running npm
>> libs on Nashorn, but with a more unified API to run software in any language.
>>
>>
>>>
>>> The Beam vision from day zero is to run pipelines written in multiple
>>> languages in runners in multiple systems, and so far we are not doing
>>> this
>>> in particular in the Apache runners. The portability work is the cleanest
>>> way to achieve this vision given the constraints.
>>>
>>
>> Hmm, did I read it wrong and we don't have specific integration of the
>> portable API in runners? This is what is messing up the runners and
>> limiting beam adoption on existing runners.
>> Portable API is a feature buildable on top of runner, not in runners.
>> Same as a runner implementing the 5-6 primitives can run anything, the
>> portable API should just rely on that and not require more integration.
>> It doesn't prevent more deep integrations as for some higher level
>> primitives existing in runners but it is not the case today for runners so
>> shouldn't exist IMHO.
>>
>>
>>>
>>> I agree however that for the Java SDK to Java runner case this can
>>> represent additional pain, docker ideally should not be a requirement for
>>> Java users with the Direct runner and debugging a pipeline should be as
>>> easy as it is today. I think the Universal Local Runner exists to cover
>>> the Portable case, but after looking at this JIRA I am not sure if
>>> unification is coming (and by consequence if docker would be mandatory).
>>> https://issues.apache.org/jira/browse/BEAM-4239
>>>
>>> I suppose for the distributed runners that they must implement the full
>>> Portability APIs to be considered Beam multi language compliant but they
>>> can prefer for performance reasons to translate without the portability
>>> APIs the Java to Java case.
>>>
>>
>>
>> This is my issue, language portability must NOT impact runners at all, it
>> is just a way to forward primitives to a runner.
>> See it as a layer rewriting the pipeline and submitting it. No need to
>> modify any runner.
>>
>>
>>> On Sat, May 5, 2018 at 9:11 AM Reuven Lax  wrote:
>>>
>>> > A beam cluster with the spark runner would include a spark cluster,
>>> plus
>>> what's needed for portability, plus the beam sdk.
>>>
>>> > On Fri, May 4, 2018, 11:55 PM Romain Manni-Bucau <
>>> rmannibu...@gmail.com>
>>> wrote:
>>>
>>>
>>>
>>> >> On 5 May 2018 at 08:43, "Reuven Lax" wrote:
>>>
>>> >> I don't believe we enforce docker anywhere. In fact if someone wanted
>>> to
>>> run an all-windows beam cluster, they would probably not use docker for
>>> their runner (docker runs on Windows, but not efficiently).
>>>
>>>
>>>
>>> >> Or doesn't run sometimes - a colleague hit that yesterday :(.
>>>
>>> >> What is a "beam cluster" - as opposed to a Spark or Flink cluster? How
>>> would it work on 

Re: Graal instead of docker?

2018-05-05 Thread Andrew Pilloud
Thanks for the examples earlier, I think Hazelcast is a great example of
something portability might make more difficult. I'm not working on
portability, but my understanding is that the data sent to the runner is a
blob of code and the name of the container to run it in. A runner with a
native language (java on Hazelcast for example) could run the code directly
without the container if it is in a language it supports. So when Hazelcast
sees a known java container specified, it just loads the java blob and runs
it. When it sees another container it rejects the pipeline. You could use
Graal in the Hazelcast runner to do this for a number of languages. I would
expect that this could also be done in the direct runner, which similarly
provides a native java environment, so portable Java pipelines can be
tested without docker?
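
To make the idea above concrete, here is a minimal sketch of the dispatch I
have in mind. It is only an illustration: the class, the enum and the
container names are all hypothetical, none of them are Beam APIs, and a real
runner would need to inspect far more than a name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: these names are hypothetical and are not Beam APIs.
class EnvironmentDispatchSketch {

    enum Decision { RUN_IN_PROCESS, REJECT_PIPELINE }

    // Assumed container names that this runner treats as "plain Java".
    static final Set<String> KNOWN_JAVA_CONTAINERS =
            new HashSet<>(Arrays.asList("example/java-sdk-harness"));

    // Decide what to do with one stage, given only the container name
    // that was sent along with its code blob.
    static Decision decide(String containerName) {
        return KNOWN_JAVA_CONTAINERS.contains(containerName)
                ? Decision.RUN_IN_PROCESS   // load the Java blob directly, no Docker
                : Decision.REJECT_PIPELINE; // unknown environment: refuse the pipeline
    }

    public static void main(String[] args) {
        System.out.println(decide("example/java-sdk-harness"));
        System.out.println(decide("example/python-sdk-harness"));
    }
}
```

A Graal-based runner could, under the same assumption, add more names to the
known set instead of rejecting them.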

For another way to frame this: if Beam was originally written in Go, we
would be having a different discussion. A pipeline written entirely in java
wouldn't be possible, so instead to enable Hazelcast, we would have to be
able to run the java from portability without running the container.

Andrew

On Sat, May 5, 2018 at 1:48 AM Romain Manni-Bucau 
wrote:

>
>
> 2018-05-05 9:27 GMT+02:00 Ismaël Mejía :
>
>> Graal would not be a viable solution for the reasons Henning and Andrew
>> mentioned, or put in other words, when users choose a programming language
>> they don’t choose only a ‘friendly’ syntax or programming model, they
>> choose also the ecosystem that comes with it, and the libraries that make
>> their life easier. However isolating these user libraries/dependencies is
>> a
>> hard problem and so far the standard solution to this problem is to use
>> operating systems containers via docker.
>>
>
> Graal solves that, Ismaël. It is the same kind of experience as running npm
> libs on Nashorn, but with a more unified API to run software in any language.
>
>
>>
>> The Beam vision from day zero is to run pipelines written in multiple
>> languages in runners in multiple systems, and so far we are not doing this
>> in particular in the Apache runners. The portability work is the cleanest
>> way to achieve this vision given the constraints.
>>
>
> Hmm, did I read it wrong and we don't have specific integration of the
> portable API in runners? This is what is messing up the runners and
> limiting beam adoption on existing runners.
> Portable API is a feature buildable on top of runner, not in runners.
> Same as a runner implementing the 5-6 primitives can run anything, the
> portable API should just rely on that and not require more integration.
> It doesn't prevent more deep integrations as for some higher level
> primitives existing in runners but it is not the case today for runners so
> shouldn't exist IMHO.
>
>
>>
>> I agree however that for the Java SDK to Java runner case this can
>> represent additional pain, docker ideally should not be a requirement for
>> Java users with the Direct runner and debugging a pipeline should be as
>> easy as it is today. I think the Universal Local Runner exists to cover
>> the Portable case, but after looking at this JIRA I am not sure if
>> unification is coming (and by consequence if docker would be mandatory).
>> https://issues.apache.org/jira/browse/BEAM-4239
>>
>> I suppose for the distributed runners that they must implement the full
>> Portability APIs to be considered Beam multi language compliant but they
>> can prefer for performance reasons to translate without the portability
>> APIs the Java to Java case.
>>
>
>
> This is my issue, language portability must NOT impact runners at all, it
> is just a way to forward primitives to a runner.
> See it as a layer rewriting the pipeline and submitting it. No need to
> modify any runner.
>
>
>> On Sat, May 5, 2018 at 9:11 AM Reuven Lax  wrote:
>>
>> > A beam cluster with the spark runner would include a spark cluster, plus
>> what's needed for portability, plus the beam sdk.
>>
>> > On Fri, May 4, 2018, 11:55 PM Romain Manni-Bucau > >
>> wrote:
>>
>>
>>
>> >> On 5 May 2018 at 08:43, "Reuven Lax" wrote:
>>
>> >> I don't believe we enforce docker anywhere. In fact if someone wanted
>> to
>> run an all-windows beam cluster, they would probably not use docker for
>> their runner (docker runs on Windows, but not efficiently).
>>
>>
>>
>> >> Or doesn't run sometimes - a colleague hit that yesterday :(.
>>
>> >> What is a "beam cluster" - as opposed to a Spark or Flink cluster? How
>> would it work on windows servers?
>>
>>
>> >> On Fri, May 4, 2018, 11:19 PM Romain Manni-Bucau <
>> rmannibu...@gmail.com>
>> wrote:
>>
>>
>>
>> >>> 2018-05-05 2:33 GMT+02:00 Andrew Pilloud :
>>
>>  What docker really buys is a package format and runtime environment
>> that is language and operating system agnostic. The docker packaging and
>> runtime format is the de facto standard for portable applications such 

Re: Graal instead of docker?

2018-05-05 Thread Romain Manni-Bucau
2018-05-05 9:27 GMT+02:00 Ismaël Mejía :

> Graal would not be a viable solution for the reasons Henning and Andrew
> mentioned, or put in other words, when users choose a programming language
> they don’t choose only a ‘friendly’ syntax or programming model, they
> choose also the ecosystem that comes with it, and the libraries that make
> their life easier. However isolating these user libraries/dependencies is a
> hard problem and so far the standard solution to this problem is to use
> operating systems containers via docker.
>

Graal solves that, Ismaël. It is the same kind of experience as running npm
libs on Nashorn, but with a more unified API to run software in any language.


>
> The Beam vision from day zero is to run pipelines written in multiple
> languages in runners in multiple systems, and so far we are not doing this
> in particular in the Apache runners. The portability work is the cleanest
> way to achieve this vision given the constraints.
>

Hmm, did I read it wrong, and we don't have specific integration of the
portable API in runners? That integration is what is messing up the runners
and limiting Beam adoption on existing runners.
The portable API is a feature that can be built on top of a runner, not
inside runners. Just as a runner implementing the 5-6 primitives can run
anything, the portable API should rely on those primitives and not require
more integration.
That doesn't rule out deeper integrations for higher-level primitives that
exist in some runners, but that is not the case for runners today, so it
shouldn't exist IMHO.


>
> I agree however that for the Java SDK to Java runner case this can
> represent additional pain, docker ideally should not be a requirement for
> Java users with the Direct runner and debugging a pipeline should be as
> easy as it is today. I think the Universal Local Runner exists to cover
> the Portable case, but after looking at this JIRA I am not sure if
> unification is coming (and by consequence if docker would be mandatory).
> https://issues.apache.org/jira/browse/BEAM-4239
>
> I suppose for the distributed runners that they must implement the full
> Portability APIs to be considered Beam multi language compliant but they
> can prefer for performance reasons to translate without the portability
> APIs the Java to Java case.
>


This is my issue: language portability must NOT impact runners at all; it
is just a way to forward primitives to a runner.
See it as a layer that rewrites the pipeline and submits it. There is no need
to modify any runner.
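
As a toy sketch of what I mean by "a layer": the types below are hypothetical
and are not the Beam portability API, and the one-step-to-one-primitive
mapping is a simplifying assumption. The only point is that the runner
interface stays untouched while the rewriting happens in front of it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: hypothetical types, not the Beam portability API.
class PortabilityLayerSketch {

    // Stand-in for an existing runner that only understands its native primitives.
    interface NativeRunner {
        String run(List<String> primitives);
    }

    // The "layer": rewrites a portable pipeline into primitives, then submits
    // the result to an unmodified runner.
    static final class PortableSubmitter {
        private final NativeRunner delegate;

        PortableSubmitter(NativeRunner delegate) {
            this.delegate = delegate;
        }

        String submit(List<String> portableSteps) {
            List<String> primitives = new ArrayList<>();
            for (String step : portableSteps) {
                // Assumption: every portable step maps onto a ParDo-like primitive.
                primitives.add("ParDo(" + step + ")");
            }
            return delegate.run(primitives);
        }
    }

    public static void main(String[] args) {
        // A fake runner that just reports the primitives it was asked to run.
        NativeRunner runner = primitives -> String.join(" -> ", primitives);
        PortableSubmitter layer = new PortableSubmitter(runner);
        System.out.println(layer.submit(Arrays.asList("read", "count")));
    }
}
```

Whether such a rewrite is enough for non-JVM user code is of course the open
question of this thread; the sketch does not settle it.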


> On Sat, May 5, 2018 at 9:11 AM Reuven Lax  wrote:
>
> > A beam cluster with the spark runner would include a spark cluster, plus
> what's needed for portability, plus the beam sdk.
>
> > On Fri, May 4, 2018, 11:55 PM Romain Manni-Bucau 
> wrote:
>
>
>
> >> On 5 May 2018 at 08:43, "Reuven Lax" wrote:
>
> >> I don't believe we enforce docker anywhere. In fact if someone wanted to
> run an all-windows beam cluster, they would probably not use docker for
> their runner (docker runs on Windows, but not efficiently).
>
>
>
> >> Or doesn't run sometimes - a colleague hit that yesterday :(.
>
> >> What is a "beam cluster" - as opposed to a Spark or Flink cluster? How
> would it work on windows servers?
>
>
> >> On Fri, May 4, 2018, 11:19 PM Romain Manni-Bucau  >
> wrote:
>
>
>
> >>> 2018-05-05 2:33 GMT+02:00 Andrew Pilloud :
>
>  What docker really buys is a package format and runtime environment
> that is language and operating system agnostic. The docker packaging and
> runtime format is the de facto standard for portable applications such as
> this, and there is a group trying to turn it into an actual standard.
>
>  I would agree with you that dockerd has become bloated but there are
> projects that solve that. There is no longer lock-in to dockerd, there are
> package format compatible docker replacements that eliminate the
> performance issues and overhead associated with docker. CRI-O (
> https://github.com/kubernetes-incubator/cri-o) is a really cool RedHat
> project which is a minimalist replacement for docker. I was recently
> working at a startup where I migrated our "data mover" appliance from
> Docker to CRI-O. Our application was able to get direct access to the
> ethernet driver and block devices which enabled a huge performance boost
> but we were also able to run containers produced by docker without
> modification.
>
>  You mention that docker is "detail of one runner+vendor corrupting all
> the project and adding complexity and work to everyone". It sounds like you
> have a specific example you'd like to share? Is there a runner that is
> unable to move to portability because of docker?
>
>
> >>> IBM one for instance, some custom ones like a Hazelcast-based one,
> etc... More generally any runner developed outside Beam itself - even if
> we take a snapshot today, most of Beam's ones have the same pitfall.
>
> >>> Note: i 

Re: Graal instead of docker?

2018-05-05 Thread Ismaël Mejía
Graal would not be a viable solution for the reasons Henning and Andrew
mentioned. Put another way: when users choose a programming language,
they don’t choose only a ‘friendly’ syntax or programming model; they
also choose the ecosystem that comes with it and the libraries that make
their life easier. However, isolating these user libraries/dependencies is a
hard problem, and so far the standard solution to this problem is to use
operating system containers via Docker.

The Beam vision from day zero is to run pipelines written in multiple
languages on runners in multiple systems, and so far we are not doing this,
in particular in the Apache runners. The portability work is the cleanest
way to achieve this vision given the constraints.

I agree, however, that for the Java SDK to Java runner case this can
represent additional pain. Docker ideally should not be a requirement for
Java users with the Direct runner, and debugging a pipeline should be as
easy as it is today. I think the Universal Local Runner exists to cover
the Portable case, but after looking at this JIRA I am not sure if
unification is coming (and, as a consequence, if Docker would be mandatory).
https://issues.apache.org/jira/browse/BEAM-4239

I suppose that distributed runners must implement the full Portability APIs
to be considered Beam multi-language compliant, but for performance reasons
they may prefer to translate the Java-to-Java case without the portability
APIs.

On Sat, May 5, 2018 at 9:11 AM Reuven Lax  wrote:

> A beam cluster with the spark runner would include a spark cluster, plus
what's needed for portability, plus the beam sdk.

> On Fri, May 4, 2018, 11:55 PM Romain Manni-Bucau 
wrote:



>> On 5 May 2018 at 08:43, "Reuven Lax" wrote:

>> I don't believe we enforce docker anywhere. In fact if someone wanted to
run an all-windows beam cluster, they would probably not use docker for
their runner (docker runs on Windows, but not efficiently).



>> Or doesn't run sometimes - a colleague hit that yesterday :(.

>> What is a "beam cluster" - as opposed to a Spark or Flink cluster? How
would it work on windows servers?


>> On Fri, May 4, 2018, 11:19 PM Romain Manni-Bucau 
wrote:



>>> 2018-05-05 2:33 GMT+02:00 Andrew Pilloud :

 What docker really buys is a package format and runtime environment
that is language and operating system agnostic. The docker packaging and
runtime format is the de facto standard for portable applications such as
this, and there is a group trying to turn it into an actual standard.

 I would agree with you that dockerd has become bloated but there are
projects that solve that. There is no longer lock-in to dockerd, there are
package format compatible docker replacements that eliminate the
performance issues and overhead associated with docker. CRI-O (
https://github.com/kubernetes-incubator/cri-o) is a really cool RedHat
project which is a minimalist replacement for docker. I was recently
working at a startup where I migrated our "data mover" appliance from
Docker to CRI-O. Our application was able to get direct access to the
ethernet driver and block devices which enabled a huge performance boost
but we were also able to run containers produced by docker without
modification.

 You mention that docker is "detail of one runner+vendor corrupting all
the project and adding complexity and work to everyone". It sounds like you
have a specific example you'd like to share? Is there a runner that is
unable to move to portability because of docker?


>>> IBM one for instance, some custom ones like a Hazelcast-based one,
etc... More generally any runner developed outside Beam itself - even if
we take a snapshot today, most of Beam's ones have the same pitfall.

>>> Note: I never said Docker was a bad technology. Let me try to clarify.

>>> Main issue is that you enforce Docker usage, which is still trendy. It
is like Scala, which was promising to kill Java; check what it does today...
>>> It starts to be tooled but it is also very impacting on the deployment
side and for a good number of beam users who deploy it outside the cloud it
is an issue.
>>> Keep in mind beam is embeddable by design, it is not a runner
environment and with the docker choice it imposes some environment which is
inconsistent with beam design itself and this is where this choice blocks.



 Andrew

 On Fri, May 4, 2018 at 4:32 PM Henning Rohde 
wrote:

> Romain,

> Docker, unlike selinux, solves a great number of tangible problems
for us with IMO a relatively small tax. It does not have to be the only
way. Some of the concerns you bring up along with possibilities were also
discussed here: https://s.apache.org/beam-fn-api-container-contract. I
encourage you to take a look.

> Thanks,
>   Henning


> On Fri, May 4, 2018 at 3:18 PM Romain Manni-Bucau <

Re: Graal instead of docker?

2018-05-05 Thread Reuven Lax
A beam cluster with the spark runner would include a spark cluster, plus
what's needed for portability, plus the beam sdk.

On Fri, May 4, 2018, 11:55 PM Romain Manni-Bucau 
wrote:

>
>
> On 5 May 2018 at 08:43, "Reuven Lax" wrote:
>
> I don't believe we enforce docker anywhere. In fact if someone wanted to
> run an all-windows beam cluster, they would probably not use docker for
> their runner (docker runs on Windows, but not efficiently).
>
>
>
> Or doesn't run sometimes - a colleague hit that yesterday :(.
>
> What is a "beam cluster" - as opposed to a Spark or Flink cluster? How would
> it work on windows servers?
>
>
> On Fri, May 4, 2018, 11:19 PM Romain Manni-Bucau 
> wrote:
>
>>
>>
>> 2018-05-05 2:33 GMT+02:00 Andrew Pilloud :
>>
>>> What docker really buys is a package format and runtime environment that
>>> is language and operating system agnostic. The docker packaging and
>>> runtime format is the de facto standard for portable applications such as
>>> this, and there is a group trying to turn it into an actual standard.
>>>
>>> I would agree with you that dockerd has become bloated but there are
>>> projects that solve that. There is no longer lock-in to dockerd, there
>>> are package format compatible docker replacements that eliminate the
>>> performance issues and overhead associated with docker. CRI-O (
>>> https://github.com/kubernetes-incubator/cri-o) is a really cool RedHat
>>> project which is a minimalist replacement for docker. I was recently
>>> working at a startup where I migrated our "data mover" appliance from
>>> Docker to CRI-O. Our application was able to get direct access to the
>>> ethernet driver and block devices which enabled a huge performance boost
>>> but we were also able to run containers produced by docker without
>>> modification.
>>>
>>> You mention that docker is "detail of one runner+vendor corrupting all
>>> the project and adding complexity and work to everyone". It sounds like
>>> you have a specific example you'd like to share? Is there a runner that is
>>> unable to move to portability because of docker?
>>>
>>
>> IBM one for instance, some custom ones like a Hazelcast-based one,
>> etc... More generally any runner developed outside Beam itself - even if
>> we take a snapshot today, most of Beam's ones have the same pitfall.
>>
>> Note: I never said Docker was a bad technology. Let me try to clarify.
>>
>> Main issue is that you enforce Docker usage, which is still trendy. It is
>> like Scala, which was promising to kill Java; check what it does today...
>> It starts to be tooled but it is also very impacting on the deployment
>> side and for a good number of beam users who deploy it outside the cloud it
>> is an issue.
>> Keep in mind beam is embeddable by design, it is not a runner environment
>> and with the docker choice it imposes some environment which is
>> inconsistent with beam design itself and this is where this choice blocks.
>>
>>
>>>
>>> Andrew
>>>
>>> On Fri, May 4, 2018 at 4:32 PM Henning Rohde  wrote:
>>>
 Romain,

 Docker, unlike selinux, solves a great number of tangible problems for
 us with IMO a relatively small tax. It does not have to be the only way.
 Some of the concerns you bring up along with possibilities were also
 discussed here: https://s.apache.org/beam-fn-api-container-contract. I
 encourage you to take a look.

 Thanks,
  Henning


 On Fri, May 4, 2018 at 3:18 PM Romain Manni-Bucau <
 rmannibu...@gmail.com> wrote:

>
>
> On 4 May 2018 at 21:31, "Henning Rohde" wrote:
>
> I disagree with the characterization of docker and the implications
> made towards portability. Graal looks like a neat project (and I
> never thought I would live to see the phrase "Practical Partial 
> Evaluation"
> ..), but it doesn't address the needs of portability. In addition to 
> Luke's
> examples, Go and most other languages don't work on it either. Docker
> containers also address packaging, OS dependencies, conflicting versions
> and distribution aspects in addition to truly universal language support.
>
>
> This is wrong, docker also has its conflicts, is not universal (fails
> on windows and mac easily - as host or not, cloud vendors put layers
> limiting or corrupting it, and it is an infra constraint imposed and a
> vendor locking not welcomed in beam IMHO).
>
> This is my main concern. All the work done looks like an
> implementation detail of one runner+vendor corrupting all the project and
> adding complexity and work to everyone instead of keeping it localised
> (technically it is possible).
>
> Would you accept i enforce you to use selinux? Using docker is the
> same kind of constraint.
>
>
> That said, it's entirely fine for some 

Re: Graal instead of docker?

2018-05-05 Thread Romain Manni-Bucau
On 5 May 2018 at 08:43, "Reuven Lax" wrote:

I don't believe we enforce docker anywhere. In fact if someone wanted to
run an all-windows beam cluster, they would probably not use docker for
their runner (docker runs on Windows, but not efficiently).



Or it doesn't run at all sometimes - a colleague hit that yesterday :(.

What is a "beam cluster", as opposed to a Spark or Flink cluster? How would
it work on windows servers?


On Fri, May 4, 2018, 11:19 PM Romain Manni-Bucau 
wrote:

>
>
> 2018-05-05 2:33 GMT+02:00 Andrew Pilloud :
>
>> What docker really buys is a package format and runtime environment that
>> is language and operating system agnostic. The docker packaging and
>> runtime format is the de facto standard for portable applications such as
>> this, and there is a group trying to turn it into an actual standard.
>>
>> I would agree with you that dockerd has become bloated but there are
>> projects that solve that. There is no longer lock-in to dockerd, there
>> are package format compatible docker replacements that eliminate the
>> performance issues and overhead associated with docker. CRI-O (
>> https://github.com/kubernetes-incubator/cri-o) is a really cool RedHat
>> project which is a minimalist replacement for docker. I was recently
>> working at a startup where I migrated our "data mover" appliance from
>> Docker to CRI-O. Our application was able to get direct access to the
>> ethernet driver and block devices which enabled a huge performance boost
>> but we were also able to run containers produced by docker without
>> modification.
>>
>> You mention that docker is "detail of one runner+vendor corrupting all
>> the project and adding complexity and work to everyone". It sounds like
>> you have a specific example you'd like to share? Is there a runner that is
>> unable to move to portability because of docker?
>>
>
> IBM one for instance, some custom ones like a Hazelcast-based one, etc...
> More generally any runner developed outside Beam itself - even if we take
> a snapshot today, most of Beam's ones have the same pitfall.
>
> Note: I never said Docker was a bad technology. Let me try to clarify.
>
> Main issue is that you enforce Docker usage, which is still trendy. It is
> like Scala, which was promising to kill Java; check what it does today...
> It starts to be tooled but it is also very impacting on the deployment
> side and for a good number of beam users who deploy it outside the cloud it
> is an issue.
> Keep in mind beam is embeddable by design, it is not a runner environment
> and with the docker choice it imposes some environment which is
> inconsistent with beam design itself and this is where this choice blocks.
>
>
>>
>> Andrew
>>
>> On Fri, May 4, 2018 at 4:32 PM Henning Rohde  wrote:
>>
>>> Romain,
>>>
>>> Docker, unlike selinux, solves a great number of tangible problems for
>>> us with IMO a relatively small tax. It does not have to be the only way.
>>> Some of the concerns you bring up along with possibilities were also
>>> discussed here: https://s.apache.org/beam-fn-api-container-contract. I
>>> encourage you to take a look.
>>>
>>> Thanks,
>>>  Henning
>>>
>>>
>>> On Fri, May 4, 2018 at 3:18 PM Romain Manni-Bucau 
>>> wrote:
>>>


 On 4 May 2018 at 21:31, "Henning Rohde" wrote:

 I disagree with the characterization of docker and the implications
 made towards portability. Graal looks like a neat project (and I never
 thought I would live to see the phrase "Practical Partial Evaluation" ..),
 but it doesn't address the needs of portability. In addition to Luke's
 examples, Go and most other languages don't work on it either. Docker
 containers also address packaging, OS dependencies, conflicting versions
 and distribution aspects in addition to truly universal language support.


 This is wrong, docker also has its conflicts, is not universal (fails
 on windows and mac easily - as host or not, cloud vendors put layers
 limiting or corrupting it, and it is an infra constraint imposed and a
 vendor locking not welcomed in beam IMHO).

 This is my main concern. All the work done looks like an implementation
 detail of one runner+vendor corrupting all the project and adding
 complexity and work to everyone instead of keeping it localised
 (technically it is possible).

 Would you accept i enforce you to use selinux? Using docker is the same
 kind of constraint.


 That said, it's entirely fine for some runners to use Jython, Graal,
 etc to provide a specialized offering similar to the direct runners, but it
 would be disjoint from portability IMO.

 On Fri, May 4, 2018 at 10:14 AM Romain Manni-Bucau <
 rmannibu...@gmail.com> wrote:

>
>
> On 4 May 2018 at 17:55, "Lukasz Cwik" wrote:

Re: Graal instead of docker?

2018-05-05 Thread Reuven Lax
I don't believe we enforce docker anywhere. In fact if someone wanted to
run an all-windows beam cluster, they would probably not use docker for
their runner (docker runs on Windows, but not efficiently).

On Fri, May 4, 2018, 11:19 PM Romain Manni-Bucau 
wrote:

>
>
> 2018-05-05 2:33 GMT+02:00 Andrew Pilloud :
>
>> What docker really buys is a package format and runtime environment that
>> is language and operating system agnostic. The docker packaging and
>> runtime format is the de facto standard for portable applications such as
>> this, and there is a group trying to turn it into an actual standard.
>>
>> I would agree with you that dockerd has become bloated but there are
>> projects that solve that. There is no longer lock-in to dockerd, there
>> are package format compatible docker replacements that eliminate the
>> performance issues and overhead associated with docker. CRI-O (
>> https://github.com/kubernetes-incubator/cri-o) is a really cool RedHat
>> project which is a minimalist replacement for docker. I was recently
>> working at a startup where I migrated our "data mover" appliance from
>> Docker to CRI-O. Our application was able to get direct access to the
>> ethernet driver and block devices which enabled a huge performance boost
>> but we were also able to run containers produced by docker without
>> modification.
>>
>> You mention that docker is "detail of one runner+vendor corrupting all
>> the project and adding complexity and work to everyone". It sounds like
>> you have a specific example you'd like to share? Is there a runner that is
>> unable to move to portability because of docker?
>>
>
> IBM one for instance, some custom ones like a Hazelcast-based one, etc...
> More generally any runner developed outside Beam itself - even if we take
> a snapshot today, most of Beam's ones have the same pitfall.
>
> Note: I never said Docker was a bad technology. Let me try to clarify.
>
> Main issue is that you enforce Docker usage, which is still trendy. It is
> like Scala, which was promising to kill Java; check what it does today...
> It starts to be tooled but it is also very impacting on the deployment
> side and for a good number of beam users who deploy it outside the cloud it
> is an issue.
> Keep in mind beam is embeddable by design, it is not a runner environment
> and with the docker choice it imposes some environment which is
> inconsistent with beam design itself and this is where this choice blocks.
>
>
>>
>> Andrew
>>
>> On Fri, May 4, 2018 at 4:32 PM Henning Rohde  wrote:
>>
>>> Romain,
>>>
>>> Docker, unlike selinux, solves a great number of tangible problems for
>>> us with IMO a relatively small tax. It does not have to be the only way.
>>> Some of the concerns you bring up along with possibilities were also
>>> discussed here: https://s.apache.org/beam-fn-api-container-contract. I
>>> encourage you to take a look.
>>>
>>> Thanks,
>>>  Henning
>>>
>>>
>>> On Fri, May 4, 2018 at 3:18 PM Romain Manni-Bucau 
>>> wrote:
>>>


 On 4 May 2018 at 21:31, "Henning Rohde" wrote:

 I disagree with the characterization of docker and the implications
 made towards portability. Graal looks like a neat project (and I never
 thought I would live to see the phrase "Practical Partial Evaluation" ..),
 but it doesn't address the needs of portability. In addition to Luke's
 examples, Go and most other languages don't work on it either. Docker
 containers also address packaging, OS dependencies, conflicting versions
 and distribution aspects in addition to truly universal language support.


 This is wrong, docker also has its conflicts, is not universal (fails
 on windows and mac easily - as host or not, cloud vendors put layers
 limiting or corrupting it, and it is an infra constraint imposed and a
 vendor locking not welcomed in beam IMHO).

 This is my main concern. All the work done looks like an implementation
 detail of one runner+vendor corrupting all the project and adding
 complexity and work to everyone instead of keeping it localised
 (technically it is possible).

 Would you accept i enforce you to use selinux? Using docker is the same
 kind of constraint.


 That said, it's entirely fine for some runners to use Jython, Graal,
 etc to provide a specialized offering similar to the direct runners, but it
 would be disjoint from portability IMO.

 On Fri, May 4, 2018 at 10:14 AM Romain Manni-Bucau <
 rmannibu...@gmail.com> wrote:

>
>
> On 4 May 2018 at 17:55, "Lukasz Cwik" wrote:
>
> I did take a look at Graal a while back when thinking about how
> execution environments could be defined, my concerns were related to it 
> not
> supporting all of the features of a language.
> For example, 

Re: Graal instead of docker?

2018-05-05 Thread Romain Manni-Bucau
2018-05-05 2:33 GMT+02:00 Andrew Pilloud :

> What docker really buys is a package format and runtime environment that
> is language and operating system agnostic. The docker packaging and
> runtime format is the de facto standard for portable applications such as
> this, and there is a group trying to turn it into an actual standard.
>
> I would agree with you that dockerd has become bloated but there are
> projects that solve that. There is no longer lock-in to dockerd, there
> are package format compatible docker replacements that eliminate the
> performance issues and overhead associated with docker. CRI-O (
> https://github.com/kubernetes-incubator/cri-o) is a really cool Red Hat
> project which is a minimalist replacement for docker. I was recently
> working at a startup where I migrated our "data mover" appliance from
> Docker to CRI-O. Our application was able to get direct access to the
> ethernet driver and block devices which enabled a huge performance boost
> but we were also able to run containers produced by docker without
> modification.
>
> You mention that docker is "detail of one runner+vendor corrupting all
> the project and adding complexity and work to everyone". It sounds like
> you have a specific example you'd like to share? Is there a runner that is
> unable to move to portability because of docker?
>

The IBM one for instance, some custom ones like a hazelcast based one, etc...
More generally, any runner developed outside beam itself - and even if we take
a snapshot today, most of beam's own runners have the same pitfall.

Note: I never said docker was a bad technology or anything like that. Let me
try to clarify.

The main issue is that you enforce docker usage, which is still trendy. It is
like scala, which was promising to kill java; check where it is today...
Tooling for it is starting to appear, but it also has a heavy impact on the
deployment side, and for a good number of beam users who deploy outside the
cloud that is an issue.
Keep in mind beam is embeddable by design; it is not a runner environment.
With the docker choice it imposes an environment, which is inconsistent with
beam's design itself, and this is where this choice blocks.


>
> Andrew
>
> On Fri, May 4, 2018 at 4:32 PM Henning Rohde  wrote:
>
>> Romain,
>>
>> Docker, unlike selinux, solves a great number of tangible problems for us
>> with IMO a relatively small tax. It does not have to be the only way. Some
>> of the concerns you bring up along with possibilities were also discussed
>> here: https://s.apache.org/beam-fn-api-container-contract. I encourage
>> you to take a look.
>>
>> Thanks,
>>  Henning
>>
>>
>> On Fri, May 4, 2018 at 3:18 PM Romain Manni-Bucau 
>> wrote:
>>
>>>
>>>
>>> On May 4, 2018 at 21:31, "Henning Rohde" wrote:
>>>
>>> I disagree with the characterization of docker and the implications made
>>> towards portability. Graal looks like a neat project (and I never
>>> thought I would live to see the phrase "Practical Partial Evaluation" ..),
>>> but it doesn't address the needs of portability. In addition to Luke's
>>> examples, Go and most other languages don't work on it either. Docker
>>> containers also address packaging, OS dependencies, conflicting versions
>>> and distribution aspects in addition to truly universal language support.
>>>
>>>
>>> This is wrong: docker also has its conflicts and is not universal (it
>>> fails easily on Windows and Mac, as host or not; cloud vendors put layers
>>> on top that limit or corrupt it; and it is an imposed infra constraint
>>> and a vendor lock-in that is not welcome in beam, IMHO).
>>>
>>> This is my main concern. All the work done looks like an implementation
>>> detail of one runner+vendor corrupting the whole project and adding
>>> complexity and work for everyone instead of keeping it localised
>>> (technically it is possible).
>>>
>>> Would you accept it if I forced you to use selinux? Using docker is the
>>> same kind of constraint.
>>>
>>>
>>> That said, it's entirely fine for some runners to use Jython, Graal, etc
>>> to provide a specialized offering similar to the direct runners, but it
>>> would be disjoint from portability IMO.
>>>
>>> On Fri, May 4, 2018 at 10:14 AM Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>


>>>> On May 4, 2018 at 17:55, "Lukasz Cwik" wrote:
>>>>
>>>> I did take a look at Graal a while back when thinking about how
>>>> execution environments could be defined; my concerns were related to it not
>>>> supporting all of the features of a language.
>>>> For example, it's typical for Python to load and call native libraries,
>>>> and Graal can only execute C/C++ code that has been compiled to LLVM.
>>>> Also, a good number of people interested in using ML libraries will
>>>> want access to GPUs to improve performance, which I believe Graal can't
>>>> support.
>>>>
>>>> It can be a very useful way to run simple lambda functions written in
>>>> some language directly without

Re: Graal instead of docker?

2018-05-04 Thread Henning Rohde
Romain,

Docker, unlike selinux, solves a great number of tangible problems for us
with IMO a relatively small tax. It does not have to be the only way. Some
of the concerns you bring up along with possibilities were also discussed
here: https://s.apache.org/beam-fn-api-container-contract. I encourage you
to take a look.

Thanks,
 Henning


On Fri, May 4, 2018 at 3:18 PM Romain Manni-Bucau 
wrote:

>
>
> On May 4, 2018 at 21:31, "Henning Rohde" wrote:
>
> I disagree with the characterization of docker and the implications made
> towards portability. Graal looks like a neat project (and I never thought
> I would live to see the phrase "Practical Partial Evaluation" ..), but it
> doesn't address the needs of portability. In addition to Luke's examples,
> Go and most other languages don't work on it either. Docker containers also
> address packaging, OS dependencies, conflicting versions and distribution
> aspects in addition to truly universal language support.
>
>
> This is wrong: docker also has its conflicts and is not universal (it fails
> easily on Windows and Mac, as host or not; cloud vendors put layers on top
> that limit or corrupt it; and it is an imposed infra constraint and a vendor
> lock-in that is not welcome in beam, IMHO).
>
> This is my main concern. All the work done looks like an implementation
> detail of one runner+vendor corrupting the whole project and adding
> complexity and work for everyone instead of keeping it localised
> (technically it is possible).
>
> Would you accept it if I forced you to use selinux? Using docker is the same
> kind of constraint.
>
>
> That said, it's entirely fine for some runners to use Jython, Graal, etc
> to provide a specialized offering similar to the direct runners, but it
> would be disjoint from portability IMO.
>
> On Fri, May 4, 2018 at 10:14 AM Romain Manni-Bucau 
> wrote:
>
>>
>>
>> On May 4, 2018 at 17:55, "Lukasz Cwik" wrote:
>>
>> I did take a look at Graal a while back when thinking about how execution
>> environments could be defined; my concerns were related to it not
>> supporting all of the features of a language.
>> For example, it's typical for Python to load and call native libraries, and
>> Graal can only execute C/C++ code that has been compiled to LLVM.
>> Also, a good number of people interested in using ML libraries will want
>> access to GPUs to improve performance, which I believe Graal can't
>> support.
>>
>> It can be a very useful way to run simple lambda functions written in some
>> language directly without needing to use a docker environment, but you could
>> probably use something even lighter weight than Graal that is language
>> specific, like Jython.
>>
>>
>>
>> Right, the jsr223 impl works very well, but you can also get a perf boost
>> using native bindings (like the v8 java binding for js, for instance). It is
>> way more efficient than docker most of the time and not code intrusive at all
>> in runners, so likely more adoptable and maintainable. That said, it is all
>> doable behind jsr223, so maybe not a big deal in terms of api. We just need
>> to ensure the portability work stays clean and actually portable and doesn't
>> impact runners the way the poc done until today did.
>>
>> Works for me.
>>
>>
>> On Thu, May 3, 2018 at 10:05 PM Romain Manni-Bucau 
>> wrote:
>>
>>> Hi guys
>>>
>>> For some time now there have been efforts to add language-portable support
>>> to beam, but I can't really find a case where it "works" when based on
>>> docker, except for some vendor specific infra.
>>>
>>> Current solution:
>>>
>>> 1. Is runner intrusive (which is bad for beam and prevents adoption by
>>> big data vendors)
>>> 2. Is based on docker (which assumes a runtime environment, is very
>>> ops/infra intrusive and is likely too $$ quite often for what it brings)
>>>
>>> Has anyone had a look at graal, which seems a way to make the feature
>>> doable in a lighter manner and optimized compared to default jsr223 impls?
>>>
>>>
>>
>


Re: Graal instead of docker?

2018-05-04 Thread Romain Manni-Bucau
On May 4, 2018 at 21:31, "Henning Rohde" wrote:

I disagree with the characterization of docker and the implications made
towards portability. Graal looks like a neat project (and I never thought I
would live to see the phrase "Practical Partial Evaluation" ..), but it
doesn't address the needs of portability. In addition to Luke's examples,
Go and most other languages don't work on it either. Docker containers also
address packaging, OS dependencies, conflicting versions and distribution
aspects in addition to truly universal language support.


This is wrong: docker also has its conflicts and is not universal (it fails
easily on Windows and Mac, as host or not; cloud vendors put layers on top
that limit or corrupt it; and it is an imposed infra constraint and a vendor
lock-in that is not welcome in beam, IMHO).

This is my main concern. All the work done looks like an implementation
detail of one runner+vendor corrupting the whole project and adding
complexity and work for everyone instead of keeping it localised
(technically it is possible).

Would you accept it if I forced you to use selinux? Using docker is the same
kind of constraint.


That said, it's entirely fine for some runners to use Jython, Graal, etc to
provide a specialized offering similar to the direct runners, but it would
be disjoint from portability IMO.

On Fri, May 4, 2018 at 10:14 AM Romain Manni-Bucau 
wrote:

>
>
> On May 4, 2018 at 17:55, "Lukasz Cwik" wrote:
>
> I did take a look at Graal a while back when thinking about how execution
> environments could be defined; my concerns were related to it not
> supporting all of the features of a language.
> For example, it's typical for Python to load and call native libraries, and
> Graal can only execute C/C++ code that has been compiled to LLVM.
> Also, a good number of people interested in using ML libraries will want
> access to GPUs to improve performance, which I believe Graal can't
> support.
>
> It can be a very useful way to run simple lambda functions written in some
> language directly without needing to use a docker environment, but you could
> probably use something even lighter weight than Graal that is language
> specific, like Jython.
>
>
>
> Right, the jsr223 impl works very well, but you can also get a perf boost
> using native bindings (like the v8 java binding for js, for instance). It is
> way more efficient than docker most of the time and not code intrusive at all
> in runners, so likely more adoptable and maintainable. That said, it is all
> doable behind jsr223, so maybe not a big deal in terms of api. We just need
> to ensure the portability work stays clean and actually portable and doesn't
> impact runners the way the poc done until today did.
>
> Works for me.
>
>
> On Thu, May 3, 2018 at 10:05 PM Romain Manni-Bucau 
> wrote:
>
>> Hi guys
>>
>> For some time now there have been efforts to add language-portable support
>> to beam, but I can't really find a case where it "works" when based on
>> docker, except for some vendor specific infra.
>>
>> Current solution:
>>
>> 1. Is runner intrusive (which is bad for beam and prevents adoption by
>> big data vendors)
>> 2. Is based on docker (which assumes a runtime environment, is very
>> ops/infra intrusive and is likely too $$ quite often for what it brings)
>>
>> Has anyone had a look at graal, which seems a way to make the feature
>> doable in a lighter manner and optimized compared to default jsr223 impls?
>>
>>
>


Re: Graal instead of docker?

2018-05-04 Thread Henning Rohde
I disagree with the characterization of docker and the implications made
towards portability. Graal looks like a neat project (and I never thought I
would live to see the phrase "Practical Partial Evaluation" ..), but it
doesn't address the needs of portability. In addition to Luke's examples,
Go and most other languages don't work on it either. Docker containers also
address packaging, OS dependencies, conflicting versions and distribution
aspects in addition to truly universal language support.

That said, it's entirely fine for some runners to use Jython, Graal, etc to
provide a specialized offering similar to the direct runners, but it would
be disjoint from portability IMO.

On Fri, May 4, 2018 at 10:14 AM Romain Manni-Bucau 
wrote:

>
>
> On May 4, 2018 at 17:55, "Lukasz Cwik" wrote:
>
> I did take a look at Graal a while back when thinking about how execution
> environments could be defined; my concerns were related to it not
> supporting all of the features of a language.
> For example, it's typical for Python to load and call native libraries, and
> Graal can only execute C/C++ code that has been compiled to LLVM.
> Also, a good number of people interested in using ML libraries will want
> access to GPUs to improve performance, which I believe Graal can't
> support.
>
> It can be a very useful way to run simple lambda functions written in some
> language directly without needing to use a docker environment, but you could
> probably use something even lighter weight than Graal that is language
> specific, like Jython.
>
>
>
> Right, the jsr223 impl works very well, but you can also get a perf boost
> using native bindings (like the v8 java binding for js, for instance). It is
> way more efficient than docker most of the time and not code intrusive at all
> in runners, so likely more adoptable and maintainable. That said, it is all
> doable behind jsr223, so maybe not a big deal in terms of api. We just need
> to ensure the portability work stays clean and actually portable and doesn't
> impact runners the way the poc done until today did.
>
> Works for me.
>
>
> On Thu, May 3, 2018 at 10:05 PM Romain Manni-Bucau 
> wrote:
>
>> Hi guys
>>
>> For some time now there have been efforts to add language-portable support
>> to beam, but I can't really find a case where it "works" when based on
>> docker, except for some vendor specific infra.
>>
>> Current solution:
>>
>> 1. Is runner intrusive (which is bad for beam and prevents adoption by
>> big data vendors)
>> 2. Is based on docker (which assumes a runtime environment, is very
>> ops/infra intrusive and is likely too $$ quite often for what it brings)
>>
>> Has anyone had a look at graal, which seems a way to make the feature
>> doable in a lighter manner and optimized compared to default jsr223 impls?
>>
>>
>


Re: Graal instead of docker?

2018-05-04 Thread Romain Manni-Bucau
On May 4, 2018 at 17:55, "Lukasz Cwik" wrote:

I did take a look at Graal a while back when thinking about how execution
environments could be defined; my concerns were related to it not
supporting all of the features of a language.
For example, it's typical for Python to load and call native libraries, and
Graal can only execute C/C++ code that has been compiled to LLVM.
Also, a good number of people interested in using ML libraries will want
access to GPUs to improve performance, which I believe Graal can't
support.

It can be a very useful way to run simple lambda functions written in some
language directly without needing to use a docker environment, but you could
probably use something even lighter weight than Graal that is language
specific, like Jython.



Right, the jsr223 impl works very well, but you can also get a perf boost
using native bindings (like the v8 java binding for js, for instance). It is
way more efficient than docker most of the time and not code intrusive at all
in runners, so likely more adoptable and maintainable. That said, it is all
doable behind jsr223, so maybe not a big deal in terms of api. We just need
to ensure the portability work stays clean and actually portable and doesn't
impact runners the way the poc done until today did.

Works for me.


On Thu, May 3, 2018 at 10:05 PM Romain Manni-Bucau 
wrote:

> Hi guys
>
> For some time now there have been efforts to add language-portable support
> to beam, but I can't really find a case where it "works" when based on
> docker, except for some vendor specific infra.
>
> Current solution:
>
> 1. Is runner intrusive (which is bad for beam and prevents adoption by big
> data vendors)
> 2. Is based on docker (which assumes a runtime environment, is very
> ops/infra intrusive and is likely too $$ quite often for what it brings)
>
> Has anyone had a look at graal, which seems a way to make the feature
> doable in a lighter manner and optimized compared to default jsr223 impls?
>
>
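
As an illustration of the jsr223 route discussed in this message, here is a
minimal sketch of evaluating a script in-process through the standard
javax.script API, with no container involved. The class name Jsr223Sketch,
the helper evalIfAvailable and the engine name "python" are assumptions made
for illustration; nothing here is beam code, and whether an engine resolves
depends entirely on what is on the classpath (Jython, for example).

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineFactory;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

// Hypothetical illustration class; not part of beam.
class Jsr223Sketch {

    /**
     * Evaluates a script with the named JSR 223 engine, or returns null
     * when no engine with that name is available in this JVM.
     */
    static Object evalIfAvailable(String engineName, String script) throws ScriptException {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName(engineName);
        if (engine == null) {
            return null; // no such engine on the classpath
        }
        return engine.eval(script);
    }

    public static void main(String[] args) throws ScriptException {
        // List whichever engines this JVM can actually see.
        for (ScriptEngineFactory factory : new ScriptEngineManager().getEngineFactories()) {
            System.out.println(factory.getEngineName() + " -> " + factory.getNames());
        }
        // Assumption: "python" resolves only if an engine such as Jython is present.
        System.out.println(evalIfAvailable("python", "1 + 1"));
    }
}
```

The null check matters because a plain JDK may ship no script engine at all,
which is one reason this route covers fewer languages than a container does.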


Re: Graal instead of docker?

2018-05-04 Thread Lukasz Cwik
I did take a look at Graal a while back when thinking about how execution
environments could be defined; my concerns were related to it not
supporting all of the features of a language.
For example, it's typical for Python to load and call native libraries, and
Graal can only execute C/C++ code that has been compiled to LLVM.
Also, a good number of people interested in using ML libraries will want
access to GPUs to improve performance, which I believe Graal can't
support.

It can be a very useful way to run simple lambda functions written in some
language directly without needing to use a docker environment, but you could
probably use something even lighter weight than Graal that is language
specific, like Jython.

On Thu, May 3, 2018 at 10:05 PM Romain Manni-Bucau 
wrote:

> Hi guys
>
> For some time now there have been efforts to add language-portable support
> to beam, but I can't really find a case where it "works" when based on
> docker, except for some vendor specific infra.
>
> Current solution:
>
> 1. Is runner intrusive (which is bad for beam and prevents adoption by big
> data vendors)
> 2. Is based on docker (which assumes a runtime environment, is very
> ops/infra intrusive and is likely too $$ quite often for what it brings)
>
> Has anyone had a look at graal, which seems a way to make the feature
> doable in a lighter manner and optimized compared to default jsr223 impls?
>
>