On Fri, May 11, 2018 at 11:48 AM Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

>
>
> On Fri, May 11, 2018 at 6:15 PM, Andrew Pilloud <apill...@google.com> wrote:
>
>> Json and Protobuf aren't the same thing. Json is for exchanging
>> unstructured data, Protobuf is for exchanging structured data. The point of
>> Portability is to define a protocol for exchanging structured messages
>> across languages. What do you propose using on top of Json to define
>> message structure?
>>
>
> I'm fine with protobuf contracts, just not with all the rest (libs*). JSON
> has the advantage of not requiring much from consumers and of being easy
> to integrate and proxy. Protobuf imposes a lot on that layer, which will
> be typed by the runner anyway, so there is no need for two typing layers.
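>
> For illustration only, a rough sketch of how the two could coexist: keep
> the .proto files as the contract but expose JSON on the wire via
> protobuf-java-util, so a consumer only needs whatever JSON lib it already
> has. (This is not what Beam does today; the
> org.apache.beam.model.pipeline.v1 package name is an assumption to check
> against your version.)
>
>     import com.google.protobuf.util.JsonFormat;
>     import org.apache.beam.model.pipeline.v1.RunnerApi;
>
>     public class ProtoContractJsonWire {
>       public static void main(String[] args) throws Exception {
>         // The contract is still defined by beam_runner_api.proto.
>         RunnerApi.Pipeline pipeline = RunnerApi.Pipeline.newBuilder().build();
>
>         // Producer side: serialize the proto message as JSON text.
>         String json = JsonFormat.printer().print(pipeline);
>
>         // A consumer that has the generated classes can parse it back;
>         // one that doesn't can read it with any JSON library.
>         RunnerApi.Pipeline.Builder parsed = RunnerApi.Pipeline.newBuilder();
>         JsonFormat.parser().ignoringUnknownFields().merge(json, parsed);
>         System.out.println(parsed.build());
>       }
>     }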
>
>
>> I'd like to see the generic runner rewritten in Golang so we can
>> eliminate the significant overhead imposed by the JVM. I would argue that
>> Go is the best language for low overhead infrastructure, and is already
>> widely used by projects in this space such as Docker, Kubernetes, InfluxDB.
>> Even SQL can take advantage of this. For example, several runners could be
>> passed raw SQL and use their own SQL engines to implement more efficient
>> transforms than generic Beam can. Users will save significant $$$ on
>> infrastructure by not having to involve the JVM at all.
>>
>
> Yes... or no. JVM overhead is very low for such infra, less than 64 MB of
> RAM and almost no CPU, so it will not help much for clusters or long-lived
> processes like the ones we are talking about.
>
> Also, the Beam community is Java - don't answer that it is Python or Go
> without checking ;). I'm not sure adding a new language will help, or give
> the project a face people will want to contribute to or use.
>
> Currently I'd say the way the router runner is done is a detail, but the
> choice to rethink the current impl is a crucial architectural point.
>
> No issue with having N router impls either: one in Java, one in Go (though
> I thought we had very few Go resources/lovers in Beam?), one in Python
> where it would make a lot of sense and would show a router without an
> actual primitive impl (delegating to the direct Java runner), etc...
>
> But one thing at a time: is anyone against stopping the current impl
> track, reverting it, and moving to a higher-level runner?
>
I am still not clear as to exactly what kind of change you are proposing.
First it looked like you were proposing to not have a hard dependency on
Docker, and that got resolved (we don't). Now it sounds like you're against
Java protobuf libraries, but that doesn't strike me as something warranting
an architecture change. If you have something else in mind, I'm afraid from
your recent emails I can't tell what it is.

Could you please create a document detailing:
- What precisely are the issues you see with some of the current
portability APIs
- What you think those APIs should look like instead
- What you think the path from the current implementation to your desired
state would look like
- If you're proposing major changes to the direction of work of a large
number of people, then please also elaborate in your document as to what
impact your proposal will have on the current work, or how this impact can
be minimized.

Please make sure to scan the pre-existing portability design documents to
see if similar concerns have already been discussed. As others have
pointed out in this thread several times, almost everything that you're
asking has already been discussed. If you believe the discussion of a
particular issue has been insufficient, feel free to re-raise it on the
mailing list, by linking to the previous discussion and elaborating on what
aspect you think has been missed; if you can't find the discussion of a
particular crucial design decision, feel free to raise that on the mailing
list too and people will be happy to help you find it.

I would also like to ask you to adjust the tone of comments such as
"Beam is being driven by an implementation instead of a clear and scalable
architecture", "All the work done looks like an implementation detail of
one runner+vendor corrupting all the project" and "a bad architecture which
doesn't embrace the community". These kinds of comments, to me, sound not
only unconstructively vague, but dismissive of the years of design and
implementation work done by dozens of people in this area. We are doing
something that has never been done before, and the APIs and implementation
are not perfect and will continue evolving, but there are much more
effective and friendly ways to point out use cases where they fail or ways
in which they can be improved.


>
>
>> Andrew
>>
>> On Fri, May 11, 2018 at 8:53 AM Romain Manni-Bucau <rmannibu...@gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Wed, May 9, 2018 at 5:41 PM, Eugene Kirpichov <kirpic...@google.com> wrote:
>>>
>>>>
>>>>
>>>> On Wed, May 9, 2018 at 1:08 AM Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Wed, May 9, 2018 at 12:57 AM, Henning Rohde <hero...@google.com> wrote:
>>>>>
>>>>>> There are indeed lots of possibilities for interesting Docker
>>>>>> alternatives with different tradeoffs and capabilities, but in general
>>>>>> both the runner and the SDK must support them for it to work. As
>>>>>> mentioned, Docker (as used in the container contract) is meant as a
>>>>>> flexible main option, but not necessarily the only option. I see no
>>>>>> problem with certain pipeline-SDK-runner combinations additionally
>>>>>> supporting a specialized setup. The pipeline can be a factor, because
>>>>>> some transforms might depend on aspects of the runtime environment --
>>>>>> such as system libraries or shelling out to a /bin/foo.
>>>>>>
>>>>>> The worker boot code is tied to the current container contract, so
>>>>>> pre-launched workers would presumably not use that code path and would
>>>>>> not be bound by its assumptions. In particular, such a setup might want
>>>>>> to invert who initiates the connection from the SDK worker to the
>>>>>> runner. Pipeline options and global state in the SDK and user-function
>>>>>> process might make it difficult to safely reuse worker processes across
>>>>>> pipelines, but it is doable in certain scenarios.
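>>>>>>
>>>>>> For illustration, a minimal sketch of such an inverted setup (just an
>>>>>> assumption, not the current boot contract): the pre-launched worker
>>>>>> opens a gRPC server and waits, and the runner dials it when work
>>>>>> arrives. The port and the omitted worker-side services are placeholders.
>>>>>>
>>>>>>     import io.grpc.Server;
>>>>>>     import io.grpc.ServerBuilder;
>>>>>>
>>>>>>     public class PreLaunchedWorkerSketch {
>>>>>>       public static void main(String[] args) throws Exception {
>>>>>>         int port = 50051; // assumed, advertised to the runner out of band
>>>>>>         Server server = ServerBuilder.forPort(port)
>>>>>>             // The worker-side services would be registered here with
>>>>>>             // .addService(...) - omitted in this sketch.
>>>>>>             .build()
>>>>>>             .start();
>>>>>>         // The worker now waits for the runner to connect to it, instead
>>>>>>         // of phoning home to an endpoint passed at launch time.
>>>>>>         server.awaitTermination();
>>>>>>       }
>>>>>>     }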
>>>>>>
>>>>>
>>>>> This is not that hard actually, and most Java environments do it.
>>>>>
>>>>> My main concerns are 1. being tied to an impl detail and 2. a bad
>>>>> architecture which doesn't embrace the community.
>>>>>
>>>> Could you please be more specific? Concerns about Docker dependency
>>>> have already been repeatedly addressed in this thread.
>>>>
>>>
>>> My concern is that Beam is being driven by an implementation instead of
>>> a clear and scalable architecture.
>>>
>>> The best demonstration is the protobuf usage, which is far from being the
>>> best choice for portability these days because of what its stack implies
>>> in several languages (nobody wants it in their classpath in Java/Scala
>>> these days, for instance, because of the conflicts and security care it
>>> requires). JSON, to cite just one alternative, is very well tooled and
>>> trivial to use with whatever lib you want to rely on, in any language or
>>> environment.
>>>
>>> Being portable (across languages) is a good goal but IMHO requires:
>>>
>>> 1. Runners in each language (otherwise fall back on JSR-223 and you are
>>> good with just a JSON facade - a small sketch of that fallback follows
>>> below)
>>> 2. A generic runner able to route each task to the right native runner
>>> 3. A way to run in a single runner when relevant (keep in mind most Java
>>> users don't even want to see Python or portable code or APIs in their
>>> classpath and runner)
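>>>
>>> For point 1, an illustrative-only sketch of that JSR-223 fallback: a Java
>>> runner hosting a user function written in a scripting language in the
>>> same process, with no cross-process protocol at all. The engine name and
>>> the shape of the user function are just assumptions for the example.
>>>
>>>     import javax.script.Invocable;
>>>     import javax.script.ScriptEngine;
>>>     import javax.script.ScriptEngineManager;
>>>
>>>     public class Jsr223FallbackSketch {
>>>       public static void main(String[] args) throws Exception {
>>>         // "nashorn" ships with Java 8; other JSR-223 engines (Jython,
>>>         // Groovy, ...) plug in the same way.
>>>         ScriptEngine engine = new ScriptEngineManager().getEngineByName("nashorn");
>>>
>>>         // The "user function" lives in the scripting language.
>>>         engine.eval("function process(element) { return element.toUpperCase(); }");
>>>
>>>         // The Java runner invokes it directly, element by element.
>>>         Object out = ((Invocable) engine).invokeFunction("process", "hello");
>>>         System.out.println(out); // HELLO
>>>       }
>>>     }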
>>>
>>>
>>>
>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Henning
>>>>>>
>>>>>> On Tue, May 8, 2018 at 3:51 PM Thomas Weise <t...@apache.org> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 5, 2018 at 3:58 PM, Robert Bradshaw <rober...@google.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> I would welcome changes to
>>>>>>>> https://github.com/apache/beam/blob/v2.4.0/model/pipeline/src/main/proto/beam_runner_api.proto#L730
>>>>>>>> that would provide alternatives to docker. One that comes to mind is
>>>>>>>> "I already brought up a worker (or workers) for you (which could be
>>>>>>>> the same process that handled pipeline construction in testing
>>>>>>>> scenarios); here's how to connect to it/them." Another option, which
>>>>>>>> would seem to appeal to you in particular, would be "the worker code
>>>>>>>> is linked into the runner's binary, use this process as the worker"
>>>>>>>> (though note that even for java-on-java, it can be advantageous to
>>>>>>>> shield the worker and runner code from each other's environments,
>>>>>>>> dependencies, and version requirements). This latter should still
>>>>>>>> likely use the FnApi to talk to itself (either over GRPC on local
>>>>>>>> ports, or possibly better via direct function calls eliminating the
>>>>>>>> RPC overhead altogether -- this is how the fast local runner in
>>>>>>>> Python works). There may be runner environments well controlled
>>>>>>>> enough that "start up the workers" could be specified as "run this
>>>>>>>> command line." We should make this environment message extensible to
>>>>>>>> alternatives other than "docker container url," though of course we
>>>>>>>> don't want the set of options to grow too large or we lose the
>>>>>>>> promise of portability unless every runner supports every protocol.
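>>>>>>>>
>>>>>>>> To illustrate the "linked into the runner's binary but still speaking
>>>>>>>> the FnApi" option, a minimal sketch using grpc-java's in-process
>>>>>>>> transport (this is not Beam's actual wiring; the Fn API services the
>>>>>>>> runner would register are omitted and only hinted at in comments):
>>>>>>>>
>>>>>>>>     import io.grpc.ManagedChannel;
>>>>>>>>     import io.grpc.Server;
>>>>>>>>     import io.grpc.inprocess.InProcessChannelBuilder;
>>>>>>>>     import io.grpc.inprocess.InProcessServerBuilder;
>>>>>>>>     import java.util.concurrent.TimeUnit;
>>>>>>>>
>>>>>>>>     public class InProcessFnApiSketch {
>>>>>>>>       public static void main(String[] args) throws Exception {
>>>>>>>>         String name = InProcessServerBuilder.generateName();
>>>>>>>>
>>>>>>>>         // Runner side: the Fn API control/data/state services would
>>>>>>>>         // be registered here via .addService(...) - omitted here.
>>>>>>>>         Server server = InProcessServerBuilder.forName(name)
>>>>>>>>             .directExecutor()
>>>>>>>>             .build()
>>>>>>>>             .start();
>>>>>>>>
>>>>>>>>         // Worker side, in the same process: same protocol, no TCP
>>>>>>>>         // port. The Fn API stubs would be created on this channel.
>>>>>>>>         ManagedChannel channel = InProcessChannelBuilder.forName(name)
>>>>>>>>             .directExecutor()
>>>>>>>>             .build();
>>>>>>>>
>>>>>>>>         channel.shutdown();
>>>>>>>>         channel.awaitTermination(1, TimeUnit.SECONDS);
>>>>>>>>         server.shutdownNow();
>>>>>>>>       }
>>>>>>>>     }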
>>>>>>>>
>>>>>>>>
>>>>>>> The pre-launched worker would be an interesting option, which might
>>>>>>> work well for a sidecar deployment.
>>>>>>>
>>>>>>> The current worker boot code, though, makes the assumption that the
>>>>>>> runner endpoint to phone home to is known when the process is
>>>>>>> launched. That doesn't work so well with a runner that establishes its
>>>>>>> endpoint dynamically. Also, the assumption is baked in that a worker
>>>>>>> will only serve a single pipeline (provisioning API etc.).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Thomas
>>>>>>>
>>>>>>>
>>>>>>
