On Fri, May 11, 2018 at 18:15, Andrew Pilloud <apill...@google.com> wrote:
> Json and Protobuf aren't the same thing. Json is for exchanging
> unstructured data, Protobuf is for exchanging structured data. The point
> of Portability is to define a protocol for exchanging structured messages
> across languages. What do you propose using on top of Json to define
> message structure?

I'm fine with protobuf contracts, just not with all the rest (libs*). JSON
has the advantage of not requiring much from consumers and of being easy to
integrate and proxy. Protobuf imposes a lot on that layer, which will be
typed by the runner anyway, so there is no need for two typing layers.

> I'd like to see the generic runner rewritten in Golang so we can
> eliminate the significant overhead imposed by the JVM. I would argue that
> Go is the best language for low overhead infrastructure, and is already
> widely used by projects in this space such as Docker, Kubernetes,
> InfluxDB. Even SQL can take advantage of this. For example, several
> runners could be passed raw SQL and use their own SQL engines to
> implement more efficient transforms than generic Beam can. Users will
> save significant $$$ on infrastructure by not having to involve the JVM
> at all.

Yes... or no. JVM overhead is very low for such infra, less than 64M of RAM
and almost no CPU, so it will not help much for clusters or long-lived
processes like the ones we are talking about. Also, the Beam community is
Java - don't answer that it is Python or Go without checking ;). I'm not
sure adding a new language will help, or give the project a face people
will want to contribute to or use. Currently I'd say the way the router
runner is implemented is a detail, but the choice to rethink the current
impl is a crucial architectural point. No issue having N router impls
either: one in Java, one in Go (though I thought we had very few Go
resources/lovers in Beam?), one in Python where it would make a lot of
sense and would show a router without an actual primitive impl (delegating
to the direct Java runner), etc.
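[To illustrate the question raised above - what one might layer on top of
JSON to get message structure without a protobuf stack - here is a minimal
sketch. The envelope shape, urn value, and field names are hypothetical,
not Beam's actual messages; the point is that a type-tagged JSON envelope
plus stdlib validation needs no generated stubs or extra libraries.]

```python
import json

# Hypothetical envelope: a "urn" field names the message type and
# "payload" carries the typed body. This layers structure on plain JSON
# using only the standard library -- no generated stubs required.
SCHEMA = {"urn": str, "payload": dict}

def decode(raw):
    """Parse and minimally validate a JSON envelope."""
    msg = json.loads(raw)
    for field, typ in SCHEMA.items():
        if not isinstance(msg.get(field), typ):
            raise ValueError("bad or missing field: %s" % field)
    return msg

raw = '{"urn": "example:transform:v1", "payload": {"fn": "MyFn"}}'
msg = decode(raw)
print(msg["urn"])  # example:transform:v1
```

[Any language with a JSON parser can consume or proxy such a message,
which is the integration advantage being argued; the trade-off against
protobuf is that field typing is only checked at runtime.]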
But one thing at a time: is anyone against stopping the current impl
track, reverting it, and moving to a higher level runner?

> Andrew
>
> On Fri, May 11, 2018 at 8:53 AM Romain Manni-Bucau
> <rmannibu...@gmail.com> wrote:
>
>> On Wed, May 9, 2018 at 17:41, Eugene Kirpichov <kirpic...@google.com>
>> wrote:
>>
>>> On Wed, May 9, 2018 at 1:08 AM Romain Manni-Bucau
>>> <rmannibu...@gmail.com> wrote:
>>>
>>>> On Wed, May 9, 2018 at 00:57, Henning Rohde <hero...@google.com>
>>>> wrote:
>>>>
>>>>> There are indeed lots of possibilities for interesting docker
>>>>> alternatives with different tradeoffs and capabilities, but in
>>>>> general both the runner and the SDK must support them for it to
>>>>> work. As mentioned, docker (as used in the container contract) is
>>>>> meant as a flexible main option but not necessarily the only option.
>>>>> I see no problem with certain pipeline-SDK-runner combinations
>>>>> additionally supporting a specialized setup. The pipeline can be a
>>>>> factor, because some transforms might depend on aspects of the
>>>>> runtime environment -- such as system libraries or shelling out to a
>>>>> /bin/foo.
>>>>>
>>>>> The worker boot code is tied to the current container contract, so
>>>>> pre-launched workers would presumably not use that code path and
>>>>> would not be bound by its assumptions. In particular, such a setup
>>>>> might want to invert who initiates the connection from the SDK
>>>>> worker to the runner. Pipeline options and global state in the SDK
>>>>> and user functions process might make it difficult to safely reuse
>>>>> worker processes across pipelines, but it is doable in certain
>>>>> scenarios.
>>>>
>>>> This is not that hard actually, and most Java envs do it.
>>>>
>>>> The main concerns are 1. being tied to an impl detail and 2. a bad
>>>> architecture which doesn't embrace the community.
>>>
>>> Could you please be more specific?
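[The "invert who initiates the connection" idea from Henning's message
above can be sketched as follows. Plain TCP stands in for the real gRPC
Fn API channel, and the message strings are made up; the point is just the
inverted contract: a pre-launched worker listens, and the runner connects
once its own endpoint is known, instead of the worker phoning home to an
endpoint fixed at launch time.]

```python
import socket
import threading

def worker(server_sock):
    """Pre-launched SDK worker: waits for the runner to connect."""
    conn, _ = server_sock.accept()       # runner initiates the connection
    with conn:
        task = conn.recv(1024).decode()  # e.g. a process-bundle request
        conn.sendall(("done:" + task).encode())

server = socket.socket()
server.bind(("127.0.0.1", 0))            # dynamic port; runner discovers it
server.listen(1)
t = threading.Thread(target=worker, args=(server,))
t.start()

# Runner side: connect to the pre-launched worker and submit a task.
runner = socket.create_connection(server.getsockname())
runner.sendall(b"bundle-1")
response = runner.recv(1024).decode()
print(response)                          # done:bundle-1
runner.close()
t.join()
server.close()
```

[Such a worker could stay up across pipelines, which is where the caveats
about pipeline options and global state mentioned above come in.]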
>>> Concerns about Docker dependency have already been repeatedly
>>> addressed in this thread.
>>
>> My concern is that Beam is being driven by an implementation instead of
>> a clear and scalable architecture.
>>
>> The best demonstration is the protobuf usage, which is far from being
>> the best choice for portability these days because of what its stack
>> drags into several languages (nobody wants it in their classpath in
>> Java/Scala these days, for instance, because of conflicts or the
>> security care it requires). JSON, to cite just one alternative, is very
>> well tooled and trivial to use with whatever lib you want to rely on,
>> in any language or environment.
>>
>> Being portable (language-wise) is a good goal but IMHO requires:
>>
>> 1. Runners in each language (otherwise fall back on the JSR-223 and you
>> are good with just a JSON facade)
>> 2. A generic runner able to route each task to the right native runner
>> 3. A way to run in a single runner when relevant (keep in mind most
>> Java users don't even want to see Python or portable code or API in
>> their classpath and runner)
>>
>>>>> Henning
>>>>>
>>>>> On Tue, May 8, 2018 at 3:51 PM Thomas Weise <t...@apache.org> wrote:
>>>>>
>>>>>> On Sat, May 5, 2018 at 3:58 PM, Robert Bradshaw
>>>>>> <rober...@google.com> wrote:
>>>>>>
>>>>>>> I would welcome changes to
>>>>>>> https://github.com/apache/beam/blob/v2.4.0/model/pipeline/src/main/proto/beam_runner_api.proto#L730
>>>>>>> that would provide alternatives to docker (one of which comes to
>>>>>>> mind is "I already brought up a worker(s) for you (which could be
>>>>>>> the same process that handled pipeline construction in testing
>>>>>>> scenarios), here's how to connect to it/them.") Another option,
>>>>>>> which would seem to appeal to you in particular, would be "the
>>>>>>> worker code is linked into the runner's binary, use this process
>>>>>>> as the worker" (though note that even for java-on-java, it can be
>>>>>>> advantageous to shield the worker and runner code from each
>>>>>>> other's environments, dependencies, and version requirements.)
>>>>>>> This latter should still likely use the FnApi to talk to itself
>>>>>>> (either over GRPC on local ports, or possibly better via direct
>>>>>>> function calls eliminating the RPC overhead altogether -- this is
>>>>>>> how the fast local runner in Python works). There may be runner
>>>>>>> environments well controlled enough that "start up the workers"
>>>>>>> could be specified as "run this command line." We should make this
>>>>>>> environment message extensible to other alternatives than "docker
>>>>>>> container url," though of course we don't want the set of options
>>>>>>> to grow too large or we lose the promise of portability unless
>>>>>>> every runner supports every protocol.
>>>>>>
>>>>>> The pre-launched worker would be an interesting option, which might
>>>>>> work well for a sidecar deployment.
>>>>>>
>>>>>> The current worker boot code, though, makes the assumption that the
>>>>>> runner endpoint to phone home to is known when the process is
>>>>>> launched. That doesn't work so well with a runner that establishes
>>>>>> its endpoint dynamically. Also, the assumption is baked in that a
>>>>>> worker will only serve a single pipeline (provisioning API etc.).
>>>>>>
>>>>>> Thanks,
>>>>>> Thomas
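[Robert's suggestion that the environment message be "extensible to other
alternatives than docker container url" amounts to a tagged union with a
handler per environment kind. The sketch below is illustrative only: the
urns, payload fields, and handler registry are made-up names, not Beam's
actual proto definitions, but they show how docker, "run this command
line," and pre-launched-worker environments can coexist behind one
discriminator.]

```python
# Registry mapping an environment urn to a launcher for that kind.
ENVIRONMENT_HANDLERS = {}

def handles(urn):
    """Decorator registering a handler for one environment urn."""
    def register(fn):
        ENVIRONMENT_HANDLERS[urn] = fn
        return fn
    return register

@handles("beam:env:docker:v1")
def docker_env(payload):
    # The docker case: boot a container from the given image.
    return "docker run %s" % payload["container_image"]

@handles("beam:env:process:v1")
def process_env(payload):
    # The "run this command line" case for well-controlled environments.
    return payload["command"]

@handles("beam:env:external:v1")
def external_env(payload):
    # The pre-launched worker case: nothing to start, just connect.
    return "connect to %s" % payload["endpoint"]

def start_worker(environment):
    """Dispatch on the urn; unknown urns fail, which is the portability
    risk Robert notes if the option set grows too large."""
    return ENVIRONMENT_HANDLERS[environment["urn"]](environment["payload"])

print(start_worker({"urn": "beam:env:external:v1",
                    "payload": {"endpoint": "localhost:50000"}}))
# connect to localhost:50000
```

[A runner only needs handlers for the environment kinds it supports, so
the contract stays small while still leaving room for alternatives to
docker.]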