On Tue, May 8, 2018 at 10:16, Robert Bradshaw <rober...@google.com> wrote:

> On Sun, May 6, 2018 at 1:30 AM Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
> > Wow, this mail should be on the website Robert, thanks for it
>
> > I still have one point I am trying to understand better: my view is that,
> > once submitted, the only perf-related point is when you hit a flow of
> > data. So a split can be slow, but it is not that big a deal. So a runner
> > integration only needs to optimize the process and nextElement logic,
> > right?
>
> Yes. In some streaming cases (e.g. microbatch like Spark or Dataflow) there
> may be many, many bundles, so the "control plane" part can't be /too/ slow,
> but it's not as performance critical.
>
> > It is almost always doable to batch that, with triggers and other
> > constraints. So the portable model is elegant but not designed to be
> > "fast" in the current state of the implementation.
>
> Actually, batching and streaming RPCs for the data plane have been there
> from the start, for these reasons.
>
> > So this all leads to 2 needs:
>
> > 1. Have some native runner for dev
> > 2. Have some bulk api for prod
>
> > In all cases this is decoupled from any runner, no? It could even be a
> > Beam subproject built on top of Beam, which would be very sane and ensure
> > a clear separation of concerns, no?
>
> The thing to do here would be to extend the Environment (message) to allow
> for alternatives, and then abstract out the creation of a bundle executor
> such that different ones could be instantiated based on this environment.
>

Agreed, so we need a generic runner delegating to "subrunners" (or runner
implementations) instead of implementing it in every runner. That sounds
very sane, scalable, and extensible/composable.
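
A minimal sketch of that delegation, in Python and purely illustrative: it
assumes a registry that maps an environment kind to a bundle-executor factory,
so that a generic runner can pick the executor from the Environment instead of
each runner hard-coding one. Environment, BundleExecutor, InProcessExecutor,
ExecutorRegistry and the "in_process"/"docker" kind strings are hypothetical
names invented for this example; they are not existing Beam classes or URNs.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Environment:
    """Hypothetical stand-in for the Environment message: a kind plus a payload."""
    kind: str          # e.g. "docker", "external", "in_process", "command"
    payload: str = ""  # e.g. container URL, host:port, or command line


class BundleExecutor:
    """Executes one bundle of elements through a user function."""

    def process_bundle(self, fn: Callable[[Any], Any], elements: Iterable[Any]) -> List[Any]:
        raise NotImplementedError


class InProcessExecutor(BundleExecutor):
    """Runs the user function with direct calls in the current process, no RPC."""

    def process_bundle(self, fn, elements):
        return [fn(element) for element in elements]


class ExecutorRegistry:
    """Maps an environment kind to a factory that builds a BundleExecutor."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[Environment], BundleExecutor]] = {}

    def register(self, kind: str, factory: Callable[[Environment], BundleExecutor]) -> None:
        self._factories[kind] = factory

    def create(self, env: Environment) -> BundleExecutor:
        # Reject environments nobody registered rather than guessing a fallback.
        if env.kind not in self._factories:
            raise ValueError(f"unsupported environment kind: {env.kind}")
        return self._factories[env.kind](env)


# A generic runner would only talk to the registry, never to a concrete executor.
registry = ExecutorRegistry()
registry.register("in_process", lambda env: InProcessExecutor())

executor = registry.create(Environment(kind="in_process"))
result = executor.process_bundle(lambda x: x * 2, [1, 2, 3])
```

A Docker-based or "already running worker" executor would be registered the
same way under its own kind, which keeps the choice of execution mechanism out
of the individual runners.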

Can we mark it as a backlog item and goal?



> > On May 6, 2018 at 00:59, "Robert Bradshaw" <rober...@google.com> wrote:
>
> >> Portability, at its core, is providing a spec for any runner to talk to
> >> any SDK. I personally think it's done a great job in cleaning up the
> >> model by forcing us to define a clean boundary (as specified at
> >> https://github.com/apache/beam/tree/master/model ) between these two
> >> components (even if the implementations of one or the other are
> >> (temporarily, I hope for the most part) complicated). The pipeline (on
> >> the runner submission side) and work execution (on what has
> >> traditionally been called the Fn API side) have concrete
> >> platform-independent descriptions, rather than being a set of Java
> >> classes.
>
> >> Currently, the portion that lives on the "runner" side of this boundary
> >> is often shared among Java runners (via libraries like runners core),
> >> but it is all still part of each runner, and because of this it removes
> >> the requirement for the runner to be Java, just as it removes the
> >> requirement for the SDK to speak Java. (For example, I think a Python
> >> Dask runner makes a lot of sense, Dataflow may decide to implement
> >> larger portions of its runner in Go or C++ or even behind a service, and
> >> I've used the Python ULRunner to run the Java SDK over the Fn API for
> >> testing and development purposes.)
>
> >> There is also the question of "why Docker?" I actually don't see Docker
> >> as all that intrinsic to the protocol; one only needs to be able to
> >> define and bring up workers that communicate on specified ports. Docker
> >> happens to be a fairly well supported way to package up an arbitrary
> >> chunk of code (in any language), together with its nearly arbitrarily
> >> specified dependencies/environment, in a way that's well specified and
> >> easy to start up.
>
> >> I would welcome changes to
> >> https://github.com/apache/beam/blob/v2.4.0/model/pipeline/src/main/proto/beam_runner_api.proto#L730
> >> that would provide alternatives to Docker. One that comes to mind is "I
> >> already brought up a worker (or workers) for you (which could be the
> >> same process that handled pipeline construction in testing scenarios);
> >> here's how to connect to it/them." Another option, which would seem to
> >> appeal to you in particular, would be "the worker code is linked into
> >> the runner's binary; use this process as the worker" (though note that
> >> even for Java-on-Java, it can be advantageous to shield the worker and
> >> runner code from each other's environments, dependencies, and version
> >> requirements). This latter option should still likely use the Fn API to
> >> talk to itself (either over gRPC on local ports, or possibly better via
> >> direct function calls, eliminating the RPC overhead altogether; this is
> >> how the fast local runner in Python works). There may be runner
> >> environments well controlled enough that "start up the workers" could be
> >> specified as "run this command line." We should make this environment
> >> message extensible to alternatives other than "docker container url,"
> >> though of course we don't want the set of options to grow too large, or
> >> we lose the promise of portability unless every runner supports every
> >> protocol.
>
> >> Of course, the runner is always free to execute any Fn for which it
> >> completely understands the URN and the environment any way it pleases,
> >> e.g. directly in process, or even via a lighter-weight mechanism like
> >> Jython or Graal, rather than asking an external process to do it. But we
> >> need a lowest common denominator for executing arbitrary URNs that
> >> runners are not expected to understand.
>
> >> As an aside, there are also technical limitations in implementing
> >> portability by simply requiring all runners to be Java and the portable
> >> layer simply being wrappers of UserFnInLanguageX in an equivalent
> >> UserFnObjectInJava, executing everything as if it were pure Java. In
> >> particular, the overheads of unnecessarily crossing the language
> >> boundaries many times in a single fused graph are often prohibitive.
>
> >> Sorry for the long email, but hopefully this helps shed some light on
> (at
> >> least how I see) the portability effort (at the core of the Beam vision
> >> statement) as well as concrete actions we can take to decouple it from
> >> specific technologies.
>
> >> - Robert
>
>
> >> On Sat, May 5, 2018 at 2:06 PM Romain Manni-Bucau <
> rmannibu...@gmail.com>
> >> wrote:
>
> >> > All are good points.
>
> >> > The only "?" I still have is: why doesn't Beam use its visitor API to
> >> > make portability transversal to all runners by "mutating" the user
> >> > model before translation? Technically it sounds easy and avoids
> >> > hacking every implementation. Was it tried, and did it fail?
>
> >> > On May 5, 2018 at 22:50, "Thomas Weise" <t...@apache.org> wrote:
>
> >> >> Docker isn't a silver bullet and may not be the best choice for all
> >> environments (I'm also looking at potentially launching SDK workers in a
> >> different way), but AFAIK there has not been any alternative proposal
> for
> >> default SDK execution that can handle all of Python, Go and Java.
>
> >> >> Regardless of the default implementation, we should strive to keep
> the
> >> implementation modular so users can plug in their own replacement as
> >> needed. Looking at the prototype implementation, Docker comes downstream
> of
> >> FlinkExecutableStageFunction, and it will be possible to supply a custom
> >> implementation by making the translator pluggable (which I intend to
> work
> >> on once backporting to master is complete), and possibly
> >> "SDKHarnessManager" itself can also be swapped out.
>
> >> >> I would also prefer that for Flink and other Java based runners we
> >> retain the option to inline executable stages that are in Java. I would
> >> expect a good number of use cases to benefit from direct execution in
> the
> >> task manager, and it may be good to offer the user that optimization.
>
> >> >> Thanks,
> >> >> Thomas
>
>
>
> >> >> On Sat, May 5, 2018 at 12:54 PM, Eugene Kirpichov <
> kirpic...@google.com>
> >> wrote:
>
> >> >>> To add on that: Romain, if you are really excited about Graal as a
> >> project, here are some constructive suggestions as to what you can do on
> a
> >> reasonably short timeframe:
> >> >>> - Propose/prototype a design for writing UDFs in Beam SQL using
> Graal
> >> >>> - Go through the portability-related design documents, come up with
> a
> >> more precise assessment of what parts are actually dependent on Docker's
> >> container format and/or on Docker itself, and propose a plan for
> untangling
> >> this dependency and opening the door to other mechanisms of
> cross-language
> >> execution
>
> >> >>> On Sat, May 5, 2018 at 12:50 PM Eugene Kirpichov <
> kirpic...@google.com>
> >> wrote:
>
> >> >>>> Graal is a very young project, currently nowhere near the level of
> >> maturity or completeness as to be sufficient for Beam to fully bet its
> >> portability vision on it:
> >> >>>> - Graal currently only claims to support Java and Javascript, with
> >> Ruby and R in the status of "some applications may run", Python support
> >> "just beginning", and Go lacking altogether.
> >> >>>> - Regarding existing production usage, the Graal FAQ says it is "a
> >> project with new innovative technology in its early stages."
>
> >> >>>> That said, as Graal matures, I think it would be reasonable to keep
> an
> >> eye on it as a potential future lightweight alternative to containers
> for
> >> pipelines where Graal's level of support is sufficient for this
> particular
> >> pipeline.
>
> >> >>>> Please also keep in mind that execution of user code is only a
> small
> >> part of the overall portability picture, and dependency on Docker is an
> >> even smaller part of that (there is only 1 mention of the word "Docker"
> in
> >> all of Beam's portability protos, and the mention is in an out-of-date
> TODO
> >> comment). I hope this addresses your concerns.
>
> >> >>>> On Sat, May 5, 2018 at 11:49 AM Romain Manni-Bucau <
> >> rmannibu...@gmail.com> wrote:
>
> >> >>>>> Agree
>
> >> >>>>> The JVM is still mainstream for big data, and it is trivial to have
> >> >>>>> a remote facade to support native code, but there is no point in
> >> >>>>> having it in runners; it only concerns some particular transforms,
> >> >>>>> or even just DoFns and sources...
>
>
> >> >>>>> On May 5, 2018 at 19:03, "Andrew Pilloud" <apill...@google.com> wrote:
>
> >> >>>>>> Thanks for the examples earlier, I think Hazelcast is a great
> >> example of something portability might make more difficult. I'm not
> working
> >> on portability, but my understanding is that the data sent to the runner
> is
> >> a blob of code and the name of the container to run it in. A runner with
> a
> >> native language (java on Hazelcast for example) could run the code
> directly
> >> without the container if it is in a language it supports. So when
> Hazelcast
> >> sees a known java container specified, it just loads the java blob and
> runs
> >> it. When it sees another container it rejects the pipeline. You could
> use
> >> Graal in the Hazelcast runner to do this for a number of languages. I
> would
> >> expect that this could also be done in the direct runner, which
> similarly
> >> provides a native java environment, so portable Java pipelines can be
> >> tested without docker?
>
> >> >>>>>> For another way to frame this: if Beam was originally written in
> Go,
> >> we would be having a different discussion. A pipeline written entirely
> in
> >> java wouldn't be possible, so instead to enable Hazelcast, we would have
> to
> >> be able to run the java from portability without running the container.
>
> >> >>>>>> Andrew
>
> >> >>>>>> On Sat, May 5, 2018 at 1:48 AM Romain Manni-Bucau <
> >> rmannibu...@gmail.com> wrote:
>
>
>
> >> >>>>>>> 2018-05-05 9:27 GMT+02:00 Ismaël Mejía <ieme...@gmail.com>:
>
> >> >>>>>>>> Graal would not be a viable solution for the reasons Henning
> and
> >> Andrew
> >> >>>>>>>> mentioned, or put in other words, when users choose a
> programming
> >> language
> >> >>>>>>>> they don’t choose only a ‘friendly’ syntax or programming
> model,
> >> they
> >> >>>>>>>> choose also the ecosystem that comes with it, and the libraries
> >> that make
> >> >>>>>>>> their life easier. However isolating these user
> >> libraries/dependencies is a
> >> >>>>>>>> hard problem and so far the standard solution to this problem
> is
> >> to use
> >> >>>>>>>> operating systems containers via docker.
>
>
> >> >>>>>>> Graal solves that, Ismaël. It is the same kind of experience as
> >> >>>>>>> running npm libs on Nashorn, but with a more unified API to run
> >> >>>>>>> software in any language.
>
>
>
> >> >>>>>>>> The Beam vision from day zero is to run pipelines written in
> >> multiple
> >> >>>>>>>> languages in runners in multiple systems, and so far we are not
> >> doing this
> >> >>>>>>>> in particular in the Apache runners. The portability work is
> the
> >> cleanest
> >> >>>>>>>> way to achieve this vision given the constraints.
>
>
> >> >>>>>>> Hmm, did I read it wrong, and we don't have specific integration
> >> >>>>>>> of the portable API in runners? This is what is messing up the
> >> >>>>>>> runners and limiting Beam adoption on existing runners.
> >> >>>>>>> The portable API is a feature buildable on top of runners, not in
> >> >>>>>>> runners.
> >> >>>>>>> Just as a runner implementing the 5-6 primitives can run anything,
> >> >>>>>>> the portable API should rely on that alone and not require more
> >> >>>>>>> integration.
> >> >>>>>>> That doesn't prevent deeper integrations for higher-level
> >> >>>>>>> primitives that exist in runners, but that is not the case for
> >> >>>>>>> runners today, so it shouldn't exist IMHO.
>
>
>
> >> >>>>>>>> I agree however that for the Java SDK to Java runner case this
> can
> >> >>>>>>>> represent additional pain, docker ideally should not be a
> >> requirement for
> >> >>>>>>>> Java users with the Direct runner and debugging a pipeline
> should
> >> be as
> >> >>>>>>>> easy as it is today. I think the Universal Local Runner exists
> to
> >> cover
> >> >>>>>>>> the Portable case, but after looking at this JIRA I am not sure
> if
> >> >>>>>>>> unification is coming (and by consequence if docker would be
> >> mandatory).
> >> >>>>>>>> https://issues.apache.org/jira/browse/BEAM-4239
>
> >> >>>>>>>> I suppose for the distributed runners that they must implement
> the
> >> full
> >> >>>>>>>> Portability APIs to be considered Beam multi language compliant
> >> but they
> >> >>>>>>>> can prefer for performance reasons to translate without the
> >> portability
> >> >>>>>>>> APIs the Java to Java case.
>
>
>
> >> >>>>>>> This is my issue: language portability must NOT impact runners at
> >> >>>>>>> all; it is just a way to forward primitives to a runner.
> >> >>>>>>> See it as a layer rewriting the pipeline and submitting it. There
> >> >>>>>>> is no need to modify any runner.
>
>
> >> >>>>>>>> On Sat, May 5, 2018 at 9:11 AM Reuven Lax <re...@google.com>
> wrote:
>
> >> >>>>>>>> > A beam cluster with the spark runner would include a spark
> >> cluster, plus
> >> >>>>>>>> what's needed for portability, plus the beam sdk.
>
> >> >>>>>>>> > On Fri, May 4, 2018, 11:55 PM Romain Manni-Bucau <
> >> rmannibu...@gmail.com>
> >> >>>>>>>> wrote:
>
>
>
> >> >>>>>>>> >> On May 5, 2018 at 08:43, "Reuven Lax" <re...@google.com> wrote:
>
> >> >>>>>>>> >> I don't believe we enforce docker anywhere. In fact if
> someone
> >> wanted to
> >> >>>>>>>> run an all-windows beam cluster, they would probably not use
> >> docker for
> >> >>>>>>>> their runner (docker runs on Windows, but not efficiently).
>
>
>
> >> >>>>>>>> >> Or doesn't run at all sometimes - a colleague hit that yesterday :(.
>
> >> >>>>>>>> >> What is a "Beam cluster", as opposed to a Spark or Flink
> >> >>>>>>>> >> cluster? How would it work on Windows servers?
>
>
> >> >>>>>>>> >> On Fri, May 4, 2018, 11:19 PM Romain Manni-Bucau <
> >> rmannibu...@gmail.com>
> >> >>>>>>>> wrote:
>
>
>
> >> >>>>>>>> >>> 2018-05-05 2:33 GMT+02:00 Andrew Pilloud <
> apill...@google.com>:
>
> >> >>>>>>>> >>>> What docker really buys is a package format and runtime
> >> environment
> >> >>>>>>>> that is language and operating system agnostic. The docker
> >> packaging and
> >> >>>>>>>> runtime format is the de facto standard for portable
> applications
> >> such as
> >> >>>>>>>> this, and there is a group trying to turn it into an actual
> >> standard.
>
> >> >>>>>>>> >>>> I would agree with you that dockerd has become bloated but
> >> there are
> >> >>>>>>>> projects that solve that. There is no longer lock-in to
> dockerd,
> >> there are
> >> >>>>>>>> package format compatible docker replacements that eliminate
> the
> >> >>>>>>>> performance issues and overhead associated with docker. CRI-O (
> >> >>>>>>>> https://github.com/kubernetes-incubator/cri-o) is a really
> cool
> >> RedHat
> >> >>>>>>>> project which is a minimalist replacement for docker. I was
> >> recently
> >> >>>>>>>> working at a startup where I migrated our "data mover"
> appliance
> >> from
> >> >>>>>>>> Docker to CRI-O. Our application was able to get direct access
> to
> >> the
> >> >>>>>>>> ethernet driver and block devices which enabled a huge
> performance
> >> boost
> >> >>>>>>>> but we were also able to run containers produced by docker
> without
> >> >>>>>>>> modification.
>
> >> >>>>>>>> >>>> You mention that docker is "detail of one runner+vendor
> >> corrupting all
> >> >>>>>>>> the project and adding complexity and work to everyone". It
> sounds
> >> like you
> >> >>>>>>>> have a specific example you'd like to share? Is there a runner
> >> that is
> >> >>>>>>>> unable to move to portability because of docker?
>
>
> >> >>>>>>>> >>> The IBM one, for instance, and some custom ones like a
> >> >>>>>>>> >>> Hazelcast-based one, etc. More generally, any runner developed
> >> >>>>>>>> >>> outside Beam itself; even if we take a snapshot today, most of
> >> >>>>>>>> >>> Beam's own runners have the same pitfall.
>
> >> >>>>>>>> >>> Note: I never said Docker was a bad technology. Let me try to
> >> >>>>>>>> >>> clarify.
>
> >> >>>>>>>> >>> The main issue is that you enforce Docker usage, which is still
> >> >>>>>>>> >>> trendy. It is like Scala, which was promising to kill Java;
> >> >>>>>>>> >>> check what it does today...
> >> >>>>>>>> >>> Tooling for it is starting to appear, but it also has a heavy
> >> >>>>>>>> >>> impact on the deployment side, and for a good number of Beam
> >> >>>>>>>> >>> users who deploy outside the cloud it is an issue.
> >> >>>>>>>> >>> Keep in mind that Beam is embeddable by design; it is not a
> >> >>>>>>>> >>> runner environment. With the Docker choice it imposes an
> >> >>>>>>>> >>> environment, which is inconsistent with Beam's own design, and
> >> >>>>>>>> >>> this is where this choice blocks.
>
>
>
> >> >>>>>>>> >>>> Andrew
>
> >> >>>>>>>> >>>> On Fri, May 4, 2018 at 4:32 PM Henning Rohde <
> >> hero...@google.com>
> >> >>>>>>>> wrote:
>
> >> >>>>>>>> >>>>> Romain,
>
> >> >>>>>>>> >>>>> Docker, unlike selinux, solves a great number of tangible
> >> problems
> >> >>>>>>>> for us with IMO a relatively small tax. It does not have to be
> the
> >> only
> >> >>>>>>>> way. Some of the concerns you bring up along with possibilities
> >> were also
> >> >>>>>>>> discussed here:
> >> >>>>>>>> https://s.apache.org/beam-fn-api-container-contract.
> >> I
> >> >>>>>>>> encourage you to take a look.
>
> >> >>>>>>>> >>>>> Thanks,
> >> >>>>>>>> >>>>>   Henning
>
>
> >> >>>>>>>> >>>>> On Fri, May 4, 2018 at 3:18 PM Romain Manni-Bucau <
> >> >>>>>>>> rmannibu...@gmail.com> wrote:
>
>
>
> >> >>>>>>>> >>>>>> On May 4, 2018 at 21:31, "Henning Rohde" <hero...@google.com> wrote:
>
> >> >>>>>>>> >>>>>> I disagree with the characterization of docker and the
> >> implications
> >> >>>>>>>> made towards portability. Graal looks like a neat project (and
> I
> >> never
> >> >>>>>>>> thought I would live to see the phrase "Practical Partial
> >> Evaluation" ..),
> >> >>>>>>>> but it doesn't address the needs of portability. In addition to
> >> Luke's
> >> >>>>>>>> examples, Go and most other languages don't work on it either.
> >> Docker
> >> >>>>>>>> containers also address packaging, OS dependencies, conflicting
> >> versions
> >> >>>>>>>> and distribution aspects in addition to truly universal
> language
> >> support.
>
>
> >> >>>>>>>> >>>>>> This is wrong: Docker also has its conflicts and is not
> >> >>>>>>>> >>>>>> universal (it fails easily on Windows and Mac, as host or not;
> >> >>>>>>>> >>>>>> cloud vendors add layers that limit or corrupt it; and it is an
> >> >>>>>>>> >>>>>> imposed infra constraint and a vendor lock-in that is not
> >> >>>>>>>> >>>>>> welcome in Beam IMHO).
>
> >> >>>>>>>> >>>>>> This is my main concern. All the work done looks like an
> >> >>>>>>>> implementation detail of one runner+vendor corrupting all the
> >> project and
> >> >>>>>>>> adding complexity and work to everyone instead of keeping it
> >> localised
> >> >>>>>>>> (technically it is possible).
>
> >> >>>>>>>> >>>>>> Would you accept it if I forced you to use SELinux? Using Docker
> >> >>>>>>>> >>>>>> is the same kind of constraint.
>
>
> >> >>>>>>>> >>>>>> That said, it's entirely fine for some runners to use
> >> Jython, Graal,
> >> >>>>>>>> etc to provide a specialized offering similar to the direct
> >> runners, but it
> >> >>>>>>>> would be disjoint from portability IMO.
>
> >> >>>>>>>> >>>>>> On Fri, May 4, 2018 at 10:14 AM Romain Manni-Bucau <
> >> >>>>>>>> rmannibu...@gmail.com> wrote:
>
>
>
> >> >>>>>>>> >>>>>>> On May 4, 2018 at 17:55, "Lukasz Cwik" <lc...@google.com> wrote:
>
> >> >>>>>>>> >>>>>>> I did take a look at Graal a while back when thinking
> >> about how
> >> >>>>>>>> execution environments could be defined, my concerns were
> related
> >> to it not
> >> >>>>>>>> supporting all of the features of a language.
> >> >>>>>>>> >>>>>>> For example, it's typical for Python to load and call native
> >> >>>>>>>> >>>>>>> libraries, and Graal can only execute C/C++ code that has been
> >> >>>>>>>> >>>>>>> compiled to LLVM.
> >> >>>>>>>> >>>>>>> Also, a good amount of people interested in using ML
> >> libraries will
> >> >>>>>>>> want access to GPUs to improve performance which I believe that
> >> Graal can't
> >> >>>>>>>> support.
>
> >> >>>>>>>> >>>>>>> It can be a very useful way to run simple lambda functions
> >> >>>>>>>> >>>>>>> written in some language directly without needing to use a
> >> >>>>>>>> >>>>>>> Docker environment, but you could probably use something even
> >> >>>>>>>> >>>>>>> lighter weight than Graal that is language specific, like
> >> >>>>>>>> >>>>>>> Jython.
>
>
>
> >> >>>>>>>> >>>>>>> Right, the JSR 223 implementation works very well, but you can
> >> >>>>>>>> >>>>>>> also get a perf boost using native bindings (like the V8 Java
> >> >>>>>>>> >>>>>>> binding for JS, for instance). It is far more efficient than
> >> >>>>>>>> >>>>>>> Docker most of the time and not code-intrusive at all in
> >> >>>>>>>> >>>>>>> runners, so likely easier to adopt and maintain. That said,
> >> >>>>>>>> >>>>>>> everything is doable behind JSR 223, so it may not be a big
> >> >>>>>>>> >>>>>>> deal in terms of API. We just need to ensure the portability
> >> >>>>>>>> >>>>>>> work stays clean and actually portable and doesn't impact
> >> >>>>>>>> >>>>>>> runners the way the PoCs done so far did.
>
> >> >>>>>>>> >>>>>>> Works for me.
>
>
> >> >>>>>>>> >>>>>>> On Thu, May 3, 2018 at 10:05 PM Romain Manni-Bucau <
> >> >>>>>>>> rmannibu...@gmail.com> wrote:
>
> >> >>>>>>>> >>>>>>>> Hi guys
>
> >> >>>>>>>> >>>>>>>> For some time there have been efforts to add language-portable
> >> >>>>>>>> >>>>>>>> support to Beam, but I can't really find a case where it
> >> >>>>>>>> >>>>>>>> "works" while being based on Docker, except for some
> >> >>>>>>>> >>>>>>>> vendor-specific infra.
>
> >> >>>>>>>> >>>>>>>> Current solution:
>
> >> >>>>>>>> >>>>>>>> 1. Is runner intrusive (which is bad for Beam and prevents
> >> >>>>>>>> >>>>>>>> adoption by big data vendors)
> >> >>>>>>>> >>>>>>>> 2. Is based on Docker (which assumes a runtime environment, is
> >> >>>>>>>> >>>>>>>> very ops/infra intrusive, and is quite often too $$ for what
> >> >>>>>>>> >>>>>>>> it brings)
>
> >> >>>>>>>> >>>>>>>> Has anyone had a look at Graal, which seems a way to make the
> >> >>>>>>>> >>>>>>>> feature doable in a lighter manner, and optimized compared to
> >> >>>>>>>> >>>>>>>> the default JSR 223 implementations?
>
