If all engines were identical, having a shared optimizer would be useful. Having a proxy runner that performs optimizations before submission to an actual engine-specific runner has downsides in both directions:
- obscures the ability of engine-specific runners to optimize the Beam primitives, because they only receive the post-optimized graph
- has to be extremely conservative in its optimizations, because it does not know about the semantics of the underlying engine

Building it as libraries lets engine-specific runners do what is best for their engine, while still maximizing reuse.

Kenn

On Thu, May 17, 2018 at 11:43 AM Robert Burke <rob...@frantil.com> wrote:

> The approach you're looking for sounds like the user's Runner of Choice would use a user-side version of the runner core, without changing the Runner of Choice?
>
> So a user would update their version of the SDK, and the runner would have to pull the core component from the user pipeline?
>
> That sounds like it increases pipeline size and decreases pipeline portability, especially for pipelines that are not in the same language as the runner-core, such as for Python and Go.
>
> It's not clear to me what runners would be doing in that scenario either. Do you have a proposal about where the interface boundaries would be?
>
> On Wed, May 16, 2018, 10:05 PM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>
>> The runner core doesn't fully align on that or, rephrased more accurately, it doesn't go as far as it could for me. Having to call it is still an issue, since it requires a runner update instead of getting the new feature for free. The next step is probably *one* runner into which implementations plug their translations. It would reverse the current pattern and prepare Beam for the future. One good example of such an implementation is the SDF, which can "just" reuse DoFn primitives to wire its support through runners.
>>
>> On Thu, May 17, 2018 at 02:01, Jesse Anderson <je...@bigdatainstitute.io> wrote:
>>
>>> This -> "I'd like that each time you think that you ask yourself "does it need?"."
>>>
>>> On Wed, May 16, 2018 at 4:53 PM Robert Bradshaw <rober...@google.com> wrote:
>>>
>>>> Thanks for your email, Romain. It helps understand your goals and where you're coming from. I'd also like to see a thinner core, and agree it's beneficial to reduce dependencies where possible, especially when supporting the usecase where the pipeline is constructed in an environment other than an end-user's main.
>>>>
>>>> It seems a lot of the portability work, despite being on the surface driven by multi-language, aligns well with many of these goals. For example, all the work going on in runners-core to provide a rich library that all (Java, and perhaps non-Java) runners can leverage to do DAG preprocessing (fusion, combiner lifting, ...) and handle the low-level details of managing worker subprocesses. As you state, the more we can put into these libraries, the more all runners can get "for free" by interacting with them, while still providing the flexibility to adapt to their differing models and strengths.
>>>>
>>>> Getting this right is, for me at least, one of the highest priorities for Beam.
>>>>
>>>> - Robert
>>>>
>>>> On Wed, May 16, 2018 at 11:51 AM Kenneth Knowles <k...@google.com> wrote:
>>>>
>>>> > Hi Romain,
>>>> >
>>>> > This gives a clear view of your perspective. I also recommend you ask around to those who have been working on Beam and big data processing for a long time to learn more about their perspective.
>>>> >
>>>> > Your "Beam Analysis" is pretty accurate about what we've been trying to build. I would say (a) & (b) as "any language on any runner" and (c) is our plan of how to do it: define primitives which are fundamental to parallel processing and formalize a language-independent representation, with adapters for each language and data processing engine.
>>>>
>>>> > Of course anyone in the community may have their own particular goal. We don't control what they work on, and we are grateful for their efforts.
>>>> >
>>>> > Technically, there is plenty to agree with. I think as you learn about Beam you will find that many of your suggestions are already handled in some way. You may also continue to learn sometimes about the specific reasons things are done in a different way than you expected. These should help you find how to build what you want to build.
>>>> >
>>>> > Kenn
>>>> >
>>>> > On Wed, May 16, 2018 at 1:14 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>> >
>>>> >> Hi guys,
>>>> >>
>>>> >> Since it is not the first time we have a thread where we end up not understanding each other, I'd like to take this as an opportunity to clarify what I'm looking for, in a more formal way. This assumes our misunderstandings come from the fact that I mainly tried to fix issues one by one, instead of painting the big picture I'm after. (My rationale was that I was not able to invest more time in that, but I am starting to think it was not a good choice.) I really hope it helps.
>>>> >>
>>>> >> 1. Beam analysis
>>>> >>
>>>> >> Beam has three main goals:
>>>> >>
>>>> >> a. Being a portable API across runners (I also call them "implementations", as opposed to "API")
>>>> >> b. Bringing some interoperability between languages and therefore users
>>>> >> c. Providing primitives (groupby for instance), I/O and generic processing items
>>>> >>
>>>> >> Indeed it doesn't cover all of Beam's features but, at a high level, it is what it brings.
>>>> >>
>>>> >> In terms of advantages and why one would choose Beam instead of Spark, for instance, the benefit is mainly to not be vendor-locked on one side and to enable more users on the other side (note that point c is just catching up on vendors' ecosystems with these statements).
>>>>
>>>> >> 2. Portable API across environments
>>>> >>
>>>> >> It is key, here, to keep in mind that Beam is not an environment or a runner. It is, by design, a library *embedded* in other environments.
>>>> >>
>>>> >> a. This means that Beam must keep its stack as clean as possible. If it is still ambiguous: Beam must be dependency-free.
>>>> >>
>>>> >> Until now the workaround has been to shade dependencies. This is not a solution, since it leads to big jobs of hundreds of megabytes, which prevents scaling since we deploy over the network. It makes all deployment, management, and storage a pain on the ops side. The other pitfall of shading (or shadowing, since we are on Gradle now) is that it completely breaks any company tooling and prevents vulnerability scanning or dependency upgrades - not handled by the dev team - from working correctly. This is a major issue for any software targeting a professional level, and it should not be underestimated.
>>>> >>
>>>> >> From that point we can get scared, but with Java 8 there is no real point in having tons of dependencies for the SDK core - this is for Java but should be true for most languages, since Beam's requirements are light here.
>>>> >>
>>>> >> However, it can also require rethinking the SDK core modularity: why is there some IO here? Do we need a big fat SDK core?
>>>> >>
>>>> >> b. API or "put it all"?
>>>> >>
>>>> >> The current API is in sdk-core, but it actually prevents modular development since there are primitives and some IO in the core. What would be sane is to extract the actual API from the core and get a beam-api. This way we match all kinds of users:
>>>> >>
>>>> >> - IO developers (they only need the SDF)
>>>> >> - pipeline writers (they only need the pipeline + IO)
>>>> >> - etc...
>>>>
>>>> >> To make it an API requires some changes, but probably nothing crazy, and it would make Beam more consumable and potentially reusable in other environments.
>>>> >>
>>>> >> I'll not detail the API points here since it is not the goal (I think I tracked most of them in https://gist.github.com/rmannibucau/ab7543c23b6f57af921d98639fbcd436 if you are interested)
>>>> >>
>>>> >> c. Environment is not only about jars
>>>> >>
>>>> >> Beam has two main execution environments:
>>>> >>
>>>> >> - the "pipeline.run" one
>>>> >> - the pipeline execution (runner)
>>>> >>
>>>> >> The last one is quite well known and already has some challenges:
>>>> >>
>>>> >> - can be a main execution so nothing crazy to manage
>>>> >> - can use subclassloaders to execute jobs, scale and isolate jobs
>>>> >> - etc... (we can think of an OSGi flavor for instance)
>>>> >>
>>>> >> The first one is way more challenging since you must match:
>>>> >>
>>>> >> - flat mains
>>>> >> - JavaEE containers
>>>> >> - OSGi containers
>>>> >> - custom weird environments (spring boot jar launcher)
>>>> >> - ...
>>>> >>
>>>> >> This all leads to two key consequences and programming rules to respect:
>>>> >>
>>>> >> - lifecycle: any component must ensure its lifecycle is very well respected (we must avoid "JVM will clean up anyway" kind of thinking)
>>>> >> - no blind cache or static abuse, this must fit *all* environments (PipelineOptionsFactory is a good example of that)
>>>> >>
>>>> >> 3. Make it painless for integrators/community
>>>> >>
>>>> >> Beam's success is bound to the fact that runners exist. A concern which is quite important is that Beam keeps adding features and saying "runners will implement them". I'd like that each time you think that you ask yourself "does it need?".
>>>>
>>>> >> I'll take two examples:
>>>> >>
>>>> >> - the language portable support: there is no need to do it in all runners; you can have a generic runner delegating the tasks to the right implementation@runner and therefore, when adding the language portability feature, you support all existing runners out of the box without impacting them
>>>> >> - the metrics pusher: this one got some discussion and led to a polling implementation which doesn't work in runners that don't have a waiting "driver" (Hazelcast, Spark in client mode etc...). Now it is going to be added to the portable API if I got it right... If you think about it, you can just instrument the pipeline by modifying the DAG before translating it, and therefore work on all runners for free as well.
>>>> >>
>>>> >> These two simple examples show that the work should probably be done on adding DAG preprocessors (sorted) and making the runner something enrichable, rather than on ad-hoc solutions for each feature.
>>>> >>
>>>> >> 4. Be more reactive
>>>> >>
>>>> >> If you check the I/Os, most of them can support asynchronous handling. The gain is to be aligned with the actual I/O, and not only to be asynchronous by starting a new thread. Using that allows scaling much further and using the machine's resources more efficiently.
>>>> >>
>>>> >> However it has a big pitfall: the whole programming model must be reactive. Indeed, we can support an implicit conversion from a non-reactive to a reactive model for simple cases (think of a DoFn multiplying an int by 2), but the I/O should be reactive and Beam should be reactive in its completion to benefit from it.
>>>>
>>>> >> Summary: if I try to summarize this mail, which tries to share the philosophy I'm approaching Beam with more than particular issues, I'd say that I strongly think that, to be a success, Beam must embrace what it is: a portable layer on top of existing implementations. It means that it must define a clear and minimal API for each kind of usage and probably expose it by user kind (so actually N APIs). It must embrace the environments it runs in and accept the constraints they bring. And finally it should be less intrusive in all its layers and try to add features more transversally when possible (and it is possible in a lot of cases). If you bring features for free with new releases, everybody wins; if you announce features and no runner supports them, then you lose (and lose users).
>>>> >>
>>>> >> Hope it helps,
>>>> >> Romain Manni-Bucau
>>>> >> @rmannibucau | Blog | Old Blog | Github | LinkedIn | Book