Re: Beam high level directions (was "Graal instead of docker?")

Kenneth Knowles Thu, 17 May 2018 12:48:53 -0700

If all engines were identical, having a shared optimizer would be useful.
Having a proxy runner that performs optimizations before submission to an
actual engine-specific runner has downsides in both directions:


 - it obscures the ability of engine-specific runners to optimize the Beam
primitives, because they only receive the post-optimized graph
 - it has to be extremely conservative in its optimizations, because it does
not know the semantics of the underlying engine

Building it as libraries lets engine-specific runners do what is best for
their engine, while still maximizing reuse.
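
As a rough illustration of the "libraries" option, here is a minimal sketch in
Python. It is not the runners-core API: the list-of-step-names graph model, the
two pass functions, and the EngineRunner class are all hypothetical names made
up for this example. The point it tries to show is that each engine-specific
runner picks the shared passes that suit its engine, instead of receiving a
graph that a proxy has already rewritten.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical model: a pipeline is a linear list of step names.
# Real pipeline graphs are DAGs; a list keeps the sketch short.
Graph = List[str]
Optimization = Callable[[Graph], Graph]


def fuse_adjacent_pardos(graph: Graph) -> Graph:
    """Hypothetical shared pass: merge runs of consecutive ParDo steps."""
    fused: Graph = []
    for step in graph:
        if step.startswith("ParDo") and fused and fused[-1].startswith("ParDo"):
            fused[-1] = fused[-1] + "+" + step
        else:
            fused.append(step)
    return fused


def lift_combiners(graph: Graph) -> Graph:
    """Hypothetical shared pass: rewrite GroupByKey followed by Combine into
    a partial combine before the shuffle and a final combine after it."""
    lifted: Graph = []
    i = 0
    while i < len(graph):
        if graph[i] == "GroupByKey" and i + 1 < len(graph) and graph[i + 1] == "Combine":
            lifted.extend(["PartialCombine", "GroupByKey", "FinalCombine"])
            i += 2
        else:
            lifted.append(graph[i])
            i += 1
    return lifted


@dataclass
class EngineRunner:
    """Hypothetical engine-specific runner that chooses its own passes."""
    name: str
    passes: List[Optimization]

    def translate(self, graph: Graph) -> Graph:
        # Apply only the library passes this engine opted into, in order.
        for optimization in self.passes:
            graph = optimization(graph)
        return graph


graph = ["Read", "ParDo(a)", "ParDo(b)", "GroupByKey", "Combine", "Write"]

# An engine whose own optimizer already fuses operators skips library fusion.
runner_a = EngineRunner("engine-with-native-fusion", [lift_combiners])
# An engine without an optimizer reuses both library passes.
runner_b = EngineRunner("engine-without-optimizer",
                        [fuse_adjacent_pardos, lift_combiners])

print(runner_a.translate(graph))
print(runner_b.translate(graph))
```

In this sketch runner_a still sees the two separate ParDo steps and can hand
them to its engine's own optimizer, which a proxy runner that fused them
up front would have prevented.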

Kenn

On Thu, May 17, 2018 at 11:43 AM Robert Burke <rob...@frantil.com> wrote:

> The approach you're looking for sounds like the user's Runner of Choice
> would use a user-side version of the runner core, without changing the
> Runner of Choice?
>
> So a user would update their version of the SDK, and the runner would have
> to pull the core component from the user pipeline?
>
> That sounds like it increases pipeline size and decreases pipeline
> portability, especially for pipelines that are not in the same language as
> the runner-core, such as for Python and Go.
>
> It's not clear to me what runners would be doing in that scenario either.
> Do you have a proposal about where the interface boundaries would be?
>
> On Wed, May 16, 2018, 10:05 PM Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> The runner core doesn't fully align with that or, phrased more accurately,
>> it doesn't go as far as it could for me. Having to call it is still an
>> issue, since it requires a runner update instead of getting the new feature
>> for free. The next step is probably *one* runner into which implementations
>> plug their translations. It would reverse the current pattern and
>> prepare Beam for the future. One good example of such an implementation is
>> the SDF, which can "just" reuse DoFn primitives to wire its support through
>> runners.
>>
>> On Thu, May 17, 2018, 02:01, Jesse Anderson <je...@bigdatainstitute.io>
>> wrote:
>>
>>> This -> "I'd like that each time you think that you ask yourself "does
>>> it need?"."
>>>
>>> On Wed, May 16, 2018 at 4:53 PM Robert Bradshaw <rober...@google.com>
>>> wrote:
>>>
>>>> Thanks for your email, Romain. It helps me understand your goals and where
>>>> you're coming from. I'd also like to see a thinner core, and agree it's
>>>> beneficial to reduce dependencies where possible, especially when
>>>> supporting the use case where the pipeline is constructed in an
>>>> environment other than an end-user's main.
>>>>
>>>> It seems a lot of the portability work, despite being on the surface
>>>> driven by multi-language, aligns well with many of these goals. For
>>>> example, there is all the work going on in runners-core to provide a rich
>>>> library that all (Java, and perhaps non-Java) runners can leverage to do
>>>> DAG preprocessing (fusion, combiner lifting, ...) and handle the low-level
>>>> details of managing worker subprocesses. As you state, the more we can put
>>>> into these libraries, the more all runners can get "for free" by
>>>> interacting with them, while still providing the flexibility to adapt to
>>>> their differing models and strengths.
>>>>
>>>> Getting this right is, for me at least, one of the highest priorities for
>>>> Beam.
>>>>
>>>> - Robert
>>>> On Wed, May 16, 2018 at 11:51 AM Kenneth Knowles <k...@google.com>
>>>> wrote:
>>>>
>>>> > Hi Romain,
>>>>
>>>> > This gives a clear view of your perspective. I also recommend you ask
>>>> > around to those who have been working on Beam and big data processing
>>>> > for a long time to learn more about their perspective.
>>>>
>>>> > Your "Beam Analysis" is pretty accurate about what we've been trying
>>>> > to build. I would summarize (a) & (b) as "any language on any runner",
>>>> > and (c) is our plan of how to do it: define primitives which are
>>>> > fundamental to parallel processing and formalize a language-independent
>>>> > representation, with adapters for each language and data processing
>>>> > engine.
>>>>
>>>> > Of course anyone in the community may have their own particular goal.
>>>> > We don't control what they work on, and we are grateful for their
>>>> > efforts.
>>>>
>>>> > Technically, there is plenty to agree with. I think as you learn about
>>>> > Beam you will find that many of your suggestions are already handled in
>>>> > some way. You may also, over time, learn the specific reasons why things
>>>> > are done differently than you expected. These should help you find how
>>>> > to build what you want to build.
>>>>
>>>> > Kenn
>>>>
>>>> > On Wed, May 16, 2018 at 1:14 AM Romain Manni-Bucau
>>>> > <rmannibu...@gmail.com> wrote:
>>>>
>>>> >> Hi guys,
>>>>
>>>> >> Since this is not the first time we have a thread where we end up not
>>>> >> understanding each other, I'd like to take this as an opportunity to
>>>> >> clarify what I'm looking for, in a more formal way. This assumes our
>>>> >> misunderstandings come from the fact that I mainly tried to fix issues
>>>> >> one by one, instead of painting the big picture I'm after. (My
>>>> >> rationale was that I was not able to invest more time in that, but I am
>>>> >> starting to think it was not a good choice.) I really hope it helps.
>>>>
>>>> >> 1. Beam analysis
>>>>
>>>> >> Beam has three main goals:
>>>>
>>>> >> a. Being a portable API across runners (I also call them
>>>> >> "implementations", as opposed to "API")
>>>> >> b. Bringing some interoperability between languages and therefore
>>>> >> users
>>>> >> c. Providing primitives (groupby for instance), I/O and generic
>>>> >> processing items
>>>>
>>>> >> Indeed, this doesn't cover all of Beam's features but, at a high level,
>>>> >> it is what Beam brings.
>>>>
>>>> >> In terms of advantages, and why one would choose Beam instead of Spark,
>>>> >> for instance, the benefit is mainly to avoid vendor lock-in on one side
>>>> >> and to enable more users on the other side (note that, given these
>>>> >> statements, point c is just catching up with vendors' ecosystems).
>>>>
>>>> >> 2. Portable API across environments
>>>>
>>>> >> It is key, here, to keep in mind that Beam is not an environment or a
>>>> >> runner. It is, by design, a library *embedded* in other environments.
>>>>
>>>> >> a. This means that Beam must keep its stack as clean as possible. In
>>>> >> case that is still ambiguous: Beam must be dependency free.
>>>>
>>>> >> Until now the workaround has been to shade dependencies. This is not a
>>>> >> solution, since it leads to big jobs of hundreds of megabytes, which
>>>> >> prevents scaling since we deploy over the network. It makes all
>>>> >> deployment, management, and storage a pain on the ops side. The other
>>>> >> pitfall of shades (or shadowing, since we are on Gradle now) is that it
>>>> >> completely breaks any company tooling and prevents vulnerability
>>>> >> scanning or dependency upgrades (not handled by the dev team) from
>>>> >> working correctly. This is a major issue for any software targeting
>>>> >> some professional level, and it should not be underestimated.
>>>>
>>>> >> From that point we can get scared, but with Java 8 there is no real
>>>> >> point in having tons of dependencies for the SDK core. This is for
>>>> >> Java, but it should be true for most languages since Beam's
>>>> >> requirements are light here.
>>>>
>>>> >> However, it may also require rethinking the SDK core's modularity: why
>>>> >> is there some IO here? Do we need a big fat SDK core?
>>>>
>>>> >> b. API or "put it all"?
>>>>
>>>> >> The current API is in sdk-core, but that actually prevents modular
>>>> >> development since there are primitives and some IO in the core. What
>>>> >> would be sane is to extract the actual API from the core and get a
>>>> >> beam-api. This way we match all kinds of user consumption:
>>>>
>>>> >> - IO developers (they only need the SDF)
>>>> >> - pipeline writers (they only need the pipeline + IO)
>>>> >> - etc...
>>>>
>>>> >> Making it an API requires some changes, but probably nothing crazy,
>>>> >> and it would make Beam more consumable and potentially reusable in other
>>>> >> environments.
>>>>
>>>> >> I'll not detail the API points here since it is not the goal (I think I
>>>> >> tracked most of them in
>>>> >> https://gist.github.com/rmannibucau/ab7543c23b6f57af921d98639fbcd436
>>>> >> if you are interested).
>>>>
>>>> >> c. Environment is not only about jars
>>>>
>>>> >> Beam has two main execution environments:
>>>>
>>>> >> - the "pipeline.run" one
>>>> >> - the pipeline execution (runner)
>>>>
>>>> >> The last one is quite well known and already has some challenges:
>>>>
>>>> >> - can be a main execution so nothing crazy to manage
>>>> >> - can use subclassloaders to execute jobs, scale and isolate jobs
>>>> >> - etc... (we can think of an OSGi flavor for instance)
>>>>
>>>> >> The first one is way more challenging since you must match:
>>>>
>>>> >> - flat mains
>>>> >> - JavaEE containers
>>>> >> - OSGi containers
>>>> >> - custom weird environments (spring boot jar launcher)
>>>> >> - ...
>>>>
>>>> >> All this leads to two key consequences, and programming rules to
>>>> >> respect:
>>>>
>>>> >> - lifecycle: any component must ensure its lifecycle is very well
>>>> >> respected (we must avoid "JVM will clean up anyway" kind of thinking)
>>>> >> - no blind cache or static abuse, this must fit *all* environments
>>>> >> (PipelineOptionsFactory is a good example of that)
>>>>
>>>> >> 3. Make it painless for integrators/community
>>>>
>>>> >> Beam's success is bound to the fact that runners exist. A concern which
>>>> >> is quite important is that Beam keeps adding features and says "runners
>>>> >> will implement them". I'd like that each time you think that you ask
>>>> >> yourself "does it need?".
>>>>
>>>> >> I'll take two examples:
>>>>
>>>> >> - the language portability support: there is no need to do it in all
>>>> >> runners; you can have a generic runner delegating the tasks to the right
>>>> >> implementation@runner and therefore, when adding the language
>>>> >> portability feature, you support all existing runners out of the box
>>>> >> without impacting them
>>>> >> - the metrics pusher: this one got some discussion and led to a polling
>>>> >> implementation which doesn't work in any runner that lacks a waiting
>>>> >> "driver" (hazelcast, spark in client mode etc...). Now it is going to be
>>>> >> added to the portable API, if I got it right... If you think about it,
>>>> >> you can just instrument the pipeline by modifying the DAG before
>>>> >> translating it, and therefore work on all runners for free as well.
>>>>
>>>> >> These two simple examples show that the work should probably go into
>>>> >> adding (sorted) DAG preprocessors and making the runner something
>>>> >> enrichable, rather than into ad-hoc solutions for each feature.
>>>>
>>>> >> 4. Be more reactive
>>>>
>>>> >> If you check the I/Os, most of them can support asynchronous handling.
>>>> >> The gain is to be aligned with the actual I/O, and not to be
>>>> >> asynchronous only by starting a new thread. Using that allows scaling
>>>> >> much further and using the machine's resources more efficiently.
>>>>
>>>> >> However, it has a big pitfall: the whole programming model must be
>>>> >> reactive. Indeed, we can implicitly support a conversion from a
>>>> >> non-reactive to a reactive model for simple cases (think of a DoFn
>>>> >> multiplying an int by 2), but the I/O should be reactive and Beam should
>>>> >> be reactive in its completion to benefit from it.
>>>>
>>>>
>>>>
>>>> >> Summary: if I try to summarize this mail, which tries to share the
>>>> >> philosophy I'm approaching Beam with more than particular issues, I'd
>>>> >> say that I strongly think that, to be a success, Beam must embrace what
>>>> >> it is: a portable layer on top of existing implementations. This means
>>>> >> that it must define a clear and minimal API for each kind of usage and
>>>> >> probably expose it by user kind (so actually N APIs). It must embrace
>>>> >> the environments it runs in and accept the constraints they bring. And
>>>> >> finally it should be less intrusive in all its layers and try to add
>>>> >> features more transversally when possible (and it is possible in a lot
>>>> >> of cases). If you bring features for free with new releases, everybody
>>>> >> wins; if you announce features and no runner supports them, then you
>>>> >> lose (and lose users).
>>>>
>>>>
>>>>
>>>> >> Hope it helps,
>>>> >> Romain Manni-Bucau
>>>> >> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
>>>>
>>>
