Re: Beam high level directions (was "Graal instead of docker?")

Romain Manni-Bucau Thu, 17 May 2018 13:40:37 -0700

All runners just provide translations so it is easy to build features on
top of primitives, ie basic translations, instead of requiring runners to
use the same lib which is not yet done and will likely not be done when
adding new reusable parts - keep in mind beam starts to have runners not
hosted at beam.


Having a runner able to do that is a more elegant and robust way than the
fallback which consists to let the user call the pipeline visit to
translate the dag N times but it leads to almost the same at a few specific
exceptions.

Le jeu. 17 mai 2018 21:48, Kenneth Knowles <[email protected]> a écrit :

> If all engines were identical, having a shared optimizer would be useful. 
> Having
> a proxy runner that performance optimizations before submission to an
> actual engine-specific runner has downsides in both directions:
>
>  - obscures the ability of engine-specific runners to optimize the Beam
> primitives because they only receive post-optimized graph
>  - has to be extremely conservative in its optimizations because it does
> not know about the semantics of the underlying engine
>
> Building it as libraries let's engine-specific runners do what is best for
> their engine, while still maximizing reuse.
>
> Kenn
>
> On Thu, May 17, 2018 at 11:43 AM Robert Burke <[email protected]> wrote:
>
>> The approach you're looking for sounds like the user's Runner of Choice,
>> would use a user side version of the runner core, without changing the
>> Runner of Choice?
>>
>> So a user would update their version of the SDK, and the runner would
>> have to pull the core component from the user pipeline?
>>
>> That sounds like it increases pipeline size and decreases pipeline
>> portability, especially for pipelines that are not in the same language as
>> the runner-core, such as for Python and Go.
>>
>> It's not clear to me what runners would be doing in that scenario either.
>> Do you have a proposal about where the interface boundaries would be?
>>
>> On Wed, May 16, 2018, 10:05 PM Romain Manni-Bucau <[email protected]>
>> wrote:
>>
>>> The runner core doesnt fully align on that or rephrased more accurately,
>>> it doesnt go as far as it could for me. Having to call it, is still an
>>> issue since it requires a runner update instead of getting the new feature
>>> for free. The next step sounds to be *one* runner where implementations
>>> plug their translations probably. It would reverse the current pattern and
>>> prepare beam for the future. One good example of such implementation is the
>>> sdf which can "just" reuse dofn primitives to wire its support through
>>> runners.
>>>
>>> Le jeu. 17 mai 2018 02:01, Jesse Anderson <[email protected]> a
>>> écrit :
>>>
>>>> This -> "I'd like that each time you think that you ask yourself "does
>>>> it need?"."
>>>>
>>>> On Wed, May 16, 2018 at 4:53 PM Robert Bradshaw <[email protected]>
>>>> wrote:
>>>>
>>>>> Thanks for your email, Romain. It helps understand your goals and where
>>>>> you're coming from. I'd also like to see a thinner core, and agree it's
>>>>> beneficial to reduce dependencies where possible, especially when
>>>>> supporting the usecase where the pipeline is constructed in an
>>>>> environment
>>>>> other than an end-user's main.
>>>>>
>>>>> It seems a lot of the portability work, despite being on the surface
>>>>> driven
>>>>> by multi-language, aligns well with many of these goals. For example,
>>>>> all
>>>>> the work going on in runners-core to provide a rich library that all
>>>>> (Java,
>>>>> and perhaps non-Java) runners can leverage to do DAG preprocessing
>>>>> (fusion,
>>>>> combiner lifting, ...) and handle the low-level details of managing
>>>>> worker
>>>>> subprocesses. As you state, the more we can put into these libraries,
>>>>> the
>>>>> more all runners can get "for free" by interacting with them, while
>>>>> still
>>>>> providing the flexibility to adapt to their differing models and
>>>>> strengths.
>>>>>
>>>>> Getting this right is, for me at least, one of the highest priorities
>>>>> for
>>>>> Beam.
>>>>>
>>>>> - Robert
>>>>> On Wed, May 16, 2018 at 11:51 AM Kenneth Knowles <[email protected]>
>>>>> wrote:
>>>>>
>>>>> > Hi Romain,
>>>>>
>>>>> > This gives a clear view of your perspective. I also recommend you ask
>>>>> around to those who have been working on Beam and big data processing
>>>>> for a
>>>>> long time to learn more about their perspective.
>>>>>
>>>>> > Your "Beam Analysis" is pretty accurate about what we've been trying
>>>>> to
>>>>> build. I would say (a) & (b) as "any language on any runner" and (c)
>>>>> is our
>>>>> plan of how to do it: define primitives which are fundamental to
>>>>> parallel
>>>>> processing and formalize a language-independent representation, with
>>>>> adapters for each language and data processing engine.
>>>>>
>>>>> > Of course anyone in the community may have their own particular
>>>>> goal. We
>>>>> don't control what they work on, and we are grateful for their efforts.
>>>>>
>>>>> > Technically, there is plenty to agree with. I think as you learn
>>>>> about
>>>>> Beam you will find that many of your suggestions are already handled in
>>>>> some way. You may also continue to learn sometimes about the specific
>>>>> reasons things are done in a different way than you expected. These
>>>>> should
>>>>> help you find how to build what you want to build.
>>>>>
>>>>> > Kenn
>>>>>
>>>>> > On Wed, May 16, 2018 at 1:14 AM Romain Manni-Bucau <
>>>>> [email protected]>
>>>>> wrote:
>>>>>
>>>>> >> Hi guys,
>>>>>
>>>>> >> Since it is not the first time we have a thread where we end up not
>>>>> understanding each other, I'd like to take this as an opportunity to
>>>>> clarify what i'm looking for, in a more formal way. This assumes our
>>>>> misunderstandings come from the fact I mainly tried to fix issues one
>>>>> by
>>>>> ones, instead of painting the big picture I'm getting after. (My
>>>>> rational
>>>>> was I was not able to invest more time in that but I start to think it
>>>>> was
>>>>> not a good chocie). I really hope it helps.
>>>>>
>>>>> >> 1. Beam analysis
>>>>>
>>>>> >> Beam has three main goals:
>>>>>
>>>>> >> a. Being a portable API accross runners (I also call them
>>>>> "implementations" by opposition of "api")
>>>>> >> b. Bringing some interoperability between languages and therefore
>>>>> users
>>>>> >> c. Provide primitives (groupby for instance), I/O and generic
>>>>> processing
>>>>> items
>>>>>
>>>>> >> Indeed it doesn't cover all beam's features but, high level, it is
>>>>> what
>>>>> it brings.
>>>>>
>>>>> >> In terms of advantages and why choosing beam instead of spark, for
>>>>> instance, the benefit is mainly to not be vendor locked on one side
>>>>> and to
>>>>> enable more users on the other side (you note that point c is just
>>>>> catching
>>>>> up on vendors ecosystems with these statements).
>>>>>
>>>>> >> 2. Portable API accross environments
>>>>>
>>>>> >> It is key, here, to keep in mind beam is not an environment or a
>>>>> runner.
>>>>> It is by design, a library *embedded* in other environment.
>>>>>
>>>>> >> a. This means that Beam must keep its stack as clean as possible.
>>>>> If it
>>>>> is still ambiguous: beam must be dependency free.
>>>>>
>>>>> >> Until now the workaround has been to shade dependencies. This is
>>>>> not a
>>>>> solution since it leads to big jobs of hundreds of mega which prevents
>>>>> to
>>>>> scale since we deploy from the network. It makes all deployments,
>>>>> managements, and storage a pain on ops side. The other pitfall of
>>>>> shades
>>>>> (or shadowing since we are on gradle now) is that it completely breaks
>>>>> any
>>>>> company tooling and prevent vulnerability scanning or dependency
>>>>> upgrades -
>>>>> not handled by dev team - to work correctly. This is a major issue for
>>>>> any
>>>>> software targetting some professional level which should not be
>>>>> underestimated.
>>>>>
>>>>> >>  From that point we can get scared but with Java 8 there is no real
>>>>> point
>>>>> having a tons of dependencies for the sdk core - this is for java but
>>>>> should be true for most languages since beam requirements are light
>>>>> here.
>>>>>
>>>>> >> However it can also require to rethink the sdk core modularity: why
>>>>> is
>>>>> there some IO here? Do we need a big fat sdk core?
>>>>>
>>>>> >> b. API or "put it all"?
>>>>>
>>>>> >> Current API is in sdk-core but actually it prevents a modular
>>>>> development since there are primitives and some IO in the core. What
>>>>> would
>>>>> be sane is to extract the actual API from the core and get a beam-api.
>>>>> This
>>>>> way we match all kind of user consumes:
>>>>>
>>>>> >> - IO developers (they only need the SDF)
>>>>> >> - pipeline writers (they only need the pipeline + IO)
>>>>> >> - etc...
>>>>>
>>>>> >> To make it an API it requires some changes but nothing crazy
>>>>> probably
>>>>> and it would make beam more consumable and potentially reusable in
>>>>> other
>>>>> environments.
>>>>>
>>>>> >> I'll not detail the API points here since it is not the goal (think
>>>>> I
>>>>> tracked most of them in
>>>>> https://gist.github.com/rmannibucau/ab7543c23b6f57af921d98639fbcd436
>>>>> if you
>>>>> are interested)
>>>>>
>>>>> >> c. Environment is not only about jars
>>>>>
>>>>> >> Beam has two main execution environments:
>>>>>
>>>>> >> - the "pipeline.run" one
>>>>> >> - the pipeline execution (runner)
>>>>>
>>>>> >> The last one is quite known and already has some challenges:
>>>>>
>>>>> >> - can be a main execution so nothing crazy to manage
>>>>> >> - can use subclassloaders to execute jobs, scale and isolate jobs
>>>>> >> - etc... (we can think to an OSGi flavor for instance)
>>>>>
>>>>> >> The first one is way more challenging since you must match:
>>>>>
>>>>> >> - flat mains
>>>>> >> - JavaEE containers
>>>>> >> - OSGi containers
>>>>> >> - custom weird environments (spring boot jar launcher)
>>>>> >> - ...
>>>>>
>>>>> >> This all lead to two very key consequences and programming rule
>>>>> respect:
>>>>>
>>>>> >> - lifecycle: any component must ensure its lifecycle is very well
>>>>> respected (we must avoid "JVM will clean up anyway" kind of thinking)
>>>>> >> - no blind cache or static abuse, this must fit *all* environments
>>>>> (pipelineoptionsfacctory is a good example of that)
>>>>>
>>>>> >> 3. Make it hurtless for integrators/community
>>>>>
>>>>> >> Beam's success is bound to the fact runners exist. A concern which
>>>>> is
>>>>> quite important is that beam keeps adding features and say "runners
>>>>> will
>>>>> implement them". I'd like that each time you think that you ask
>>>>> yourself
>>>>> "does it need?".
>>>>>
>>>>> >> I'll take two examples:
>>>>>
>>>>> >> - the language portable support: there is no need to do it in all
>>>>> runners, you can have a generic runner delegating to the right
>>>>> implementation@runner the tasks and therefore, adding language
>>>>> portability
>>>>> feature, you support OOTB all existing runners without impacting them
>>>>> >> - the metrics pusher: this one got some discussion and lead to a
>>>>> polling
>>>>> implementation which doesn't work in all runners not having a waiting
>>>>> "driver" (hazelcast, spark in client mode etc...). Now it is going to
>>>>> be
>>>>> added to the portable API if I got it right...if you think about it,
>>>>> you
>>>>> can just instrument the pipeline by modifying the DAG before
>>>>> translating it
>>>>> and therefore work on all runners for free as well.
>>>>>
>>>>> >> These two simple examples show that the work should probably be
>>>>> done on
>>>>> adding DAG preprocessors (sorted) and runner as something enrichable,
>>>>> rather than with ad-hoc solutions for each feature.
>>>>>
>>>>> >> 4. Be more reactive
>>>>>
>>>>> >> If you check I/O, most of them can support asynchronous handling.
>>>>> The
>>>>> gain is to be aligned on the actual I/O and not only be asynchronous to
>>>>> starts a new thread. Using that allows to scale way more and use more
>>>>> efficiently resources of the machine.
>>>>>
>>>>> >> However it has a big pitfall: the whole programming model must be
>>>>> reactive. Indeed, we can support a conversion from a not reactive to a
>>>>> reactive model implicitly for simple case (think to a DoFn multiplying
>>>>> by 2
>>>>> an int) but the I/O should be reactive and beam should be reactive in
>>>>> its
>>>>> completion to benefit from it.
>>>>>
>>>>>
>>>>>
>>>>> >> Summary: if I try to summarize this mail which tries to share the
>>>>> philosophy I'm approaching beam with, more than particular issues, i'd
>>>>> say
>>>>> that I strongly think, that to be a success, Beam but embrace what it
>>>>> is: a
>>>>> portable layer on top of existing implementations. It means that it
>>>>> must
>>>>> define a clear and minimal API for each kind of usage and probably
>>>>> expose
>>>>> it by user kind (so actually N api). it must embrace the environments
>>>>> it
>>>>> runs in and assume the constraints it brings. And finally it should be
>>>>> less
>>>>> intrusive in all its layers and try to add features more transversally
>>>>> when
>>>>> possible (and it is possible in a lot of cases). If you bring features
>>>>> for
>>>>> free with new releases, everybody wins, if you announce features and no
>>>>> runner support it, then you loose (and loose users).
>>>>>
>>>>>
>>>>>
>>>>> >> Hope it helps,
>>>>> >> Romain Manni-Bucau
>>>>> >> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
>>>>>
>>>>

Re: Beam high level directions (was "Graal instead of docker?")

Reply via email to