Thanks for your email, Romain. It helps me understand your goals and where
you're coming from. I'd also like to see a thinner core, and agree it's
beneficial to reduce dependencies where possible, especially when
supporting the use case where the pipeline is constructed in an environment
other than an end-user's main.

It seems a lot of the portability work, despite being on the surface driven
by multi-language, aligns well with many of these goals. Consider, for
example, all the work going on in runners-core to provide a rich library
that all (Java, and perhaps non-Java) runners can leverage to do DAG
preprocessing (fusion, combiner lifting, ...) and to handle the low-level
details of managing worker subprocesses. As you state, the more we can put
into these libraries, the
more all runners can get "for free" by interacting with them, while still
providing the flexibility to adapt to their differing models and strengths.

Getting this right is, for me at least, one of the highest priorities for
Beam.

- Robert
On Wed, May 16, 2018 at 11:51 AM Kenneth Knowles <k...@google.com> wrote:

> Hi Romain,

> This gives a clear view of your perspective. I also recommend asking
those who have been working on Beam and big data processing for a long time
to learn more about their perspectives.

> Your "Beam Analysis" is pretty accurate about what we've been trying to
build. I would summarize (a) and (b) as "any language on any runner", and (c)
is our plan for how to do it: define primitives which are fundamental to
parallel processing and formalize a language-independent representation,
with adapters for each language and data processing engine.

> Of course anyone in the community may have their own particular goal. We
don't control what they work on, and we are grateful for their efforts.

> Technically, there is plenty to agree with. I think as you learn about
Beam you will find that many of your suggestions are already handled in
some way. You will also sometimes learn the specific reasons things are
done in a different way than you expected. These should help you find how
to build what you want to build.

> Kenn

> On Wed, May 16, 2018 at 1:14 AM Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

>> Hi guys,

>> Since this is not the first time we've had a thread where we end up not
understanding each other, I'd like to take this as an opportunity to
clarify what I'm looking for, in a more formal way. I assume our
misunderstandings come from the fact that I mainly tried to fix issues one
by one, instead of painting the big picture I'm after. (My rationale was
that I was not able to invest more time in it, but I'm starting to think
that was not a good choice.) I really hope this helps.

>> 1. Beam analysis

>> Beam has three main goals:

>> a. Being a portable API across runners (I also call runners
"implementations", as opposed to the "API")
>> b. Bringing some interoperability between languages, and therefore users
>> c. Providing primitives (groupby, for instance), I/O, and generic
processing items

>> This doesn't cover all of Beam's features but, at a high level, it is
what Beam brings.

>> In terms of advantages, and why one would choose Beam instead of Spark,
for instance, the benefit is mainly to avoid vendor lock-in on one side and
to enable more users on the other (note that, by these statements, point c
is just catching up with vendor ecosystems).

>> 2. Portable API across environments

>> It is key, here, to keep in mind that Beam is not an environment or a
runner. It is, by design, a library *embedded* in other environments.

>> a. This means that Beam must keep its stack as clean as possible. If
that is still ambiguous: Beam must be dependency-free.

>> Until now the workaround has been to shade dependencies. This is not a
solution, since it leads to big jobs of hundreds of megabytes, which
prevents scaling since we deploy over the network. It makes deployment,
management, and storage a pain on the ops side. The other pitfall of
shading (or shadowing, since we are on Gradle now) is that it completely
breaks any company tooling and prevents vulnerability scanning or
dependency upgrades - not handled by the dev team - from working correctly.
This is a major issue for any software targeting professional use, and it
should not be underestimated.

>> From that point, this can sound scary, but with Java 8 there is no real
need for a ton of dependencies in the SDK core - this is for Java, but it
should be true for most languages, since Beam's requirements are light here.

>> However, it may also require rethinking the SDK core's modularity: why
is there some IO in there? Do we need a big, fat SDK core?

>> b. API or "put it all"?

>> The current API is in sdk-core, but it actually prevents modular
development, since there are primitives and some IO in the core. What would
be sane is to extract the actual API from the core into a beam-api. That
way we match every kind of consumer:

>> - IO developers (they only need the SDF)
>> - pipeline writers (they only need the pipeline + IO)
>> - etc...

>> Making it an API requires some changes, but probably nothing crazy, and
it would make Beam more consumable and potentially reusable in other
environments.

>> I won't detail the API points here, since that is not the goal (I think
I tracked most of them in
https://gist.github.com/rmannibucau/ab7543c23b6f57af921d98639fbcd436 if you
are interested).

>> c. Environment is not only about jars

>> Beam has two main execution environments:

>> - the "pipeline.run" one
>> - the pipeline execution (runner)

>> The latter is quite well known and already has some challenges:

>> - it can be a main execution, so nothing crazy to manage
>> - it can use subclassloaders to execute jobs, scale, and isolate jobs
>> - etc. (we can think of an OSGi flavor, for instance)

>> The former is far more challenging, since you must support:

>> - flat mains
>> - JavaEE containers
>> - OSGi containers
>> - custom, weird environments (the Spring Boot jar launcher)
>> - ...

>> This all leads to two key consequences, i.e. programming rules to respect:

>> - lifecycle: any component must ensure its lifecycle is well respected
(we must avoid "the JVM will clean up anyway" kinds of thinking)
>> - no blind caches or static abuse; this must fit *all* environments
(PipelineOptionsFactory is a good example of that)

>> 3. Make it painless for integrators and the community

>> Beam's success is bound to the fact that runners exist. One quite
important concern is that Beam keeps adding features and saying "runners
will implement them". I'd like that, each time you think that, you ask
yourself "does the runner really need to implement this?".

>> I'll take two examples:

>> - language portability support: there is no need to implement it in all
runners; you can have a generic runner delegating the tasks to the right
implementation at the runner level, and by adding the language-portability
feature there, you support all existing runners out of the box without
impacting them
>> - the metrics pusher: this one got some discussion and led to a polling
implementation, which doesn't work in runners that don't have a waiting
"driver" (Hazelcast, Spark in client mode, etc.). Now it is going to be
added to the portable API, if I got it right... If you think about it, you
can just instrument the pipeline by modifying the DAG before translating
it, and it therefore works on all runners for free as well.

>> These two simple examples show that the work should probably be done by
adding (ordered) DAG preprocessors and making the runner something
enrichable, rather than with ad-hoc solutions for each feature.

>> 4. Be more reactive

>> If you look at the I/Os, most of them can support asynchronous handling.
The gain is to be aligned with the actual I/O, and not merely to start a
new thread in order to appear asynchronous. Doing this allows scaling much
further and using the machine's resources more efficiently.

>> However, this has a big pitfall: the whole programming model must be
reactive. Indeed, we can support an implicit conversion from a non-reactive
to a reactive model for simple cases (think of a DoFn multiplying an int by
2), but the I/O should be reactive, and Beam should be reactive in its
completion handling, to benefit from it.



>> Summary: if I try to summarize this mail, which tries to share the
philosophy I'm approaching Beam with, more than particular issues, I'd say
that I strongly think that, to be a success, Beam must embrace what it is:
a portable layer on top of existing implementations. This means it must
define a clear and minimal API for each kind of usage, and probably expose
it by user kind (so actually N APIs). It must embrace the environments it
runs in and accept the constraints they bring. And finally, it should be
less intrusive in all its layers and try to add features more transversally
when possible (and it is possible in a lot of cases). If you bring features
for free with new releases, everybody wins; if you announce features and no
runner supports them, then you lose (and lose users).



>> Hope it helps,
>> Romain Manni-Bucau
>> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
