On Fri, Sep 14, 2018 at 9:48 AM, Robert Bradshaw <[email protected]> wrote:
On Fri, Sep 14, 2018 at 8:00 AM Romain Manni-Bucau <[email protected]> wrote:
Well, the IBM runner is outside Beam, for instance, so this is not really a point IMHO.
My view is simple:
1. does this module bring anything to Beam as a project: I
understand your answer as a no (please clarify if I'm wrong)
As has been mentioned, this makes it easier for both developers at Google and developers outside Google to contribute, which is the immediate benefit. Longer term, I also hope it leads to more code sharing (there is currently an unnecessary amount of duplication due to the pain of developing across this boundary), including features that aren't yet in upstream runners but we'd like to see (e.g. liquid sharding).
This is only half true: external devs will be able to contribute but not to test, so there is no real gain here.
2. does this module bring anything to Beam or Big Data users:
same answer
Dataflow is used by many Beam users, so making it work well is in their interest as well. Whatever makes contributors' lives easier (and wastes less of their time) will translate into more contributions (new features, faster bugfixes, ...) as well.
Same point: if you can contribute to something you can't test without mocks, then you still can't work on it reliably.
So in the end this will not bring anything to the community and will just solve a Google-internal design issue, so why should it land in Beam?
I get the "we can't test it" point but this is wrong since you
can use snapshots and staging repos, if not the enhancement is
trivial enough to make it doable and not add a dead module to
beam tree.
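(For illustration only, a minimal sketch of what depending on Beam snapshot artifacts could look like for an outside contributor, assuming a plain Gradle project with the Kotlin DSL; the repository URL is the standard ASF snapshot repository, and the artifact version shown is only an example, not a statement of what is actually published.)

// build.gradle.kts -- hypothetical sketch; coordinates and version are illustrative.
plugins {
    java
}

repositories {
    mavenCentral()
    maven {
        // Unreleased Apache snapshot artifacts are published here.
        url = uri("https://repository.apache.org/content/repositories/snapshots/")
    }
}

dependencies {
    // Example: depend on a snapshot build of the Dataflow runner instead of a release.
    implementation("org.apache.beam:beam-runners-google-cloud-dataflow-java:2.8.0-SNAPSHOT")
}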
Am I missing anything?
While it's true we *can* test this without it being in Beam, as we have been doing, it's painful. It's like doing away with presubmits and only relying on postsubmits, but where you can't even look at the failure and fix it on your local machine. It's a huge time sink for all those involved, and not good for transparency or openness (e.g. there are things that only Googlers can do).
This is the case for any vendor implementation based on Beam, since by design the dependency goes in this direction.
As has been mentioned, we already do this for Flink, Spark, etc. There's also a precedent for providing connectors to even non-OSS systems, e.g. we ship the job submission portions for Dataflow, IO connectors for Amazon Kinesis, and an S3 filesystem adapter. It certainly wouldn't be to our, or our users', benefit to remove those.
Agreed, but these are modules which touch users directly: if you have some S3 bucket, you grab the module and run it against your data. In the worker case, you will never do that.
Eventually, as has been mentioned on the other thread, I hope our
interfaces become stable enough that it's easy to move much if not
all of this into the respective upstream projects. But that is
certainly not the case right now.
This is likely where the investment should be made, instead of working around it by making Beam bigger, increasing its maintenance cost for the community without real gain, and making it harder to get into.
Hopefully this helps answer your questions as to the benefits for Beam.
On Fri, Sep 14, 2018 at 7:22 AM, Reuven Lax <[email protected]> wrote:
Dataflow tests are part of the Beam postsubmits, and if a PR breaks the Dataflow runner it will probably be rolled back. Today, Beam contributors who make changes impacting the runner boundary have no way to make those changes without breaking Dataflow (unless they ask a Googler to help them). Fortunately these are not the most common changes, but they happen, and it has caused a lot of pain in the past.
Putting this code into the GitHub repository allows all runners to be modified when such a change is made, not just the non-Dataflow runners. This makes it easier for contributors and committers to make changes to Beam.
Reuven
On Thu, Sep 13, 2018 at 10:08 PM Romain Manni-Bucau <[email protected]> wrote:
Flink, Spark, and Apex are usable since they are OSS, so you grab them + Beam and you "run".
If I grab the Dataflow worker + some OSS project and "run", it is the same; however, if I grab the Dataflow worker and can't do anything with it, the added value for Beam and its users is pretty much nil, no? It just means Google should find another way to manage this dependency, if that is the case, IMHO.
Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> | Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github
<https://github.com/rmannibucau> | LinkedIn
<https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>
On Thu, Sep 13, 2018 at 11:35 PM, Lukasz Cwik <[email protected]> wrote:
Romain, the code is very similar to the adaptation layer between the shared-libraries part of Apache Beam and any other runner, for example the code within runners/spark, runners/apex, or runners/flink. If someone wanted to build an emulator of the Dataflow service, they would be able to re-use it, but that is as impractical as writing an emulator for Flink or Spark and plugging it in as the dependency for runners/flink and runners/spark respectively.
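For illustration, here is a minimal, hypothetical sketch (in Kotlin, against the Beam Java SDK) of the kind of adaptation layer being described: a runner that walks the pipeline graph and hands each transform to some execution engine. MyEngineRunner and the engine it targets are invented for this sketch and are not the Dataflow worker code.

import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.PipelineResult
import org.apache.beam.sdk.PipelineRunner
import org.apache.beam.sdk.options.PipelineOptions
import org.apache.beam.sdk.runners.TransformHierarchy

// Hypothetical adaptation layer: translates a Beam pipeline for "MyEngine".
class MyEngineRunner private constructor(private val options: PipelineOptions) :
    PipelineRunner<PipelineResult>() {

  companion object {
    // Runners conventionally expose a static fromOptions factory.
    @JvmStatic
    fun fromOptions(options: PipelineOptions): MyEngineRunner = MyEngineRunner(options)
  }

  override fun run(pipeline: Pipeline): PipelineResult {
    // Walk the pipeline graph and map each primitive transform
    // (ParDo, GroupByKey, ...) onto the engine's own operators.
    pipeline.traverseTopologically(object : Pipeline.PipelineVisitor.Defaults() {
      override fun visitPrimitiveTransform(node: TransformHierarchy.Node) {
        println("translating " + node.fullName)
      }
    })
    // A real runner would submit the translated job and return a handle
    // that the SDK can poll or cancel.
    TODO("submit to the execution engine and return its PipelineResult")
  }
}

The runners/flink, runners/spark, and runners/apex modules are of course much larger, but structurally they do this same job against their respective engines.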
On Thu, Sep 13, 2018 at 2:07 PM Raghu Angadi <[email protected]> wrote:
On Thu, Sep 13, 2018 at 12:53 PM Romain Manni-Bucau <[email protected]> wrote:
If it is usable by itself without Google karma (can you use a worker without Dataflow itself?), it sounds awesome; otherwise it sounds weird IMHO.
Can you elaborate a bit more on using the worker without Dataflow? I essentially see it as a part of the Dataflow runner. A runner is specific to a platform.
I am a Googler, but commenting as a community member.
Raghu.
On Thu, Sep 13, 2018 at 9:36 PM, Kai Jiang <[email protected]> wrote:
+1 (non-Googler)
A big help for transparency and for future runners.
Best,
Kai
On Thu, Sep 13, 2018, 11:45 Xinyu Liu <[email protected]> wrote:
Big +1 (non-Googler).
From the Samza Runner's perspective, we are very happy to see the Dataflow worker code so we can learn and compete :).
Thanks,
Xinyu
On Thu, Sep 13, 2018 at 11:34 AM Suneel Marthi <[email protected]> wrote:
+1 (non-Googler)
This is a great 👍 move.
On Sep 13, 2018, at 2:25 PM, Tim Robertson <[email protected]> wrote:
+1 (non-Googler)
It sounds pragmatic, helps with transparency should issues arise, and enables more people to fix things.
On Thu, Sep 13, 2018 at 8:15 PM Dan Halperin <[email protected]> wrote:
From my perspective as a (non-Google) community member, huge +1.

I don't see anything bad for the community about open sourcing more of the probably-most-used runner. While the DirectRunner is probably still the reference implementation of Beam, it can't hurt to see more working code. Other runners or runner implementors can refer to this code if they want, and ignore it if they don't.

In terms of having more code and tests to support, well, that's par for the course. Will this change make the things that need to be done to support them more obvious? (E.g., "this PR is blocked because someone at Google on the Dataflow team has to fix something" vs. "this PR is blocked because the Apache Beam code in foo/bar/baz is failing, and anyone who can see the code can fix it".) The latter seems like a clear win for the community.

(As long as the code donation is handled properly, but that's completely orthogonal and I have no reason to think it wouldn't be.)

Thanks,
Dan
On Thu, Sep 13, 2018 at 11:06 AM Lukasz Cwik <[email protected]> wrote:
Yes, I'm specifically asking the community for opinions as to whether it should be accepted or not.
On Thu, Sep 13, 2018 at 10:51 AM Raghu Angadi <[email protected]> wrote:
This is terrific!
Is this thread asking for opinions from the community about whether it should be accepted? Assuming the decision is made on the Google side to contribute, big +1 from me to include it next to the other runners.
On Thu, Sep 13, 2018 at 10:38 AM Lukasz Cwik <[email protected]> wrote:
At Google we have been importing the Apache Beam code base and integrating it with the Google portion of the codebase that supports the Dataflow worker. This process is painful, as we regularly make breaking API changes to support libraries related to running portable pipelines (and sometimes in other places as well). This has sometimes made it difficult for PRs to make changes without either breaking something for Google or waiting for a Googler to make the change internally (e.g. dependency updates).

This code is very similar to the other integrations that exist for runners such as Flink/Spark/Apex/Samza. It is an adaptation layer that sits on top of an execution engine. There is no super secret awesome stuff, as this code was already publicly visible in the past when it was part of the Google Cloud Dataflow GitHub repo [1].

Process-wise, the code will need approval from Google to be donated and will have to go through the code donation process, but before we attempt that, I was wondering whether the community would object to adding this code to the master branch. The upside is that people can make breaking changes and fix them for all runners. It will also help Googlers contribute more to the portability story, as it will remove the burden of doing the code import (wasted time) and will allow people to develop in master (with the whole project loaded in a single IDE).

The downsides are that this will represent more code and unit tests to support.
1: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/tree/hotfix_v1.2/sdk/src/main/java/com/google/cloud/dataflow/sdk/runners/worker