On Fri, Sep 14, 2018 at 9:48 AM, Robert Bradshaw <[email protected]> wrote:
On Fri, Sep 14, 2018 at 8:00 AM Romain Manni-Bucau <[email protected]> wrote:
Well, the IBM runner is outside Beam, for instance, so this is not really a point IMHO.
My view is simple:
1. does this module bring anything to Beam as a project: I
understand your answer as a no (please clarify if I'm wrong)
As has been mentioned, this makes it easier for both developers at Google and developers outside Google to contribute, which is the immediate benefit. Longer term, I also hope it leads to more code sharing (there is currently an unnecessary amount of duplication due to the pain of developing across this boundary), including features that aren't yet in upstream runners but we'd like to see (e.g. liquid sharding).
This is only half true: external devs will be able to contribute but not to test, so there is no real gain here.
2. does this module bring anything to Beam or Big Data users:
same answer
Dataflow is used by many Beam users, so making it work well is in their interest as well. Whatever makes contributors' lives easier (and wastes less of their time) will translate into more contributions (new features, faster bugfixes, ...) as well.
Same point: if you can contribute to something you can't test without mocks, then you still can't work on it reliably.
So in the end this will not bring anything to the community and will just solve a Google-internal design issue, so why should it land in Beam?
I get the "we can't test it" point but this is wrong since you
can use snapshots and staging repos, if not the enhancement is
trivial enough to make it doable and not add a dead module to
beam tree.
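(For illustration only, a minimal sketch of what depending on Beam snapshot artifacts could look like for an outside contributor, assuming a plain Gradle project with the Kotlin DSL; the repository URL is the standard ASF snapshot repository, and the artifact version shown is only an example, not a statement of what is actually published.)

// build.gradle.kts -- hypothetical sketch; coordinates and version are illustrative.
plugins {
    java
}

repositories {
    mavenCentral()
    maven {
        // Unreleased Apache snapshot artifacts are published here.
        url = uri("https://repository.apache.org/content/repositories/snapshots/")
    }
}

dependencies {
    // Example: depend on a snapshot build of the Dataflow runner instead of a release.
    implementation("org.apache.beam:beam-runners-google-cloud-dataflow-java:2.8.0-SNAPSHOT")
}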
Am I missing anything?
While it's true we *can* test this without it being in Beam, as we have been doing, it's painful. It's like doing away with presubmits and only relying on postsubmits, but where you can't even look at the failure and fix it on your local machine. It's a huge time sink for all those involved, and not good for transparency or openness (e.g. there are things that only Googlers can do).
This is the case for any vendor implementation based on Beam, since by design the dependency goes in this direction.
As has been mentioned, we already do this for Flink, Spark, etc. There's also a precedent for providing connectors to even non-OSS systems, e.g. we ship the job submission portions for Dataflow, IO connectors for Amazon Kinesis, and an S3 filesystem adapter. It certainly wouldn't be to our, or our users', benefit to remove those.
Agreed, but these are modules which touch users directly: if you have some S3 bucket, you grab the module and run it against your data. In the worker case, you will never do that.
Eventually, as has been mentioned on the other thread, I hope our
interfaces become stable enough that it's easy to move much if not
all of this into the respective upstream projects. But that is
certainly not the case right now.
This is likely where the investment should be made, instead of working around it by making Beam bigger, increasing its maintenance cost for the community without real gain, and making it harder to get into.
Hopefully this helps answer your questions as to the benefits for Beam.
On Fri, Sep 14, 2018 at 7:22 AM, Reuven Lax <[email protected]> wrote:
Dataflow tests are part of the Beam postsubmits, and if a PR breaks the Dataflow runner it will probably be rolled back. Today, Beam contributors who make changes impacting the runner boundary have no way to make those changes without breaking Dataflow (unless they ask a Googler to help them). Fortunately these are not the most common changes, but they happen, and it has caused a lot of pain in the past.
Putting this code into the GitHub repository allows all runners to be modified when such a change is made, not just the non-Dataflow runners. This makes it easier for contributors and committers to make changes to Beam.
Reuven
On Thu, Sep 13, 2018 at 10:08 PM Romain Manni-Bucau <[email protected]> wrote:
Flink, Spark, and Apex are usable since they are OSS, so you grab them + Beam and you "run".
If I grab the Dataflow worker + some OSS project and "run", it is the same; however, if I grab the Dataflow worker and can't do anything with it, the added value for Beam and its users is pretty much nil, no? It just means Google should find another way to manage this dependency, if that is the case, IMHO.
Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> | Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github
<https://github.com/rmannibucau> | LinkedIn
<https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>
On Thu, Sep 13, 2018 at 11:35 PM, Lukasz Cwik <[email protected]> wrote:
Romain, the code is very similar to the adaptation layer between the shared-libraries part of Apache Beam and any other runner, for example the code within runners/spark, runners/apex, or runners/flink. If someone wanted to build an emulator of the Dataflow service, they would be able to re-use it, but that is as impractical as writing an emulator for Flink or Spark and plugging it in as the dependency for runners/flink and runners/spark respectively.
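For illustration, here is a minimal, hypothetical sketch (in Kotlin, against the Beam Java SDK) of the kind of adaptation layer being described: a runner that walks the pipeline graph and hands each transform to some execution engine. MyEngineRunner and the engine it targets are invented for this sketch and are not the Dataflow worker code.

import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.PipelineResult
import org.apache.beam.sdk.PipelineRunner
import org.apache.beam.sdk.options.PipelineOptions
import org.apache.beam.sdk.runners.TransformHierarchy

// Hypothetical adaptation layer: translates a Beam pipeline for "MyEngine".
class MyEngineRunner private constructor(private val options: PipelineOptions) :
    PipelineRunner<PipelineResult>() {

  companion object {
    // Runners conventionally expose a static fromOptions factory.
    @JvmStatic
    fun fromOptions(options: PipelineOptions): MyEngineRunner = MyEngineRunner(options)
  }

  override fun run(pipeline: Pipeline): PipelineResult {
    // Walk the pipeline graph and map each primitive transform
    // (ParDo, GroupByKey, ...) onto the engine's own operators.
    pipeline.traverseTopologically(object : Pipeline.PipelineVisitor.Defaults() {
      override fun visitPrimitiveTransform(node: TransformHierarchy.Node) {
        println("translating " + node.fullName)
      }
    })
    // A real runner would submit the translated job and return a handle
    // that the SDK can poll or cancel.
    TODO("submit to the execution engine and return its PipelineResult")
  }
}

The runners/flink, runners/spark, and runners/apex modules are of course much larger, but structurally they do this same job against their respective engines.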
On Thu, Sep 13, 2018 at 2:07 PM Raghu Angadi <[email protected]> wrote:
On Thu, Sep 13, 2018 at 12:53 PM Romain Manni-Bucau <[email protected]> wrote:
If it is usable by itself without Google karma (can you use a worker without Dataflow itself?), it sounds awesome; otherwise it sounds weird IMHO.
Can you elaborate a bit more on using the worker without Dataflow? I essentially see it as a part of the Dataflow runner. A runner is specific to a platform.
I am a Googler, but commenting as a community member.
Raghu.
On Thu, Sep 13, 2018 at 9:36 PM, Kai Jiang <[email protected]> wrote:
+1 (non-Googler)
A big help for transparency and for future runners.
Best,
Kai
On Thu, Sep 13, 2018, 11:45 Xinyu Liu <[email protected]> wrote:
Big +1 (non-Googler).
From the Samza Runner's perspective, we are very happy to see the Dataflow worker code so we can learn and compete :).
Thanks,
Xinyu
On Thu, Sep 13, 2018 at 11:34 AM Suneel Marthi <[email protected]> wrote:
+1 (non-Googler)
This is a great 👍 move.
On Sep 13, 2018, at 2:25 PM, Tim Robertson <[email protected]> wrote:
+1 (non-Googler)
It sounds pragmatic, helps with transparency should issues arise, and enables more people to fix things.
On Thu, Sep 13, 2018 at 8:15 PM Dan Halperin <[email protected]> wrote:
From my perspective as a (non-Google) community member, huge +1.

I don't see anything bad for the community about open sourcing more of the probably-most-used runner. While the DirectRunner is probably still the reference implementation of Beam, it can't hurt to see more working code. Other runners or runner implementors can refer to this code if they want, and ignore it if they don't.

In terms of having more code and tests to support, well, that's par for the course. Will this change make the things that need to be done to support them more obvious? (E.g., "this PR is blocked because someone at Google on the Dataflow team has to fix something" vs. "this PR is blocked because the Apache Beam code in foo/bar/baz is failing, and anyone who can see the code can fix it".) The latter seems like a clear win for the community.

(As long as the code donation is handled properly, but that's completely orthogonal and I have no reason to think it wouldn't be.)

Thanks,
Dan
On Thu, Sep 13, 2018 at 11:06 AM Lukasz Cwik <[email protected]> wrote:
Yes, I'm specifically asking the community for opinions as to whether it should be accepted or not.
On Thu, Sep 13, 2018 at 10:51 AM Raghu Angadi <[email protected]> wrote:
This is terrific!
Is this thread asking for opinions from the community about whether it should be accepted? Assuming the decision is made on the Google side to contribute, big +1 from me to include it next to the other runners.
On Thu, Sep 13, 2018 at 10:38 AM Lukasz Cwik <[email protected]> wrote:
At Google we have been importing the Apache Beam code base and integrating it with the Google portion of the codebase that supports the Dataflow worker. This process is painful, as we regularly make breaking API changes to support libraries related to running portable pipelines (and sometimes in other places as well). This has sometimes made it difficult for PRs to make changes without either breaking something for Google or waiting for a Googler to make the change internally (e.g. dependency updates).

This code is very similar to the other integrations that exist for runners such as Flink/Spark/Apex/Samza. It is an adaptation layer that sits on top of an execution engine. There is no super secret awesome stuff, as this code was already publicly visible in the past when it was part of the Google Cloud Dataflow GitHub repo [1].

Process-wise, the code will need approval from Google to be donated and will have to go through the code donation process, but before we attempt that, I was wondering whether the community would object to adding this code to the master branch. The upside is that people can make breaking changes and fix them for all runners. It will also help Googlers contribute more to the portability story, as it will remove the burden of doing the code import (wasted time) and will allow people to develop in master (with the whole project loaded in a single IDE).

The downsides are that this will represent more code and unit tests to support.
1: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/tree/hotfix_v1.2/sdk/src/main/java/com/google/cloud/dataflow/sdk/runners/worker