On Fri, Sep 14, 2018 at 10:02 AM Romain Manni-Bucau <[email protected]>
wrote:

>
> On Fri, Sep 14, 2018 at 9:48 AM Robert Bradshaw <[email protected]> wrote:
>
>> On Fri, Sep 14, 2018 at 8:00 AM Romain Manni-Bucau <[email protected]>
>> wrote:
>>
>>> Well, the IBM runner is outside Beam, for instance, so this is not really a
>>> point IMHO.
>>>
>>> My view is simple:
>>> 1. does this module bring anything to Beam as a project: I understand
>>> your answer as a no (please clarify if I'm wrong)
>>>
>>
>> As has been mentioned, this makes it easier for both developers at Google
>> and developers outside Google to contribute, which is the immediate
>> benefit. Longer term I also hope it leads to more code sharing (there is
>> currently an unnecessary amount of duplication due to the pain of
>> developing across this boundary), including of features that aren't yet in
>> upstream runners but we'd like to see (e.g. liquid sharding).
>>
>
> This is half true: external devs will be able to contribute but not to test,
> so there is no real gain here.
>
>
>>
>>
>>> 2. does this module bring anything to Beam or Big Data users: same answer
>>>
>>
>> Dataflow is used by many Beam users, so making it work well is in their
>> interest as well. Whatever makes contributors' lives easier (and wastes
>> less of their time) will translate into more contributions (new features,
>> faster bugfixes, ...) as well.
>>
>
> Same point: if you can contribute to something you can't test without
> mocks, then you still can't work on it reliably.
>

It won't solve the situation, but it'll make it much better.

Whenever you're interfacing two systems, you have to draw the line
somewhere. Rather than staying abstract, let me illustrate why in this case
moving the code (aka moving the line) is concretely better.

Currently the Dataflow worker looks something like

DataflowService <---DataflowRpcs---> DataflowWorkerAdaptors
<---RunnerLibApi---> CommonWorkerCode <---SdkApi---> UserCode.

The first two components ship with Dataflow, and the latter two are provided
by the user in their jar. We'd like to move to the latter three being
developed and compiled together. Right now the RunnerLibApi is under active
development, and the XxxWorkerAdaptors live in the Beam repository for most
runners, but not for Dataflow, making them easy to evolve along with it.
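
To make that boundary concrete, here is a very rough Java sketch of the
layering. All of the names below are made up for illustration; they are not
the actual worker classes.

  // Purely illustrative; hypothetical names, not the real Dataflow worker code.

  // A unit of work handed to the worker (placeholder type for this sketch).
  class WorkItem {}

  // RunnerLibApi: the still-evolving boundary between the runner-specific
  // adapters and the shared worker code (CommonWorkerCode).
  interface WorkItemExecutor {
    void execute(WorkItem item);
  }

  // DataflowWorkerAdaptors: turns Dataflow service RPCs into calls across the
  // RunnerLibApi boundary. Today this piece is compiled against an imported
  // snapshot of Beam rather than living in the Beam repository.
  class DataflowWorkerHarness {
    private final WorkItemExecutor executor;

    DataflowWorkerHarness(WorkItemExecutor executor) {
      this.executor = executor;
    }

    // Invoked when a work item arrives over DataflowRpcs.
    void onWorkItemRpc(WorkItem item) {
      executor.execute(item);
    }
  }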

Say you want to refactor RunnerLibApi to add a parameter to a method. It's
an easy change, possibly mostly automated with an IDE. But then the
postcommits test it against the Dataflow runner, and someone gets a class
loader or method-not-found exception that is reproducible only on some worker
in the Dataflow service, which is much more painful than the compiler error
(no mocks needed) you would have gotten locally, earlier in the development
cycle. So you have to roll the change back and do it in phases: keep the old
method as a delegate to the new one, get someone at Google to move Dataflow
from the old method to the new one, and then, once that's all pushed out,
delete the old method. And this is for a fairly simple change. (In too many
cases the last step or two is never carried through, or hackier workarounds
are chosen to avoid the process entirely, which leaves cruft in the Beam
codebase.)
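
For illustration, assuming RunnerLibApi is ordinary Java interfaces (the
names below are made up), the interim state of that dance looks roughly like:

  // Hypothetical RunnerLibApi interface caught mid-migration; illustrative only.
  class Bundle {}

  class ProcessingContext {
    static ProcessingContext defaultContext() {
      return new ProcessingContext();
    }
  }

  interface BundleProcessor {

    // The new signature the refactoring wants.
    void processBundle(Bundle bundle, ProcessingContext context);

    // The old signature, kept only so a Dataflow worker compiled against an
    // earlier snapshot keeps linking. It cannot be deleted until Dataflow has
    // been moved to the new method and that change has been rolled out.
    @Deprecated
    default void processBundle(Bundle bundle) {
      processBundle(bundle, ProcessingContext.defaultContext());
    }
  }

If the worker adapters lived in the same repository, that refactoring would be
a single compile-time change and the deprecated shim would never need to exist.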

On the other hand, the DataflowRpcs have been stable for a long time, and
when changes are made, they're almost always made by Googlers who are in a
position to pay this price.

The crux of the problem is that RunnerLibApi is not (yet) a stable API.


>>> So at the end this will not bring anything to the community and will just
>>> solve a Google-internal design issue, so why should it hit Beam?
>>> I get the "we can't test it" point, but this is wrong since you can use
>>> snapshots and staging repos; if not, the enhancement is trivial enough to
>>> make it doable without adding a dead module to the Beam tree.
>>>
>>> Am I missing anything?
>>>
>>
>> While it's true we *can* test this without it being in Beam, as we have
>> been doing, it's painful. It's like doing away with presubmits and only
>> relying on postsubmits, but where you can't even look at the failure and
>> fix it on your local machine. It's a huge time sink for all those involved,
>> and not good for transparency or openness (e.g. there are things that only
>> Googlers can do).
>>
>
> This is the case for any vendor impl based on Beam since by design the
> dependency is in this direction.
>
>
>>
>> As has been mentioned, we already do this for Flink, Spark, etc. There's
>> also a precedent for providing connectors to even non-OSS systems, e.g. we
>> ship the job submission portions for Dataflow, IO connectors for Amazon
>> Kinesis, and an S3 filesystem adapter. It certainly wouldn't be in our, or
>> our users', benefit to remove those.
>>
>
> Agreed, but these are modules that touch users directly, i.e. if you have
> some S3 bucket, you grab the module and run it on your data. In the worker
> case, you will never do that.
>
>
>>
>> Eventually, as has been mentioned on the other thread, I hope our
>> interfaces become stable enough that it's easy to move much if not all of
>> this into the respective upstream projects. But that is certainly not the
>> case right now.
>>
>
> This is likely where investment should be made, instead of working around it
> by making Beam bigger, increasing its maintenance cost for the community
> without real gain, and making it harder to get into.
>
>>
Yes, this is definitely the direction I'd like to see things go. But right
now a huge amount of development is happening in the runner <-> worker
interface which is still very much in flux, so I don't think that's
feasible to pursue until things settle down here.

This is why I'm in favor of this move (which also brings other benefits, like
transparency and more potential for code sharing).



> Hopefully this helps answer your questions as to the benefits for Beam.
>>
>>
>>> On Fri, Sep 14, 2018 at 7:22 AM Reuven Lax <[email protected]> wrote:
>>>
>>>> Dataflow tests are part of the Beam postsubmits, and if a PR breaks the
>>>> Dataflow runner it will probably be rolled back. Today Beam contributors
>>>> who make changes impacting the runner boundary have no way to make those
>>>> changes without breaking Dataflow (unless they ask a Googler to help them).
>>>> Fortunately these are not the most common changes, but they happen, and
>>>> they've caused a lot of pain in the past.
>>>>
>>>> Putting this code into the github repository allows all runners to be
>>>> modified when such a change is made, not just the non-Dataflow runners.
>>>> This makes it easier for contributors and committers to make changes to
>>>> Beam.
>>>>
>>>> Reuven
>>>>
>>>> On Thu, Sep 13, 2018 at 10:08 PM Romain Manni-Bucau <
>>>> [email protected]> wrote:
>>>>
>>>>> Flink, Spark, Apex are usable since they are open source, so you grab
>>>>> them + Beam and you "run".
>>>>> If I grab the Dataflow worker + some other OS project and "run", it is
>>>>> the same; however, if I grab the Dataflow worker and can't do anything
>>>>> with it, the added value for Beam and users is pretty much nil, no? That
>>>>> just means Google should find another way to manage this dependency, if
>>>>> that is the case, IMHO.
>>>>>
>>>>> Romain Manni-Bucau
>>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>>
>>>>>
>>>>> On Thu, Sep 13, 2018 at 11:35 PM Lukasz Cwik <[email protected]> wrote:
>>>>>
>>>>>> Romain, the code is very similar to the adaptation layer between the
>>>>>> shared libraries part of Apache Beam and any other runner, for example
>>>>>> the code within runners/spark or runners/apex or runners/flink.
>>>>>> If someone wanted to build an emulator of the Dataflow service, they
>>>>>> would be able to re-use it, but that is as impractical as writing an
>>>>>> emulator for Flink or Spark and plugging it in as the dependency for
>>>>>> runners/flink and runners/spark respectively.
>>>>>>
>>>>>> On Thu, Sep 13, 2018 at 2:07 PM Raghu Angadi <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> On Thu, Sep 13, 2018 at 12:53 PM Romain Manni-Bucau <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> If it's usable by itself without Google karma (can you use a worker
>>>>>>>> without Dataflow itself?), it sounds awesome; otherwise it sounds
>>>>>>>> weird IMHO.
>>>>>>>>
>>>>>>>
>>>>>>> Can you elaborate a bit more on using the worker without Dataflow? I
>>>>>>> essentially see it as part of the Dataflow runner. A runner is specific
>>>>>>> to a platform.
>>>>>>>
>>>>>>> I am a Googler, but commenting as a community member.
>>>>>>>
>>>>>>> Raghu.
>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Sep 13, 2018, 9:36 PM Kai Jiang <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> +1 (non googler)
>>>>>>>>>
>>>>>>>>> big help for transparency and for future runners.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Kai
>>>>>>>>>
>>>>>>>>> On Thu, Sep 13, 2018, 11:45 Xinyu Liu <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Big +1 (non-googler).
>>>>>>>>>>
>>>>>>>>>> From Samza Runner's perspective, we are very happy to see
>>>>>>>>>> dataflow worker code so we can learn and compete :).
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Xinyu
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 13, 2018 at 11:34 AM Suneel Marthi <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1 (non-googler)
>>>>>>>>>>>
>>>>>>>>>>> This is a great 👍 move
>>>>>>>>>>>
>>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>>
>>>>>>>>>>> On Sep 13, 2018, at 2:25 PM, Tim Robertson <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> +1 (non googler)
>>>>>>>>>>> It sounds pragmatic, helps with transparency should issues arise,
>>>>>>>>>>> and enables more people to fix them.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Sep 13, 2018 at 8:15 PM Dan Halperin <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> From my perspective as a (non-Google) community member, huge +1.
>>>>>>>>>>>>
>>>>>>>>>>>> I don't see anything bad for the community about open sourcing
>>>>>>>>>>>> more of the probably-most-used runner. While the DirectRunner is
>>>>>>>>>>>> probably still the reference implementation of Beam, it can't hurt
>>>>>>>>>>>> to see more working code. Other runners or runner implementors can
>>>>>>>>>>>> refer to this code if they want, and ignore it if they don't.
>>>>>>>>>>>>
>>>>>>>>>>>> In terms of having more code and tests to support, well, that's
>>>>>>>>>>>> par for the course. Will this change make the things that need to
>>>>>>>>>>>> be done to support them more obvious? (E.g., "this PR is blocked
>>>>>>>>>>>> because someone at Google on the Dataflow team has to fix
>>>>>>>>>>>> something" vs "this PR is blocked because the Apache Beam code in
>>>>>>>>>>>> foo/bar/baz is failing, and anyone who can see the code can fix
>>>>>>>>>>>> it"). The latter seems like a clear win for the community.
>>>>>>>>>>>>
>>>>>>>>>>>> (As long as the code donation is handled properly, but that's
>>>>>>>>>>>> completely orthogonal and I have no reason to think it wouldn't be.)
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Dan
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Sep 13, 2018 at 11:06 AM Lukasz Cwik <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, I'm specifically asking the community for opinions as to
>>>>>>>>>>>>> whether it should be accepted or not.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Sep 13, 2018 at 10:51 AM Raghu Angadi <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is terrific!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is this thread asking for opinions from the community about
>>>>>>>>>>>>>> whether it should be accepted? Assuming the Google-side decision
>>>>>>>>>>>>>> is made to contribute, big +1 from me to include it next to the
>>>>>>>>>>>>>> other runners.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Sep 13, 2018 at 10:38 AM Lukasz Cwik <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> At Google we have been importing the Apache Beam code base
>>>>>>>>>>>>>>> and integrating it with the Google portion of the codebase that
>>>>>>>>>>>>>>> supports the Dataflow worker. This process is painful as we are
>>>>>>>>>>>>>>> regularly making breaking API changes to the support libraries
>>>>>>>>>>>>>>> related to running portable pipelines (and sometimes in other
>>>>>>>>>>>>>>> places as well). This has sometimes made it difficult for PRs
>>>>>>>>>>>>>>> to make changes without either breaking something for Google or
>>>>>>>>>>>>>>> waiting for a Googler to make the change internally (e.g.
>>>>>>>>>>>>>>> dependency updates).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This code is very similar to the other integrations that
>>>>>>>>>>>>>>> exist for runners such as Flink/Spark/Apex/Samza. It is an
>>>>>>>>>>>>>>> adaptation layer that sits on top of an execution engine. There
>>>>>>>>>>>>>>> is no super secret awesome stuff, as this code was already
>>>>>>>>>>>>>>> publicly visible in the past when it was part of the Google
>>>>>>>>>>>>>>> Cloud Dataflow GitHub repo[1].
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Process-wise, the code will need to get approval from Google
>>>>>>>>>>>>>>> to be donated and to go through the code donation process, but
>>>>>>>>>>>>>>> before we attempt to do that, I was wondering whether the
>>>>>>>>>>>>>>> community would object to adding this code to the master
>>>>>>>>>>>>>>> branch?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The upside is that people can make breaking changes and fix
>>>>>>>>>>>>>>> them for all runners. It will also help Googlers contribute
>>>>>>>>>>>>>>> more to the portability story, as it will remove the burden of
>>>>>>>>>>>>>>> doing the code import (wasted time) and it will allow people to
>>>>>>>>>>>>>>> develop in master (with the whole project loaded in a single
>>>>>>>>>>>>>>> IDE).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The downsides are that this will represent more code and
>>>>>>>>>>>>>>> unit tests to support.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1:
>>>>>>>>>>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/tree/hotfix_v1.2/sdk/src/main/java/com/google/cloud/dataflow/sdk/runners/worker
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
