Re: Donating the Dataflow Worker code to Apache Beam

Romain Manni-Bucau Thu, 13 Sep 2018 23:01:20 -0700

Well, the IBM runner lives outside Beam, for instance, so this is not really
a point IMHO.


My view is simple:
1. does this module bring anything to Beam as a project: I understand your
answer as a no (please clarify if I'm wrong)
2. does this module bring anything to Beam or Big Data users: same answer

So in the end this will not bring anything to the community and will just solve
a Google-internal design issue, so why should it land in Beam?
I get the "we can't test it" point, but it is wrong, since you can use
snapshots and staging repos; if not, the enhancement is trivial enough to
make that doable without adding a dead module to the Beam tree.

Am I missing anything?

Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>


On Fri, Sep 14, 2018 at 07:22, Reuven Lax <re...@google.com> wrote:

> Dataflow tests are part of Beam post submit, and if a PR breaks the
> Dataflow runner it will probably be rolled back. Today Beam contributors
> that make changes impacting the runner boundary have no way to make those
> changes without breaking Dataflow (unless they ask a Googler to help them).
> Fortunately these are not the most common changes, but they happen, and
> it's caused a lot of pain in the past.
>
> Putting this code into the github repository allows all runners to be
> modified when such a change is made, not just the non-Dataflow runners.
> This makes it easier for contributors and committers to make changes to
> Beam.
>
> Reuven
>
> On Thu, Sep 13, 2018 at 10:08 PM Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> Flink, Spark, and Apex are usable since they are open source, so you grab
>> them + Beam and you "run".
>> If I grab the Dataflow worker + some open-source project X and "run", it is
>> the same; however, if I grab the Dataflow worker and can't do anything with
>> it, the added value for Beam and users is pretty much nil, no? It just means
>> Google should find another way to manage this dependency if this is the
>> case, IMHO.
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau> | Book
>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>
>>
>> On Thu, Sep 13, 2018 at 23:35, Lukasz Cwik <lc...@google.com> wrote:
>>
>>> Romain, the code is very similar to the adaptation layer between the
>>> shared libraries part of Apache Beam and any other runner, for example the
>>> code within runners/spark or runners/apex or runners/flink.
>>> If someone wanted to build an emulator of the Dataflow service, they
>>> would be able to re-use this code, but that is as impractical as writing an
>>> emulator for Flink or Spark and plugging it in as the dependency for
>>> runners/flink and runners/spark respectively.
>>>
>>> On Thu, Sep 13, 2018 at 2:07 PM Raghu Angadi <rang...@google.com> wrote:
>>>
>>>> On Thu, Sep 13, 2018 at 12:53 PM Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>
>>>>> If it is usable by itself without Google karma (can you use a worker
>>>>> without Dataflow itself?), it sounds awesome; otherwise it sounds weird IMHO.
>>>>>
>>>>
>>>> Can you elaborate a bit more on using the worker without Dataflow? I
>>>> essentially see it as a part of the Dataflow runner. A runner is specific
>>>> to a platform.
>>>>
>>>> I am a Googler, but commenting as a community member.
>>>>
>>>> Raghu.
>>>>
>>>>>
>>>>> On Thu, Sep 13, 2018, 21:36 Kai Jiang <jiang...@gmail.com> wrote:
>>>>>
>>>>>> +1 (non googler)
>>>>>>
>>>>>> big help for transparency and for future runners.
>>>>>>
>>>>>> Best,
>>>>>> Kai
>>>>>>
>>>>>> On Thu, Sep 13, 2018, 11:45 Xinyu Liu <xinyuliu...@gmail.com> wrote:
>>>>>>
>>>>>>> Big +1 (non-googler).
>>>>>>>
>>>>>>> From Samza Runner's perspective, we are very happy to see dataflow
>>>>>>> worker code so we can learn and compete :).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Xinyu
>>>>>>>
>>>>>>> On Thu, Sep 13, 2018 at 11:34 AM Suneel Marthi <
>>>>>>> suneel.mar...@gmail.com> wrote:
>>>>>>>
>>>>>>>> +1 (non-googler)
>>>>>>>>
>>>>>>>> This is a great 👍 move
>>>>>>>>
>>>>>>>> Sent from my iPhone
>>>>>>>>
>>>>>>>> On Sep 13, 2018, at 2:25 PM, Tim Robertson <
>>>>>>>> timrobertson...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> +1 (non googler)
>>>>>>>> It sounds pragmatic, helps with transparency should issues arise,
>>>>>>>> and enables more people to fix them.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Sep 13, 2018 at 8:15 PM Dan Halperin <dhalp...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> From my perspective as a (non-Google) community member, huge +1.
>>>>>>>>>
>>>>>>>>> I don't see anything bad for the community about open sourcing
>>>>>>>>> more of the probably-most-used runner. While the DirectRunner is
>>>>>>>>> probably still the most referential implementation of Beam, can't
>>>>>>>>> hurt to see more working code. Other runners or runner implementors
>>>>>>>>> can refer to this code if they want, and ignore it if they don't.
>>>>>>>>>
>>>>>>>>> In terms of having more code and tests to support, well, that's
>>>>>>>>> par for the course. Will this change make the things that need to be
>>>>>>>>> done to support them more obvious? (E.g., "this PR is blocked because
>>>>>>>>> someone at Google on Dataflow team has to fix something" vs "this PR
>>>>>>>>> is blocked because the Apache Beam code in foo/bar/baz is failing, and
>>>>>>>>> anyone who can see the code can fix it"). The latter seems like a
>>>>>>>>> clear win for the community.
>>>>>>>>>
>>>>>>>>> (As long as the code donation is handled properly, but that's
>>>>>>>>> completely orthogonal and I have no reason to think it wouldn't be.)
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Dan
>>>>>>>>>
>>>>>>>>> On Thu, Sep 13, 2018 at 11:06 AM Lukasz Cwik <lc...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Yes, I'm specifically asking the community for opinions as to
>>>>>>>>>> whether it should be accepted or not.
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 13, 2018 at 10:51 AM Raghu Angadi <rang...@google.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> This is terrific!
>>>>>>>>>>>
>>>>>>>>>>> Is this thread asking the community for opinions on whether it
>>>>>>>>>>> should be accepted? Assuming the Google-side decision is made to
>>>>>>>>>>> contribute, big +1 from me to include it next to other runners.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Sep 13, 2018 at 10:38 AM Lukasz Cwik <lc...@google.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> At Google we have been importing the Apache Beam code base and
>>>>>>>>>>>> integrating it with the Google portion of the codebase that
>>>>>>>>>>>> supports the Dataflow worker. This process is painful as we are
>>>>>>>>>>>> regularly making breaking API changes to support libraries related
>>>>>>>>>>>> to running portable pipelines (and sometimes in other places as
>>>>>>>>>>>> well). This has sometimes made it difficult for PRs to make
>>>>>>>>>>>> changes without either breaking something for Google or waiting
>>>>>>>>>>>> for a Googler to make the change internally (e.g. dependency
>>>>>>>>>>>> updates).
>>>>>>>>>>>>
>>>>>>>>>>>> This code is very similar to the other integrations that exist
>>>>>>>>>>>> for runners such as Flink/Spark/Apex/Samza. It is an adaptation
>>>>>>>>>>>> layer that sits on top of an execution engine. There is no super
>>>>>>>>>>>> secret awesome stuff, as this code was already publicly visible in
>>>>>>>>>>>> the past when it was part of the Google Cloud Dataflow github
>>>>>>>>>>>> repo[1].
>>>>>>>>>>>>
>>>>>>>>>>>> Process-wise, the code will need to get approval from Google to
>>>>>>>>>>>> be donated and will need to go through the code donation process,
>>>>>>>>>>>> but before we attempt to do that, I was wondering whether the
>>>>>>>>>>>> community would object to adding this code to the master branch?
>>>>>>>>>>>>
>>>>>>>>>>>> The upside is that people can make breaking changes and fix them
>>>>>>>>>>>> for all runners. It will also help Googlers contribute more to the
>>>>>>>>>>>> portability story as it will remove the burden of doing the code
>>>>>>>>>>>> import (wasted time) and it will allow people to develop in master
>>>>>>>>>>>> (can have the whole project loaded in a single IDE).
>>>>>>>>>>>>
>>>>>>>>>>>> The downside is that this will represent more code and unit
>>>>>>>>>>>> tests to support.
>>>>>>>>>>>>
>>>>>>>>>>>> 1:
>>>>>>>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/tree/hotfix_v1.2/sdk/src/main/java/com/google/cloud/dataflow/sdk/runners/worker
>>>>>>>>>>>>
>>>>>>>>>>>
