Re: Donating the Dataflow Worker code to Apache Beam

Lukasz Cwik Thu, 13 Sep 2018 14:35:17 -0700

Romain, the code is very similar to the adaptation layer between the shared
libraries that are part of Apache Beam and any other runner, for example the
code within runners/spark or runners/apex or runners/flink.
If someone wanted to build an emulator of the Dataflow service, they would
be able to re-use this code, but that is as impractical as writing an emulator
for Flink or Spark and plugging them in as the dependency for runners/flink
and runners/spark respectively.


On Thu, Sep 13, 2018 at 2:07 PM Raghu Angadi <rang...@google.com> wrote:

> On Thu, Sep 13, 2018 at 12:53 PM Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> If it is usable by itself without Google karma (can you use a worker
>> without Dataflow itself?), it sounds awesome; otherwise it sounds weird, IMHO.
>>
>
> Can you elaborate a bit more on using a worker without Dataflow? I
> essentially see that as a part of the Dataflow runner. A runner is specific
> to a platform.
>
> I am a Googler, but commenting as a community member.
>
> Raghu.
>
>>
>> On Thu, Sep 13, 2018, 21:36, Kai Jiang <jiang...@gmail.com> wrote:
>>
>>> +1 (non googler)
>>>
>>> big help for transparency and for future runners.
>>>
>>> Best,
>>> Kai
>>>
>>> On Thu, Sep 13, 2018, 11:45 Xinyu Liu <xinyuliu...@gmail.com> wrote:
>>>
>>>> Big +1 (non-googler).
>>>>
>>>> From Samza Runner's perspective, we are very happy to see dataflow
>>>> worker code so we can learn and compete :).
>>>>
>>>> Thanks,
>>>> Xinyu
>>>>
>>>> On Thu, Sep 13, 2018 at 11:34 AM Suneel Marthi <suneel.mar...@gmail.com>
>>>> wrote:
>>>>
>>>>> +1 (non-googler)
>>>>>
>>>>> This is a great 👍 move
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Sep 13, 2018, at 2:25 PM, Tim Robertson <timrobertson...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> +1 (non googler)
>>>>> It sounds pragmatic, helps with transparency should issues arise and
>>>>> enables more people to fix.
>>>>>
>>>>>
>>>>> On Thu, Sep 13, 2018 at 8:15 PM Dan Halperin <dhalp...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> From my perspective as a (non-Google) community member, huge +1.
>>>>>>
>>>>>> I don't see anything bad for the community about open sourcing more
>>>>>> of the probably-most-used runner. While the DirectRunner is probably
>>>>>> still the most referential implementation of Beam, it can't hurt to see
>>>>>> more working code. Other runners or runner implementors can refer to
>>>>>> this code if they want, and ignore it if they don't.
>>>>>>
>>>>>> In terms of having more code and tests to support, well, that's par
>>>>>> for the course. Will this change make the things that need to be done to
>>>>>> support them more obvious? (E.g., "this PR is blocked because someone at
>>>>>> Google on Dataflow team has to fix something" vs "this PR is blocked
>>>>>> because the Apache Beam code in foo/bar/baz is failing, and anyone who
>>>>>> can see the code can fix it"). The latter seems like a clear win for the
>>>>>> community.
>>>>>>
>>>>>> (As long as the code donation is handled properly, but that's
>>>>>> completely orthogonal and I have no reason to think it wouldn't be.)
>>>>>>
>>>>>> Thanks,
>>>>>> Dan
>>>>>>
>>>>>> On Thu, Sep 13, 2018 at 11:06 AM Lukasz Cwik <lc...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes, I'm specifically asking the community for opinions as to
>>>>>>> whether it should be accepted or not.
>>>>>>>
>>>>>>> On Thu, Sep 13, 2018 at 10:51 AM Raghu Angadi <rang...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> This is terrific!
>>>>>>>>
>>>>>>>> Is this thread asking for opinions from the community about whether
>>>>>>>> it should be accepted? Assuming the Google-side decision is made to
>>>>>>>> contribute, big +1 from me to include it next to the other runners.
>>>>>>>>
>>>>>>>> On Thu, Sep 13, 2018 at 10:38 AM Lukasz Cwik <lc...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> At Google we have been importing the Apache Beam code base and
>>>>>>>>> integrating it with the Google portion of the codebase that supports
>>>>>>>>> the Dataflow worker. This process is painful as we are regularly making
>>>>>>>>> breaking API changes to support libraries related to running portable
>>>>>>>>> pipelines (and sometimes in other places as well). This has sometimes
>>>>>>>>> made it difficult for PRs to make changes without either breaking
>>>>>>>>> something for Google or waiting for a Googler to make the change
>>>>>>>>> internally (e.g. dependency updates).
>>>>>>>>>
>>>>>>>>> This code is very similar to the other integrations that exist for
>>>>>>>>> runners such as Flink/Spark/Apex/Samza. It is an adaptation layer that
>>>>>>>>> sits on top of an execution engine. There is no super secret awesome
>>>>>>>>> stuff, as this code was already publicly visible in the past when it was
>>>>>>>>> part of the Google Cloud Dataflow GitHub repo[1].
>>>>>>>>>
>>>>>>>>> Process-wise, the code will need to get approval from Google to be
>>>>>>>>> donated and then go through the code donation process, but before we
>>>>>>>>> attempt to do that, I was wondering whether the community would object
>>>>>>>>> to adding this code to the master branch?
>>>>>>>>>
>>>>>>>>> The upside is that people can make breaking changes and fix them
>>>>>>>>> for all runners. It will also help Googlers contribute more to the
>>>>>>>>> portability story as it will remove the burden of doing the code
>>>>>>>>> import (wasted time) and it will allow people to develop in master (can
>>>>>>>>> have the whole project loaded in a single IDE).
>>>>>>>>>
>>>>>>>>> The downside is that this will represent more code and unit
>>>>>>>>> tests to support.
>>>>>>>>>
>>>>>>>>> 1:
>>>>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/tree/hotfix_v1.2/sdk/src/main/java/com/google/cloud/dataflow/sdk/runners/worker
>>>>>>>>>
>>>>>>>>
