Hi Ismaël

Did you get some free time to perform a code review on the pull request

Best Regards
Pulasthi

On Tue, Mar 10, 2020 at 3:30 PM Luke Cwik <[email protected]> wrote:

> I have to disagree. Allowing for runners within the Apache Beam repo and
> SDKs that reach into the implementation details of each other are
> usability, feature development, maintenance and complexity problems.
>
> The usability issue comes from our public core facing APIs exposing
> methods that runners "need" so they can introspect details that shouldn't
> be visible to them (e.g. setWindowingStrategyInternal on PCollection).
> Getting to 1 would remove the pipeline construction time instances but not
> the execution side ones and there are currently 100+ usages of
> the @Internal annotation.
>
> The feature development and maintenance issues both stem from duplication
> of work. We need to have at least two copies of how to do something, one
> that is for runner -> SDK direct and one for Fn API. An example of this is
> the timer family work which was started and completed for the non portable
> implementation yet the portable implementation was left as future work.
>
> Finally, the complexity comes from how many layers we have that wrap
> existing components to create variants for different use cases. I'm looking
> at all the DoFnRunners and each of their variants and how those have layers
> within themselves within the SDK and how additional layers have been made
> to interface with runner specific internal details.
>
>
> On Tue, Mar 10, 2020 at 12:07 PM Kenneth Knowles <[email protected]> wrote:
>
>> I do support all the efforts to get Dataflow, Flink, and Spark to 3 (Fn
>> API). But I disagree with it as a requirement; the whole point of
>> ptransforms with URNs is that if the runner can figure out how to execute
>> it according to semantics, then it is fine. A runner meets (1) and (2) but
>> can only run certain subset of DoFns is allowed by design (whether the
>> subset is based on language, state/timer support, etc).
>>
>> Kenn
>>
>> On Tue, Mar 10, 2020 at 9:45 AM Luke Cwik <[email protected]> wrote:
>>
>>> I would like to move away from having runners access APIs that are
>>> related to pipeline construction and other internal SDK APIs and I would
>>> like for SDKs to not inspect internal runner APIs. This would enable the
>>> community to improve each independently without needing to fix the world
>>> all the time and would enable the community to run a cluster that supports
>>> multiple Beam versions at the same time and would also allow for the
>>> cluster to be updated independently of the pipelines it runs.
>>>
>>> As a community, I believe we need to achieve 1, 2 and 3. Outside of the
>>> Apache Beam repo, anyone can do whatever they want but there should be no
>>> compatibility guarantees.
>>>
>>> 4 and 5 are extensions that enable a richer set of pipelines to run and
>>> are optional like many other parts such as if a runner supports metrics
>>> aggregation or dynamic work rebalancing.
>>>
>>> On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles <[email protected]> wrote:
>>>
>>>> There are a lot of different meanings to "portable runner". Here are
>>>> some:
>>>>
>>>> (1) A runner that accepts a pipeline proto and either runs it or says
>>>> it cannot run it
>>>> (2) A runner that accepts jobs via the job management APIs
>>>> (3) A runner that executes UDFs via the Fn API
>>>> (4) A runner that can execute multiple languages
>>>> (5) A runner that can run cross-language transforms aka multiple
>>>> languages in the same pipeline
>>>>
>>>> I think (1) is a very good bar, and (2) is a nice addition on top of
>>>> that. Then we have a unified way to submit pipelines and understand their
>>>> status.
>>>>
>>>> I think (3) is optional - a runner can run things however it likes,
>>>> including with native implementations. And then (4) and (5) as well are
>>>> just levels of feature capabilities.
>>>>
>>>> Kenn
>>>>
>>>> On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik <[email protected]> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> One last thing, for any runner after this one... wouldn't it be a
>>>>>> good acceptance criteria to only accept portable implementations anymore?
>>>>>>
>>>>>>  _/
>>>>>> _/ Alex Van Boxel
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Good points Kenn. I think we mostly agree on what has been discussed
>>>>>>> in this
>>>>>>> thread the pros/cons of having runners on our repository, but this
>>>>>>> is probably
>>>>>>> not the best moment in time to change any policy in that aspect.
>>>>>>>
>>>>>>> So if nobody objects I think we can proceed. I am OOO this week so
>>>>>>> with less
>>>>>>> time to continue with the code review, but I will be back to finish
>>>>>>> the review
>>>>>>> and hopefully finally get this merged with Pulasthi next week (sorry
>>>>>>> for the
>>>>>>> delay).
>>>>>>>
>>>>>>> > (don't wait for me on code review - if Ismaël said it is good,
>>>>>>> then it is
>>>>>>> > good.)
>>>>>>>
>>>>>>> Thanks for your confidence. Twister2 runners looks good so far, but
>>>>>>> I will
>>>>>>> confirm 100% next week :) In the meantime if someone has some extra
>>>>>>> cycles to
>>>>>>> take a look extra feedback is always welcome.
>>>>>>>
>>>>>>> On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles <[email protected]>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > I haven't heard anyone suggest that we need a vote. I haven't
>>>>>>> heard anyone object to this being merged to master. Some time ago, we
>>>>>>> mostly decided to favor master instead of branches, because it is so 
>>>>>>> much
>>>>>>> smoother for contributors and users.
>>>>>>> >
>>>>>>> > So I am poking this thread one last time and otherwise I would
>>>>>>> consider it consensus that once code review is done the runner is a 
>>>>>>> part of
>>>>>>> Beam (experimental!).
>>>>>>> >
>>>>>>> > (don't wait for me on code review - if Ismaël said it is good,
>>>>>>> then it is good.)
>>>>>>> >
>>>>>>> > Kenn
>>>>>>> >
>>>>>>> > On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe <
>>>>>>> [email protected]> wrote:
>>>>>>> >>
>>>>>>> >> I understand that the discussion is on a more broad level than
>>>>>>> the Twister2 runner. From my experience developing the runner the main
>>>>>>> advantage of being inside the beam project was the easy access to the 
>>>>>>> wide
>>>>>>> range of tests and other core/utility code as Kyle pointed out. 
>>>>>>> Unmerging
>>>>>>> runners that are not properly maintained and updated would be the most
>>>>>>> logical path to follow since the internals of the runners are only well
>>>>>>> understood by developers of that particular project. It would be
>>>>>>> unreasonable to expect the Beam community to maintain them. And since 
>>>>>>> the
>>>>>>> runners do not alter the core API's I assume they would be easy to 
>>>>>>> unmerge
>>>>>>> if the need arises.
>>>>>>> >>
>>>>>>> >> Talking specifically about Twister2 runner, we hope to continue
>>>>>>> developing the runner in the future to add both streaming capability and
>>>>>>> develop a portable runner as well. The team behind Twister2 is working
>>>>>>> towards the goal to get the project into Apache Incubator in the near
>>>>>>> future (Hopefully to submit the proposal in the next couple of months).
>>>>>>> >>
>>>>>>> >> Best Regards,
>>>>>>> >> Pulasthi
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw <
>>>>>>> [email protected]> wrote:
>>>>>>> >>>
>>>>>>> >>> I think we will get to a point where it makes sense for runners
>>>>>>> to
>>>>>>> >>> live in their own repositories, with their own release cadence,
>>>>>>> but
>>>>>>> >>> we're not at that point yet. One prerequisite is a stable
>>>>>>> API--we're
>>>>>>> >>> closing in on that with the portability protos, but many (java)
>>>>>>> >>> runners actually share the common runner core libraries and that
>>>>>>> is
>>>>>>> >>> even less set in stone.
>>>>>>> >>>
>>>>>>> >>> On the other hand, taking responsibility for maintaining all
>>>>>>> runners
>>>>>>> >>> is not a tenable or scalable position for the Beam project. If a
>>>>>>> >>> runner is merged, it should be understood that it can be
>>>>>>> "un-merged"
>>>>>>> >>> if it causes a maintenance burden. A completely separate
>>>>>>> >>> project/repository makes this less messy.
>>>>>>> >>>
>>>>>>> >>> On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles <[email protected]>
>>>>>>> wrote:
>>>>>>> >>> >
>>>>>>> >>> > I agree with both of you, mostly :-)
>>>>>>> >>> >
>>>>>>> >>> > The monorepo approach doesn't work/scale well for shipped
>>>>>>> libraries (name a Google library that silently just works and never 
>>>>>>> causes
>>>>>>> any dependency problems) and the pain we feel has been constant and
>>>>>>> increasing, but I don't think we are at the breaking point.
>>>>>>> >>> >
>>>>>>> >>> > But Google's big monorepo [1] demonstrates similar benefits to
>>>>>>> what Kyle describes. In the early stages the benefit of not having to 
>>>>>>> think
>>>>>>> too hard about build/test infra and share it everywhere is a big help, 
>>>>>>> and
>>>>>>> it scales well. Eventually, shipping test utility libraries and 
>>>>>>> compliance
>>>>>>> suites can be equivalent. And to your point - it is very helpful for 
>>>>>>> users
>>>>>>> to know that they can use CassandraIO with the other Beam artifacts. 
>>>>>>> This
>>>>>>> is why Google requires the whole big repo to depend on a single version 
>>>>>>> of
>>>>>>> any externally-controlled artifact. But, yes, as a consequence it is
>>>>>>> preposterously difficult to stay up to date, since literally anything 
>>>>>>> can
>>>>>>> block progress. You need a unified escalation chain for that policy to 
>>>>>>> make
>>>>>>> sense. It is the definition of a healthy Apache project to *not* have 
>>>>>>> that
>>>>>>> (PMC is different).
>>>>>>> >>> >
>>>>>>> >>> > Independent dependencies, independent git histories, and
>>>>>>> independent release cadence/process are all separate discussions.
>>>>>>> >>> >
>>>>>>> >>> > It is a broader question than this particular contribution, so
>>>>>>> let's merge this runner before changing our whole way of doing things 
>>>>>>> :-)
>>>>>>> >>> >
>>>>>>> >>> > Kenn
>>>>>>> >>> >
>>>>>>> >>> > [1]
>>>>>>> https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext
>>>>>>> (really quite a balanced analysis)
>>>>>>> >>> >
>>>>>>> >>> > On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver <
>>>>>>> [email protected]> wrote:
>>>>>>> >>> >>
>>>>>>> >>> >> > Should runners, current and future, be in the same
>>>>>>> repository as Beam
>>>>>>> >>> >> > core?
>>>>>>> >>> >>
>>>>>>> >>> >> In the distant past, runners lived in their own repositories,
>>>>>>> and then were donated to Beam. But Beam's current uber-repo setup 
>>>>>>> allows a
>>>>>>> lot of convenience. For example, a ton of code (including core
>>>>>>> functionality and tests) is shared directly between runners, which is
>>>>>>> useful for keeping runners up to date and ensuring consistent behavior
>>>>>>> between them (in other words, maintainable and reliable).
>>>>>>> >>> >>
>>>>>>> >>> >> Generally, it is up to the authors of a particular Beam
>>>>>>> related project/subproject to decide whether to host their code in Beam 
>>>>>>> or
>>>>>>> in a different repo, and up to the community to decide whether to take 
>>>>>>> on
>>>>>>> the donation, as discussed in previous threads on the Twister2 runner. 
>>>>>>> In
>>>>>>> this case, it seems there is agreement between the Twister2 runner 
>>>>>>> authors
>>>>>>> and the community that the runner can be hosted in Beam proper.
>>>>>>> >>> >>
>>>>>>> >>> >> There are examples of successful independent Beam projects,
>>>>>>> such as Spotify's Scio, but having an independent project with its own
>>>>>>> releases requires a lot of dedicated resources, and the bar for entry 
>>>>>>> for
>>>>>>> extending Beam should not be that high. All that's required of 
>>>>>>> subproject
>>>>>>> authors is that they keep the subproject in step with Beam. If they 
>>>>>>> can't
>>>>>>> maintain it any longer, the subproject can be allowed to bitrot without
>>>>>>> getting in anyone's way. On the other hand, I'm not sure of the details
>>>>>>> with Cassandra, but in general, a subproject should not have "the 
>>>>>>> ability
>>>>>>> to block progress" just because it is contained in the Beam uber-repo.
>>>>>>> >>> >>
>>>>>>> >>> >> tl;dr Having an uber repo generally seems to work for Beam.
>>>>>>> Exceptions are few enough to be handled on a case-by-case basis.
>>>>>>> >>> >>
>>>>>>> >>> >> On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold <
>>>>>>> [email protected]> wrote:
>>>>>>> >>> >>>
>>>>>>> >>> >>> Generic question without commenting on Twister2 specifically:
>>>>>>> >>> >>>
>>>>>>> >>> >>> Should runners, current and future, be in the same
>>>>>>> repository as Beam
>>>>>>> >>> >>> core? Can or should they be completely separate products
>>>>>>> with their
>>>>>>> >>> >>> own release cycles?
>>>>>>> >>> >>>
>>>>>>> >>> >>> Generally, loose coupling leads to more maintainable,
>>>>>>> reliable
>>>>>>> >>> >>> projects. Specifically, Cassandra is holding back some other
>>>>>>> changes
>>>>>>> >>> >>> in Beam and I really wish it didn't have the ability to block
>>>>>>> >>> >>> progress. The more different runners we have in core, the
>>>>>>> worse this
>>>>>>> >>> >>> problem is likely to become.
>>>>>>> >>> >>>
>>>>>>> >>> >>>
>>>>>>> >>> >>> On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe
>>>>>>> >>> >>> <[email protected]> wrote:
>>>>>>> >>> >>> >
>>>>>>> >>> >>> > Hi
>>>>>>> >>> >>> >
>>>>>>> >>> >>> > I believe the pull request is pretty complete now with the
>>>>>>> help of Ismaël. Kenn, would you be able to take a look at it and suggest
>>>>>>> any changes if needed?. The build checks and validations tests are 
>>>>>>> passing
>>>>>>> at the moment.  I will start working on the documentation that you
>>>>>>> mentioned in an earlier email separately.
>>>>>>> >>> >>> >
>>>>>>> >>> >>> > Best Regards,
>>>>>>> >>> >>> > Pulasthi
>>>>>>> >>> >>> >
>>>>>>> >>> >>> >
>>>>>>> >>> >>> >
>>>>>>> >>> >>> >
>>>>>>> >>> >>> >
>>>>>>> >>> >>> > On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun
>>>>>>> Wickramasinghe <[email protected]> wrote:
>>>>>>> >>> >>> >>
>>>>>>> >>> >>> >> Hi All,
>>>>>>> >>> >>> >>
>>>>>>> >>> >>> >> I have created the initial pull request [1] to contribute
>>>>>>> the Twister2 Beam runner to the Apache Beam codebase. More information 
>>>>>>> on
>>>>>>> Twister2 can be found here[2] and the Twister2 codebase is available
>>>>>>> here[3]. At the moment only batch mode is supported in the runner, but 
>>>>>>> we
>>>>>>> are planning to add stream support and implement a portable runner for
>>>>>>> Twister2 in the near future.
>>>>>>> >>> >>> >>
>>>>>>> >>> >>> >> As Kenn pointed out in an earlier email it would be great
>>>>>>> to have inputs from the community regarding this contribution since it 
>>>>>>> is a
>>>>>>> sizable one. I am sure there are many improvements that can be done in 
>>>>>>> the
>>>>>>> contributed codebase with input from the community.
>>>>>>> >>> >>> >>
>>>>>>> >>> >>> >> [1] https://github.com/apache/beam/pull/10888
>>>>>>> >>> >>> >> [2] https://twister2.org/
>>>>>>> >>> >>> >> [3] https://github.com/DSC-SPIDAL/twister2
>>>>>>> >>> >>> >>
>>>>>>> >>> >>> >> Best Regards,
>>>>>>> >>> >>> >> Pulasthi
>>>>>>> >>> >>> >> --
>>>>>>> >>> >>> >> Pulasthi S. Wickramasinghe
>>>>>>> >>> >>> >> PhD Candidate  | Research Assistant
>>>>>>> >>> >>> >> School of Informatics and Computing | Digital Science
>>>>>>> Center
>>>>>>> >>> >>> >> Indiana University, Bloomington
>>>>>>> >>> >>> >> cell: 224-386-9035 <(224)%20386-9035>
>>>>>>> >>> >>> >
>>>>>>> >>> >>> >
>>>>>>> >>> >>> >
>>>>>>> >>> >>> > --
>>>>>>> >>> >>> > Pulasthi S. Wickramasinghe
>>>>>>> >>> >>> > PhD Candidate  | Research Assistant
>>>>>>> >>> >>> > School of Informatics and Computing | Digital Science
>>>>>>> Center
>>>>>>> >>> >>> > Indiana University, Bloomington
>>>>>>> >>> >>> > cell: 224-386-9035 <(224)%20386-9035>
>>>>>>> >>> >>>
>>>>>>> >>> >>>
>>>>>>> >>> >>>
>>>>>>> >>> >>> --
>>>>>>> >>> >>> Elliotte Rusty Harold
>>>>>>> >>> >>> [email protected]
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> --
>>>>>>> >> Pulasthi S. Wickramasinghe
>>>>>>> >> PhD Candidate  | Research Assistant
>>>>>>> >> School of Informatics and Computing | Digital Science Center
>>>>>>> >> Indiana University, Bloomington
>>>>>>> >> cell: 224-386-9035 <(224)%20386-9035>
>>>>>>>
>>>>>>

-- 
Pulasthi S. Wickramasinghe
PhD Candidate  | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
cell: 224-386-9035

Reply via email to