I do support all the efforts to get Dataflow, Flink, and Spark to 3 (Fn
API). But I disagree with it as a requirement; the whole point of
ptransforms with URNs is that if the runner can figure out how to execute
it according to semantics, then it is fine. A runner meets (1) and (2) but
can only run certain subset of DoFns is allowed by design (whether the
subset is based on language, state/timer support, etc).

Kenn

On Tue, Mar 10, 2020 at 9:45 AM Luke Cwik <[email protected]> wrote:

> I would like to move away from having runners access APIs that are related
> to pipeline construction and other internal SDK APIs and I would like for
> SDKs to not inspect internal runner APIs. This would enable the community
> to improve each independently without needing to fix the world all the time
> and would enable the community to run a cluster that supports multiple Beam
> versions at the same time and would also allow for the cluster to be
> updated independently of the pipelines it runs.
>
> As a community, I believe we need to achieve 1, 2 and 3. Outside of the
> Apache Beam repo, anyone can do whatever they want but there should be no
> compatibility guarantees.
>
> 4 and 5 are extensions that enable a richer set of pipelines to run and
> are optional like many other parts such as if a runner supports metrics
> aggregation or dynamic work rebalancing.
>
> On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles <[email protected]> wrote:
>
>> There are a lot of different meanings to "portable runner". Here are some:
>>
>> (1) A runner that accepts a pipeline proto and either runs it or says it
>> cannot run it
>> (2) A runner that accepts jobs via the job management APIs
>> (3) A runner that executes UDFs via the Fn API
>> (4) A runner that can execute multiple languages
>> (5) A runner that can run cross-language transforms aka multiple
>> languages in the same pipeline
>>
>> I think (1) is a very good bar, and (2) is a nice addition on top of
>> that. Then we have a unified way to submit pipelines and understand their
>> status.
>>
>> I think (3) is optional - a runner can run things however it likes,
>> including with native implementations. And then (4) and (5) as well are
>> just levels of feature capabilities.
>>
>> Kenn
>>
>> On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik <[email protected]> wrote:
>>
>>> +1
>>>
>>> On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel <[email protected]>
>>> wrote:
>>>
>>>> One last thing, for any runner after this one... wouldn't it be a good
>>>> acceptance criteria to only accept portable implementations anymore?
>>>>
>>>>  _/
>>>> _/ Alex Van Boxel
>>>>
>>>>
>>>> On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía <[email protected]> wrote:
>>>>
>>>>> Good points Kenn. I think we mostly agree on what has been discussed
>>>>> in this
>>>>> thread the pros/cons of having runners on our repository, but this is
>>>>> probably
>>>>> not the best moment in time to change any policy in that aspect.
>>>>>
>>>>> So if nobody objects I think we can proceed. I am OOO this week so
>>>>> with less
>>>>> time to continue with the code review, but I will be back to finish
>>>>> the review
>>>>> and hopefully finally get this merged with Pulasthi next week (sorry
>>>>> for the
>>>>> delay).
>>>>>
>>>>> > (don't wait for me on code review - if Ismaël said it is good, then
>>>>> it is
>>>>> > good.)
>>>>>
>>>>> Thanks for your confidence. Twister2 runners looks good so far, but I
>>>>> will
>>>>> confirm 100% next week :) In the meantime if someone has some extra
>>>>> cycles to
>>>>> take a look extra feedback is always welcome.
>>>>>
>>>>> On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles <[email protected]>
>>>>> wrote:
>>>>> >
>>>>> > I haven't heard anyone suggest that we need a vote. I haven't heard
>>>>> anyone object to this being merged to master. Some time ago, we mostly
>>>>> decided to favor master instead of branches, because it is so much 
>>>>> smoother
>>>>> for contributors and users.
>>>>> >
>>>>> > So I am poking this thread one last time and otherwise I would
>>>>> consider it consensus that once code review is done the runner is a part 
>>>>> of
>>>>> Beam (experimental!).
>>>>> >
>>>>> > (don't wait for me on code review - if Ismaël said it is good, then
>>>>> it is good.)
>>>>> >
>>>>> > Kenn
>>>>> >
>>>>> > On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe <
>>>>> [email protected]> wrote:
>>>>> >>
>>>>> >> I understand that the discussion is on a more broad level than the
>>>>> Twister2 runner. From my experience developing the runner the main
>>>>> advantage of being inside the beam project was the easy access to the wide
>>>>> range of tests and other core/utility code as Kyle pointed out. Unmerging
>>>>> runners that are not properly maintained and updated would be the most
>>>>> logical path to follow since the internals of the runners are only well
>>>>> understood by developers of that particular project. It would be
>>>>> unreasonable to expect the Beam community to maintain them. And since the
>>>>> runners do not alter the core API's I assume they would be easy to unmerge
>>>>> if the need arises.
>>>>> >>
>>>>> >> Talking specifically about Twister2 runner, we hope to continue
>>>>> developing the runner in the future to add both streaming capability and
>>>>> develop a portable runner as well. The team behind Twister2 is working
>>>>> towards the goal to get the project into Apache Incubator in the near
>>>>> future (Hopefully to submit the proposal in the next couple of months).
>>>>> >>
>>>>> >> Best Regards,
>>>>> >> Pulasthi
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw <[email protected]>
>>>>> wrote:
>>>>> >>>
>>>>> >>> I think we will get to a point where it makes sense for runners to
>>>>> >>> live in their own repositories, with their own release cadence, but
>>>>> >>> we're not at that point yet. One prerequisite is a stable
>>>>> API--we're
>>>>> >>> closing in on that with the portability protos, but many (java)
>>>>> >>> runners actually share the common runner core libraries and that is
>>>>> >>> even less set in stone.
>>>>> >>>
>>>>> >>> On the other hand, taking responsibility for maintaining all
>>>>> runners
>>>>> >>> is not a tenable or scalable position for the Beam project. If a
>>>>> >>> runner is merged, it should be understood that it can be
>>>>> "un-merged"
>>>>> >>> if it causes a maintenance burden. A completely separate
>>>>> >>> project/repository makes this less messy.
>>>>> >>>
>>>>> >>> On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles <[email protected]>
>>>>> wrote:
>>>>> >>> >
>>>>> >>> > I agree with both of you, mostly :-)
>>>>> >>> >
>>>>> >>> > The monorepo approach doesn't work/scale well for shipped
>>>>> libraries (name a Google library that silently just works and never causes
>>>>> any dependency problems) and the pain we feel has been constant and
>>>>> increasing, but I don't think we are at the breaking point.
>>>>> >>> >
>>>>> >>> > But Google's big monorepo [1] demonstrates similar benefits to
>>>>> what Kyle describes. In the early stages the benefit of not having to 
>>>>> think
>>>>> too hard about build/test infra and share it everywhere is a big help, and
>>>>> it scales well. Eventually, shipping test utility libraries and compliance
>>>>> suites can be equivalent. And to your point - it is very helpful for users
>>>>> to know that they can use CassandraIO with the other Beam artifacts. This
>>>>> is why Google requires the whole big repo to depend on a single version of
>>>>> any externally-controlled artifact. But, yes, as a consequence it is
>>>>> preposterously difficult to stay up to date, since literally anything can
>>>>> block progress. You need a unified escalation chain for that policy to 
>>>>> make
>>>>> sense. It is the definition of a healthy Apache project to *not* have that
>>>>> (PMC is different).
>>>>> >>> >
>>>>> >>> > Independent dependencies, independent git histories, and
>>>>> independent release cadence/process are all separate discussions.
>>>>> >>> >
>>>>> >>> > It is a broader question than this particular contribution, so
>>>>> let's merge this runner before changing our whole way of doing things :-)
>>>>> >>> >
>>>>> >>> > Kenn
>>>>> >>> >
>>>>> >>> > [1]
>>>>> https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext
>>>>> (really quite a balanced analysis)
>>>>> >>> >
>>>>> >>> > On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver <[email protected]>
>>>>> wrote:
>>>>> >>> >>
>>>>> >>> >> > Should runners, current and future, be in the same repository
>>>>> as Beam
>>>>> >>> >> > core?
>>>>> >>> >>
>>>>> >>> >> In the distant past, runners lived in their own repositories,
>>>>> and then were donated to Beam. But Beam's current uber-repo setup allows a
>>>>> lot of convenience. For example, a ton of code (including core
>>>>> functionality and tests) is shared directly between runners, which is
>>>>> useful for keeping runners up to date and ensuring consistent behavior
>>>>> between them (in other words, maintainable and reliable).
>>>>> >>> >>
>>>>> >>> >> Generally, it is up to the authors of a particular Beam related
>>>>> project/subproject to decide whether to host their code in Beam or in a
>>>>> different repo, and up to the community to decide whether to take on the
>>>>> donation, as discussed in previous threads on the Twister2 runner. In this
>>>>> case, it seems there is agreement between the Twister2 runner authors and
>>>>> the community that the runner can be hosted in Beam proper.
>>>>> >>> >>
>>>>> >>> >> There are examples of successful independent Beam projects,
>>>>> such as Spotify's Scio, but having an independent project with its own
>>>>> releases requires a lot of dedicated resources, and the bar for entry for
>>>>> extending Beam should not be that high. All that's required of subproject
>>>>> authors is that they keep the subproject in step with Beam. If they can't
>>>>> maintain it any longer, the subproject can be allowed to bitrot without
>>>>> getting in anyone's way. On the other hand, I'm not sure of the details
>>>>> with Cassandra, but in general, a subproject should not have "the ability
>>>>> to block progress" just because it is contained in the Beam uber-repo.
>>>>> >>> >>
>>>>> >>> >> tl;dr Having an uber repo generally seems to work for Beam.
>>>>> Exceptions are few enough to be handled on a case-by-case basis.
>>>>> >>> >>
>>>>> >>> >> On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold <
>>>>> [email protected]> wrote:
>>>>> >>> >>>
>>>>> >>> >>> Generic question without commenting on Twister2 specifically:
>>>>> >>> >>>
>>>>> >>> >>> Should runners, current and future, be in the same repository
>>>>> as Beam
>>>>> >>> >>> core? Can or should they be completely separate products with
>>>>> their
>>>>> >>> >>> own release cycles?
>>>>> >>> >>>
>>>>> >>> >>> Generally, loose coupling leads to more maintainable, reliable
>>>>> >>> >>> projects. Specifically, Cassandra is holding back some other
>>>>> changes
>>>>> >>> >>> in Beam and I really wish it didn't have the ability to block
>>>>> >>> >>> progress. The more different runners we have in core, the
>>>>> worse this
>>>>> >>> >>> problem is likely to become.
>>>>> >>> >>>
>>>>> >>> >>>
>>>>> >>> >>> On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe
>>>>> >>> >>> <[email protected]> wrote:
>>>>> >>> >>> >
>>>>> >>> >>> > Hi
>>>>> >>> >>> >
>>>>> >>> >>> > I believe the pull request is pretty complete now with the
>>>>> help of Ismaël. Kenn, would you be able to take a look at it and suggest
>>>>> any changes if needed?. The build checks and validations tests are passing
>>>>> at the moment.  I will start working on the documentation that you
>>>>> mentioned in an earlier email separately.
>>>>> >>> >>> >
>>>>> >>> >>> > Best Regards,
>>>>> >>> >>> > Pulasthi
>>>>> >>> >>> >
>>>>> >>> >>> >
>>>>> >>> >>> >
>>>>> >>> >>> >
>>>>> >>> >>> >
>>>>> >>> >>> > On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun
>>>>> Wickramasinghe <[email protected]> wrote:
>>>>> >>> >>> >>
>>>>> >>> >>> >> Hi All,
>>>>> >>> >>> >>
>>>>> >>> >>> >> I have created the initial pull request [1] to contribute
>>>>> the Twister2 Beam runner to the Apache Beam codebase. More information on
>>>>> Twister2 can be found here[2] and the Twister2 codebase is available
>>>>> here[3]. At the moment only batch mode is supported in the runner, but we
>>>>> are planning to add stream support and implement a portable runner for
>>>>> Twister2 in the near future.
>>>>> >>> >>> >>
>>>>> >>> >>> >> As Kenn pointed out in an earlier email it would be great
>>>>> to have inputs from the community regarding this contribution since it is 
>>>>> a
>>>>> sizable one. I am sure there are many improvements that can be done in the
>>>>> contributed codebase with input from the community.
>>>>> >>> >>> >>
>>>>> >>> >>> >> [1] https://github.com/apache/beam/pull/10888
>>>>> >>> >>> >> [2] https://twister2.org/
>>>>> >>> >>> >> [3] https://github.com/DSC-SPIDAL/twister2
>>>>> >>> >>> >>
>>>>> >>> >>> >> Best Regards,
>>>>> >>> >>> >> Pulasthi
>>>>> >>> >>> >> --
>>>>> >>> >>> >> Pulasthi S. Wickramasinghe
>>>>> >>> >>> >> PhD Candidate  | Research Assistant
>>>>> >>> >>> >> School of Informatics and Computing | Digital Science Center
>>>>> >>> >>> >> Indiana University, Bloomington
>>>>> >>> >>> >> cell: 224-386-9035 <(224)%20386-9035>
>>>>> >>> >>> >
>>>>> >>> >>> >
>>>>> >>> >>> >
>>>>> >>> >>> > --
>>>>> >>> >>> > Pulasthi S. Wickramasinghe
>>>>> >>> >>> > PhD Candidate  | Research Assistant
>>>>> >>> >>> > School of Informatics and Computing | Digital Science Center
>>>>> >>> >>> > Indiana University, Bloomington
>>>>> >>> >>> > cell: 224-386-9035 <(224)%20386-9035>
>>>>> >>> >>>
>>>>> >>> >>>
>>>>> >>> >>>
>>>>> >>> >>> --
>>>>> >>> >>> Elliotte Rusty Harold
>>>>> >>> >>> [email protected]
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> --
>>>>> >> Pulasthi S. Wickramasinghe
>>>>> >> PhD Candidate  | Research Assistant
>>>>> >> School of Informatics and Computing | Digital Science Center
>>>>> >> Indiana University, Bloomington
>>>>> >> cell: 224-386-9035 <(224)%20386-9035>
>>>>>
>>>>

Reply via email to